Regular expressions, or regex, can be very difficult to understand, especially for more complex patterns. I recently encountered a difficult regex problem, which required getting my hands dirty with regex. The regex work led to both understanding the issue as well as coming up with a solution.
I have a string column in a table which contains LLDD DLL at the end, where L refers to a letter and D refers to a digit. This is a fairly simple pattern which can be captured by the following regex -
val pattern_LLDDDLL = "[a-zA-Z]{2}[0-9]{2}\s[0-9][a-zA-Z]{2}$"
But as it turns out, the pattern could also be L LDDDL L, LL DD DLL, LLD DDLL, etc. due to user input error.
we are now looking for zero or 1 space patterns at any location. First, I had to check which of these inaccuracies is more common. I wrote the following transform udf to collect and analyze all the patterns at the end of the string. I limited the patterns upto 3 spaces only.
def strTransform = udf( (str: String) =>
if(str.takeRight(8).startsWith(" ") && str.takeRight(8).count(_ != " ") == 7)
//no space
str.takeRight(7)
else if(str.takeRight(9).startsWith(" ") && str.takeRight(8).count(_ != " ") == 7)
//1 space
str.takeRight(8)
else if(str.takeRight(10).startsWith(" ") && str.takeRight(8).count(_ != " ") == 7)
//2 spaces
str.takeRight(9)
else if(str.takeRight(11).startsWith(" ") && str.takeRight(8).count(_ != " ") == 7)
//3 spaces
str.takeRight(10)
else
""
)
Applying this transformation on my string and analyzing the patterns
originalDf.withColumn("last7", strTransform($"originalColumn"))
.withColumn("last7_wo_space", regexp_replace($"last7", " ", ""))
.filter($"last7_wo_space".rlike(pattern_LLDDDLL))
.withColumn("pattern", regexp_replace($"last7_wo_space", "[A-Za-z]", "L"))
.withColumn("pattern", regexp_replace($"pattern", "[0-9]", "D"))
Now I could count the number of times an inaccurate pattern appears in the data and focus on the most commonly occuring errors. This is the best I could come up with to analyze and fix the patterns at a time with no AI help. I am sure better ways must exist.