Regex and Beyond

Dec. 15, 2022

Regular expressions, or regex, can be very difficult to understand, especially for more complex patterns. I recently encountered a difficult regex problem, which required getting my hands dirty with regex. The regex work led to both understanding the issue as well as coming up with a solution.

I have a string column in a table which contains LLDD DLL at the end, where L refers to a letter and D refers to a digit. This is a fairly simple pattern which can be captured by the following regex -

val pattern_LLDDDLL = "[a-zA-Z]{2}[0-9]{2}\s[0-9][a-zA-Z]{2}$"

But as it turns out, the pattern could also be L LDDDL L, LL DD DLL, LLD DDLL, etc. due to user input error.

we are now looking for zero or 1 space patterns at any location. First, I had to check which of these inaccuracies is more common. I wrote the following transform udf to collect and analyze all the patterns at the end of the string. I limited the patterns upto 3 spaces only.

def strTransform = udf( (str: String) =>

    if(str.takeRight(8).startsWith(" ") && str.takeRight(8).count(_ != " ") == 7)
        //no space
        str.takeRight(7)
    else if(str.takeRight(9).startsWith(" ") && str.takeRight(8).count(_ != " ") == 7)
        //1 space
        str.takeRight(8)
    else if(str.takeRight(10).startsWith(" ") && str.takeRight(8).count(_ != " ") == 7)
        //2 spaces
        str.takeRight(9)
    else if(str.takeRight(11).startsWith(" ") && str.takeRight(8).count(_ != " ") == 7)
        //3 spaces
        str.takeRight(10)
    else
        ""
)

Applying this transformation on my string and analyzing the patterns

originalDf.withColumn("last7", strTransform($"originalColumn"))
.withColumn("last7_wo_space", regexp_replace($"last7", " ", ""))
.filter($"last7_wo_space".rlike(pattern_LLDDDLL))
.withColumn("pattern", regexp_replace($"last7_wo_space", "[A-Za-z]", "L"))
.withColumn("pattern", regexp_replace($"pattern", "[0-9]", "D"))

Now I could count the number of times an inaccurate pattern appears in the data and focus on the most commonly occuring errors. This is the best I could come up with to analyze and fix the patterns at a time with no AI help. I am sure better ways must exist.