Its Good To Be Regular!

It’s good to be a regular expression that is.  (What did you think I meant? 😉 )  Anyway, last week we talked about using patterns when defining domain rules.  While pattern matching can solve some problems, it cannot solve all problems.  For example, suppose you want a 4-character string that must begin with a letter between A-F.  Pattern matching may help you look for letters in a string, but it cannot limit which characters are acceptable.  (An ‘extreme’ case you may not have thought about since last week is that a character can be a numeric digit or symbol, but a number in a pattern cannot be a non-numeric character.)  Another good example is when I want the user to enter a hex code for a color.  Hex codes range from 00 to FF.

The way to define a domain rule for these situations is not by using a pattern.  Rather, a regular expression lets you control the specific characters allowed at every position in a string and can be extremely flexible.  First, let’s look at some of the rules as they might apply to a specific example.

Let’s go back to that first case where I want a 4-character string that begins with a letter between A-F.  I can begin the regular expression with the string: [A-F].  This would allow the string to begin with any character from ‘A’ through ‘F’.  However, the string definition of regular expressions would tell me that I really need to use [A-Fa-f] so that the user could enter either upper case or lower case letters.  While that is true for applications development using regular expressions to validate input, DQS treats the comparisons as case insensitive and so you can use either [A-F], [a-f], or [A-Fa-f].

Note that the text within the closed bracket represents just one character position even through several characters may appear.  If I wanted to validate against a non-sequential set of characters such as in [ABEFMPST], that would be a valid way to insure that the character of the domain value is one of these eight letters.

If I want to allow most letters in the alphabet with the exception of only a few, I could specify the characters not allowed in the character position using an expression like [^IOQ].  This expression would allow any character except the letters ‘I’, ‘O’, or ‘Q’.  By itself, this would also allow numeric digits so I may want to use [^IOQ0-9] instead.

Everything I talked about so far only applies to the first character.  In my example, I want the remaining three characters to be numbers.  I could change my regular expression to: [A-F][0-9][0-9][0-9].   However because the second through fourth characters are defined the same way, I can use the following expression to indicate that I want to use the same character definition for the next three characters.:  [A-F][0-9]{3}.  The number 3 in the curly brackets indicates that the previous character expression should be repeated for three characters.

Interestingly enough, this regular expression would also look for four consecutive characters in a larger string that began with a letter from A-F and was then followed by 3 digits.  In fact, it would declare the following value to be valid:  45 Main St, Ste D104.  You see, by itself, the regular expression is available to match characters anywhere within a string.  If I want to force the expression to match string values that begin with a specific sequence, I must start the string with the caret character as in: ^[A-F][0-9]{3}.  With this string as the regular expression, the above address would not be considered a valid match.

What if I don’t know how many time a character definition needs to be repeated?  I could use any of the following:

^[A-F][0-9]*    This allows for zero or more numbers after the letter

^[A-F][0-9]+    This allows for one or more numbers after the letter

^[A-F][0-9]?    This allows for zero or one numbers after the letter

^[A-F][0-9]{3,}   This allows for at least 3 numbers after the letter.

^[A-F][0-9]{3,6}   This allows for at least 3 but no more than 6 numbers after the letter.

So I might think that I could use ^[A-F][0-9]{3,3} to insure that valid values began with a letter followed by three and only three numbers.  Unfortunately, it does not work like this.  Rather, there is another character that I can add to the end of an expression that basically says that the string must end with the defined expression.  That character is the dollar sign.  Therefore, I could use ^[A-F][0-9]{3}$ to insure that the string only has four characters and that those characters begin with a letter A-F followed by three digits.

Let me say that what I have covered here is just the tip of the iceberg when it comes to regular expression capabilities.  There is much more that I can do with expressions.  However, my emphasis is to cover BI related topics such as PowerPivot, SSAS and DQS, not to go off on a multi-week tangent about regular expressions.  Therefore, I’m going to give you a few references to let you explore the richness of regular expressions on your own.

As I said, this has just been but a brief introduction into the world of regular expressions. There are many sites that will teach you how to build regular expressions.  A good place to start might be http://www.regular-expressions.info/tutorial.html.

In closing, you may want to download a free tool that will help you discover how regular expressions work.  The tool name is EditPad Pro 7 and can be downloaded from http:/www.editpadpro.com/download.html.

C’ya next time when I take a look at matching in DQS projects.

Mike