Its Good To Be Regular!

It’s good to be a regular expression that is.¬† (What did you think I meant? ūüėČ )¬† Anyway, last week we talked about using patterns when defining domain rules.¬† While pattern matching can solve some problems, it cannot solve all problems.¬† For example, suppose you want a 4-character string that must begin with a letter between A-F.¬† Pattern matching may help you look for letters in a string, but it cannot limit which characters are acceptable.¬† (An ‘extreme’ case you may not have thought about since last week is that a character can be a numeric digit or symbol, but a number in a pattern cannot be a non-numeric character.)¬† Another good example is when I want the user to enter a hex code for a color.¬† Hex codes range from 00 to FF.

The way to define a domain rule for these situations is not by using a pattern.¬† Rather, a regular expression lets you control the specific characters allowed at every position in a string and can be extremely flexible.¬† First, let’s look at some of the rules¬†as they might apply to a¬†specific example.

Let’s go back to that first case where I want a 4-character string that begins with a letter between A-F.¬† I can begin the regular expression with the string: [A-F].¬† This would allow the string to begin with any character from ‘A’ through ‘F’.¬† However, the string definition of regular expressions would tell me that I really need to use [A-Fa-f] so that the user could enter either upper case or lower case letters.¬† While that is true for applications development using regular expressions to validate input, DQS treats the comparisons as case insensitive and so you can use either [A-F], [a-f], or [A-Fa-f].

Note that the text within the closed bracket represents just one character position even through several characters may appear.  If I wanted to validate against a non-sequential set of characters such as in [ABEFMPST], that would be a valid way to insure that the character of the domain value is one of these eight letters.

If¬†I want to allow most letters in the alphabet with the exception of only a few, I could specify the characters not allowed in the character position using an expression like [^IOQ].¬† This expression would allow any character except the letters ‘I’, ‘O’, or ‘Q’.¬† By itself, this would also allow numeric digits so¬†I may want to use [^IOQ0-9] instead.

Everything I talked about so far only applies to the first character.  In my example, I want the remaining three characters to be numbers.  I could change my regular expression to: [A-F][0-9][0-9][0-9].   However because the second through fourth characters are defined the same way, I can use the following expression to indicate that I want to use the same character definition for the next three characters.:  [A-F][0-9]{3}.  The number 3 in the curly brackets indicates that the previous character expression should be repeated for three characters.

Interestingly enough, this regular expression would also look for four consecutive characters in a larger string that began with a letter from A-F and was then followed by 3 digits.  In fact, it would declare the following value to be valid:  45 Main St, Ste D104.  You see, by itself, the regular expression is available to match characters anywhere within a string.  If I want to force the expression to match string values that begin with a specific sequence, I must start the string with the caret character as in: ^[A-F][0-9]{3}.  With this string as the regular expression, the above address would not be considered a valid match.

What if¬†I don’t know how many time a character definition needs to be repeated?¬† I could use any of the following:

^[A-F][0-9]*    This allows for zero or more numbers after the letter

^[A-F][0-9]+    This allows for one or more numbers after the letter

^[A-F][0-9]?    This allows for zero or one numbers after the letter

^[A-F][0-9]{3,}   This allows for at least 3 numbers after the letter.

^[A-F][0-9]{3,6}   This allows for at least 3 but no more than 6 numbers after the letter.

So I might think that I could use ^[A-F][0-9]{3,3} to insure that valid values began with a letter followed by three and only three numbers.  Unfortunately, it does not work like this.  Rather, there is another character that I can add to the end of an expression that basically says that the string must end with the defined expression.  That character is the dollar sign.  Therefore, I could use ^[A-F][0-9]{3}$ to insure that the string only has four characters and that those characters begin with a letter A-F followed by three digits.

Let me say that what I have covered here is just the tip of the iceberg when it comes to regular expression capabilities.¬† There is much more that¬†I can do with expressions.¬† However, my emphasis is to cover BI related topics such as PowerPivot, SSAS and DQS, not to go off on a multi-week tangent about regular expressions.¬† Therefore, I’m going to give you a few references to let you explore the richness of regular expressions on your own.

As I said, this has just been but a brief introduction into the world of regular expressions. There are many sites that will teach you how to build regular expressions.  A good place to start might be http://www.regular-expressions.info/tutorial.html.

In closing, you may want to download a free tool that will help you discover how regular expressions work.  The tool name is EditPad Pro 7 and can be downloaded from http:/www.editpadpro.com/download.html.

C’ya next time when I take a look at matching in DQS projects.

Mike