I’m a little short of time this week because I will be in Jacksonville by the time you read this presenting at the Jacksonville SQL Saturday, but I wanted to give you something to chew on this weekend. So I found this interesting feature of Excel which appears to have been available since Excel 2007, but is probably know by very few of you.
Have you ever had a list of data that you needed to parse? For example, suppose you have a list of email addresses as shown in the following figure from which you wanted to extract the person’s first name.
I suppose you could begin in the first row and enter ‘Mike’, then enter ‘Bill’ in the second row, ‘William’ in the third row, etc. to enter the first name of each person manually. You might also use a combination of Excel functions to find the period that separates the first and last names and then take all of the characters to the left of the period. To do that, you might use a formula like the following:
= LEFT(A2, SEARCH(“.”,A2)-1)
But what you may not know is that there is an easier way.
First, make sure that your data is a table with or without a header. Then simply enter the first name of the person in the first row; ‘Mike’ and then begin to enter the first name of the person in the second row, “B”.
Notice that in ghost print (light grey), you can see that Excel has already figured out the first names of everyone in the list. Simply press the ENTER key to complete your entry (do not finish typing Bill’s name). Excel fills in the rest of the rows with the person’s first name.
Cool? What if we add another column for last name
This appears to work as well. Excel appears to be able to determine which delimiters surrounded the portion of the name we want to extract from the email address and is able to reproduce that across the rest of the rows.
Similarly, we can parse the rest of the email address to get the following columns.
That’s great, but I had a very specific format to the username. It was:
<<first name>>.<<last name>>
As you can see, the first name was separated from the last name by a period. What would happen if some of the names were separated by an underscore instead of a period? Of course I had to try this, but the initial results were not promising as you can see in the following figure.
I changed two of the addresses further down the list to use underscores instead of periods and when Excel tried to parse the first name, it picked up everything to the left of the first period which was found just before the top level domain portion of the address. My first thought was that this method may not suit my needs. So I tried a case in which the first two names used underscores between the first and last name as in:
<<First Name>>_<<Last Name>>
I was surprised that this worked for both names separated by periods and names separated by underscores (Jason) and periods (William, Sara, Fran, Annie, and Lisa).
I then tried the following case in which the first two names were separate first by an underscore and then by a period.
As you can see, the first name for Sam did not stop at the period, but continued on to an underscore I added between the domain server and the top level domain. (Ok, don’t tell me in a comment that putting an underscore before the top level domain is not a valid format for an email address. I know that, but was merely testing different patterns for parsing.)
Now I was confused. It appears that the pattern is based on the first row defining the separators which will have a preference for the separate used in that first row. For example, if I change the separator on this last example from an underscore to a period, Sam is parsed correctly, but Kyle is not.
This confirms to me that the separator in the first row defines a preference no matter where it is found in subsequent strings. However, if that separator is not found in the subsequent string, the first non alpha character will be used to parse the string as in the case of Jill above.
Of course you can use this technique to parse other column information. For example, I used it to parse address information as in the following figure.
However, I did have a problem pulling out the state name as you can see. I have not yet determined why yet other than the space character appears in many other places in the address and Excel may not be able to develop an appropriate formula to do this. On the other hand, perhaps someone out there reading this can come up with an elegant solution and share it with the rest of us.
In the meantime, hope you were able to make it to Jacksonville. I’ll c’ya next week.