How Many People Must You Ask in a Survey?

Have you ever conducted a survey to help make a decision?  Perhaps you helped someone else build a survey to help them make a decision.  I’m sure you have taken surveys.  I seem to get several every week in my email from various companies and organizations.  Then there are the surveys on the back side of your receipts from restaurants and stores.  You might even get surveys in the mail.  I remember when they use to include a nickel in the envelope with the survey to make you feel guilty about keeping the nickel and not filling in the survey.  I guess a nickel doesn’t buy a lot of guilt today.

But how many of the surveys do you think they get back?  Ninety percent?  Seventy five percent? How many people need to respond to a survey in order to make valid predictions?

I guess the answer depends on who you send the survey to and what questions you ask.  If you are a auto manufacturer and you want to evaluate your customer satisfaction with a new model, you do not want to send the survey to all automobile owners.  On the other hand, if you want to find out what features would entice owners of automobiles from other manufacturers to switch and buy one of your vehicles, you may want to exclude owners of your cars.

You see the dilemma?  The type of question should be closely tied to the audience to whom you send the survey.  If you cannot narrow down your audience, perhaps the questions in your survey are too broad and you should consider making two or more surveys to target specific audience groups with question that would be important to them.

How many questions should you put on a survey?  The more questions you include, the less likely someone will take the time to answer them.  For myself, anything more than a half dozen questions and I’m bored and ready to stop taking the survey.  One way you can counterbalance this tendency is to offer greater rewards for completing the survey.  Restaurants often offer free appetizers or deserts or even menu items for completing their surveys.  Stores may offer a certain percent off your next purchase.  Internet surveys have offered everything from cash/gift cards to thumb drives and even iPads.  Would you fill in a 100-question survey for the chance to win a 4 GB thumb drive?  What if they were offer an iPad for 5 lucky survey takers?  You might want to guess at how many people are willing to take that survey.  If you think only 500 people will take the survey, you might be more willing to spend the next half hour completing the survey than if fifty thousand people were to take the survey.  On the other hand, a survey that offers nothing in exchange for my time will probably wind up getting filled in my circular temporary storage bin otherwise known as a trash can.

So you have identified your survey questions and you have a targeted list of people you will be asking to take the survey.  You even have a reward program set up to encourage people to trade their time for a chance at a ‘gift’.  What percent of your target audience do you need to get a response from in order to have a reasonable chance that the survey will represent the total population?  Take the last presidential election as an example.  Did any of the survey takers ask you whom you might be voting for?  Probably not.  In all of the years that I’ve been voting, I’ve never once been asked by a survey taker who I was going to vote for or who I did vote for.  So whom are they asking?  Would it surprise you to know that they can get a pretty good idea of the way the election will turn out by only ask a few thousand people?  Makes me wonder if we just cannot randomly select a few thousand people from around the country to cast their ballots instead of trying to get all of us to drive to the polls and then wait in lines for hours to cast our votes.  If the odds are that we would get the same results, imagine the time we would save.  After all, isn’t that what manufacturers do when they conduct product surveys to determine what goods to make and sell to us?

I was recently involved in a survey with a very well defined maximum population.  The survey was open for several weeks, but I can evaluate the cumulative data as of any day beginning with only a percent or two up to the final 60+% results.  The survey owners were still trying to get people eligible to take the survey to complete the survey before the deadline even though they already had survey results from over 60% of the eligible survey takers.  Was this overkill?  Would a few more percent make a difference in the outcome of the decision?

Over the next couple of days (or weeks if I get busy doing something else), I’m going to pull the data into a PowerPivot model and evaluate the results to some of the questions over time as more and more people take the survey.  I’m curious to see if getting more people to take the survey really made a difference.  I’ll report back to you what I discover in a future Tuesday blog.

Until then, think about it and try to reason out what the results will look like.  C’ya!


I Think I See A Pattern

Last week I mentioned a little about domain rules but other than using a few simple ones, I did not mention a lot of detail.  Well, just for the record, the following list gives you an idea of all of the different Boolean expression types I can build with a domain rule:

  • Length is equal to
  • Length is greater than or equal to
  • Length is less than or equal to
  • Value is equal to
  • Value is not equal to
  • Value contains
  • Values does not contain
  • Value begins with
  • Value ends with
  • Value is numeric
  • Value is date\time
  • Value is in
  • Value is not in
  • Value matches pattern
  • Values does not match pattern
  • Value contains pattern
  • Value does not contain pattern
  • Value matches regular expression
  • Value does not match a regular expression

Most of these are (or should be) self-explanatory.  However, this week I want to take a quick look at the pattern matching expressions.

First, these are not the same as building a regular expression.  I will cover them in the future.  Rather a pattern is much simpler.  In fact, there are only four basic rules to building a pattern.  They are:

  • You can define a position in the string to represent a letter
  • You can define a position in the string to represent a number
  • You can hard-code a position in the string to be a specific special character such as a hyphen
  • You can define a position in the string to be an optional letter or number

For example, one of the patterns I might want to match is for a zip code in the United States:


(Note: I can use any numeric digit 0-9 to represent a number in any position of the string.  Therefore the pattern: 00000 is the same as 12345 is the same as 99999.  They all merely represent a string of five digits.)

This pattern implies that the user must enter a string of 5 digits, not 4 and not 6.  Five digits and only five digits will be accepted.

This means that the following zip code would not be accepted using this method:





The first example is only four digits and is short by one digit.  The second example has 6 digits and has one too many digits.  The remaining two examples include letters and numbers which while valid in some countries is not valid in the United States.  To allow these I would need a patter like:


However, some people may want to enter their zip plus 4 zip code.  To do this, I need to allow for the hyphen between the fifth and the sixth digit.  Therefore, I might use something like:


That pattern would be great if everyone entered their full zip plus 4 zip code.  However, not everyone knows what their full zip code is and while using the full zip code makes the sorting and delivery of mail more efficient, it is not yet required.  Unfortunately, if I specified this pattern for my zip code field domain rule, records with only the standard 5-digit zip code values would fail the rule.

In order to give the user an option as to whether to use either their 5-digit zip code or their 9-digit zip code I could do this:


Notice that the two separate clauses for 5-digit zip code and 9-digit zip codes are joined with an ‘or’ connector.  It should be obvious that an ‘and’ connector would not be appropriate in this case. Anyway, there is a simpler way to accomplish the same rule using a single clause by defining the last 5 characters (the hyphen and the 4 digits) as optional as shown in the following figure.


As seen in this example, I can combine multiple character positions within parenthesis to indicate that they are optional, that includes special characters.

While all of the above examples defined patterns for the entire string to match, I can also define patterns that the domain values must not match for the values to be valid.  Similarly, I can use pattern matching to match a portion of the string by using the value contains pattern option.  Note that values that begin or end with the matching pattern would be valid.  The pattern could even be in the middle of the string.   I can even evaluate strings to be valid if they do not contain a specific pattern anywhere within the string.

However simple pattern matching may be, it has its limitations.  The biggest limitation is that I cannot specify a specific character in a specific position.  For example, I cannot specify a 4-character string in which the first letter can only be ‘A’, ‘B’, ‘C’, or ‘D’.  This type of comparison and many more require the greater complexity of the regular expression clause which I will look at next time.

Until then, c’ya!

A Conversation in the Night, Part 2

“We need to go soon.”

“How much time has passed since I got here?”

“Nearly a week I suppose.”

“I need to talk to our son one last time before we move on to the next phase.”

“Ok, but you need to do it now.  Fortunately, it is nighttime there so he should be asleep.”

I’m sitting in a room that is totally white, white walls, white floor, white ceiling, even two white chairs.  The odd thing is that there does not appear to be any windows and I don’t see a door, even after turning completely around.  But even odder is that I don’t know how I got here or even where here is.  As I turn back toward the two white chairs, I am startled by the appearance of someone sitting in one of the chairs.  I did not hear anyone come in.  I don’t even know how they could have come into this room without doors or windows.  Then I look closer at the person sitting in the chair and was shocked to see that it was my mother.

“Son, come sit down here next to me so we can talk for a bit.”

“How is this possible?  You have…had Alzheimer’s and the last I saw you, you could not talk to anyone.  Then you died a week ago.  You cannot possibly be here.  Wow, there must have been something really bad in that seafood I ate last night.”

“First son, we don’t use terms like ‘died’ here.  And it was nothing you ate.  The physical body may fail, but the soul, the spirit, the essence, or whatever you want to call it moves on.”

“What do you mean moves on?”

“Well that a bit harder to explain in terms that you can understand now.  Let’s just say that the core of what makes you an individual survives.”

“So your spirit survives and you remember everything that happened?  You remember me?  Because the last time I saw you, you did not seem to remember me or anyone else.”

“I remember with more clarity than ever before.”

“So, did it hurt when you died…er…moved on?”

“No, not really.  I’ll tell you what really hurt.  What really hurt was being alive in a shell of a body that could no longer communicate with the ones I loved.  What hurt was seeing the pain in their faces as they looked at me.  What hurt was hearing them talk about me as if I could not hear anything or know what they were saying about me.  You see, my spirit could still hear and see everything.  I just had no way to respond.  For example, I remember the day several months ago when you came to see me with your wife and daughter to tell me that my husband, your father, had died.  I cried so hard inside and wanted to scream out.  But most of all, I wanted to get out of that chair and hold all of you, but it was impossible for me to command my body to get up.  It just sat there and stared ahead.  I’m sure you thought I did not understand what you said.”

“Yes, that is exactly what we thought.  However, I saw a little tear form in the corner of your eye and although I never told my wife, I was sure that maybe something of what we said did get through to you.”

“Then you all got up to leave and I tried to scream out not to go, that I wanted to be with family at least for a little bit more.  I wanted to get the nurses to call you back inside, but no sound came out of my lips.  That is when I started to pray for the end to come.  I prayed to Him to take me out of my misery and let me be reunited with my husband.  I prayed for an end to my suffering which I knew would never improve.  Over the following days, I lost track of time.  Time meant nothing to me anymore.  Days blended into nights as a kaleidoscope of light and dark seemed to pass by me.  Then one day Jake came to me while I slept.”

“At first I was startled and afraid, but Jake comforted me by explaining what had happened to him after he died.  He told me that he had been given time to stay behind and watch over me until my time to join him would come.  It was strange at first talking to him as I slept.  It was as if he was in the same room with me.  But over the weeks that followed, I looked forward to seeing him every night and talking with him about all we had done over the years.”

“You mean like we are talking now?”

“Yes, exactly.”

“Does that mean that I’m dying too?”

“Well, we all die in time, but your time is not quite yet.  I reached out to you because I felt that I needed to talk with you one last time to let you know that everything is OK.”

“What do you mean one last time?”

“When our physical body dies, we all move on to another existence based on the life we led.  Sometimes it is with others that we knew from before.  Sometimes it is with strangers.  You dad and I will be going on our next journey together again and he has been waiting for me.”

At that point I heard a door open.  I swiveled in my chair to look behind me and saw a door open where there was no door before.  Then my dad walked into the room.

“Hi son!  I guess you know that I’ve come for your mother.  It is time for us to go now.”

He walked over and held out his hand to my mom to help her up.  It was then that I noticed that she did not look old anymore.  Neither did my dad.  In fact, they looked young again, rather like they did in their wedding picture.

As my mom got up, I stood too and embraced her begging her not to leave.  My dad put his hand on my shoulder and said that this is the way it has to be and that I had my own family to take care of.  In fact, he said that soon one of them would be needing a lot of my help.  I asked him for details, but he said that they were not allowed to say more and perhaps had already said too much.

As my mom released me from her embrace, I noticed a small tear in the corner of her eye, much like the one I saw months ago when I tried to tell her that her husband had died.  She took my dad’s hand and together they walked to the door.  As they got to the door, they turned to face me and said, “Don’t worry and don’t be sad.  Everything is OK.”  Then they turned and walked out the door.

When I woke up in the morning, I felt as if I had been up all night.  My wife told me that I was mumbling all night, but that she could not understand a word of it.  However, it did keep her up half the night.  She wanted to know if I was having a bad dream about the death of my parents.  I looked at her and said, “Don’t worry.  It’s fine.”

From that day forward, I was at peace with the passing of my parents, now ten years ago.  It doesn’t mean that I don’t miss them, but the pain of their loss is gone because I know they are happy and together again.  I just hope that is true of all those we lose too soon.

End of story.

C’ya next time.

PS.  Saturday’s entry, if it happens at all, will be on Sunday this week due to spending the day in Jacksonville at the 2013 Jacksonville SQL Saturday.

Composite Domains in Data Quality Services – Part 2

Last week I introduced Composite Domains in Data Quality Services and showed that you must first create individual domains on the columns that you want to include in the composite.  When you use the composite many people may at first assume that somehow DQS stores value pairs from each domain from each row of the source table to create the composite domain.  They then think that they can somehow work with those pairs to cleanse other data sources with the same fields (domains).

After some testing, what I found is that each of the fields in the composite domain is treated as the separate domain that was used to define the composite.  When the composite domain is used to cleanse data in a new table, it is not the combination of the fields that DQS uses to cleanse the data, but the individual domains.  Let me use an example to make that clear.

Suppose you have a table of addresses that you want to use with knowledge discovery to create domains for the fields CITY, STATE, and ZIP.  Each record in the source data represents a specific place that can be identified by the combination of these three fields.  Furthermore, your address data set may have thousands of records from the same state.  However, that state value is stored in the knowledge base only a single time.  Similarly, there may be hundreds of records with the same zip code and that zip code is stored in the zip domain one once for each unique value.  Finally, each unique city name is stored once in the city domain even though that city may appear in dozens of states.  For example, I live in Orlando, but which one?  Most people would immediately assume Orlando FL, home of Disney World.  However did you know that all of the following cities also exist.

  • Orlando, Kentucky
  • Orlando, Oklahoma
  • Orlando, West Virginia
  • Orlando, Arkansas
  • Orlando, North Carolina
  • Orlando, New York
  • Orlando, Virginia

…and I may have missed a few.

Furthermore, some small towns near a big city have their own zip code, but you can also use the same zip code with the name of the larger city and mail will still be routed correctly.  (Because mail is routed by zip code, not by the city name on the envelope.)   For example, you could say Belle Isle, FL 32809 or Orlando, FL 32809.  However, you cannot say Belle Isle, FL 32751 because that zip code belongs to Maitland, FL.

So how do you cleanse address data?  Well actually, the best way is to use a service such as one the address databases found in the Azure Marketplace.  However, if you do not have access to the Azure Marketplace or you do not have the money to spend on some of these services, can you roll your own solution?

The compound domain containing city, state, and zip will not satisfy your need.  The problem is that Belle Isle is a valid city in the cities domain and 32751 is a valid zip code in the zip code domain. Therefore, the composite domain will not by itself identify this combination as invalid.  However, you can start to build rules within the composite domain that can identify combinations of values from different domains and ‘correct’ one or both values. Let’s see how to do that.

In the following figure, I am in the Domain Management section of an existing knowledge base that already had domains for city and zip that I previously combined into the composite domain CityZip.  If I click on the CD Rules tab, the grid that appears is initially empty.  However, I can click on the Add a new Domain Rule icon found in the upper right corner above the grid.


In the top grid, I can enter a name for the rule along with a description.  It is always a good practice to keep the name short and the description more details.  Also I want to make sure to check the Active checkbox if I want the rule to be used the next time I use the domain to cleanse a data file.

In the lower grid, I visually build the rule definition.  The left side of the rule definition can be thought of as the IF portion of the rule and the right side is the THEN portion.  A rule must begin with a reference to one of the domains contained in the composite domain.  Then I select what I want to do with the domain value.  You can see from the dropdown list that the number of actions available is large.  I can compare domain values to other values, I can check the length of a domain value, or I can even check of the domain value exists among many other actions.


In this case, I want to look for records that have the city name Taft and the zip code 32824 and change the name of the city from Taft to Orlando.  Therefore, I select the connector: Value is equal to and supply the value: Taft.

With the current clause still selected, I can click the Adds a new condition to the selected clause icon in the upper right corner of the left panel.  I can define multiple clauses for each domain linking the clauses with either an AND or an OR connector.

On the other hand, I may want to add one of the other domains to the rule.  Remember that each domain can appear once and only once in the left panel of the rule.  To add another domain to the rule, you must make sure that none of the current domains are selected.  A clause is selected within a domain when you see a blue box around the clause expression.  You can tell if a domain is currently selected by the vertical line that drops down from the domain line on the left side of the clauses.  If this line is dark blue, the domain is selected.  If this line is light blue-grey, it is not selected.  One way to make sure that no domain or domain clause is selected is to click in the blank area on the right side of the left panel.  You should then be able to click the Adds a new condition to the selected clause which adds a new clause to the rule.

If I want to assume that there may be some zip codes for the city of Taft that are valid and that I do not want to change them, I need to create a compound condition to find only the Taft records with a zip code of 32824.  I can do this by right clicking outside of the first rule in the empty are and add a new condition.  In this second condition, I select the Local_Zip domain and compare it to the value of 32724.

Finally, I can define the action that I want to take when a record is found with both the selected city and zip code.  On the right side of the lower grid, I can select the domain Local_City again set the value equal to Orlando.  This rule when applied during the cleansing process of a source table will look at the field associated with domain Local_City and if the city is Taft and if the field associated with the domain Local_Zip is 32824, it will change the value of the city field in the cleansed data to Orlando.


I can now click the Finish button for this rule.  Similarly I can create similar rules and can even create more complex rules to handle other special situations. I can even create more complex rules to check a single domain than a simple replacement of a synonym value would allow.  Just look back at the different connectors that are available to see how unique you can make your matching criteria before you assign a new value to a field.

For the sake of this example, I will not add any more rules, but will instead use my updated domain to cleanse a file that includes address information using a project as shown in the following figure (I previously talked about how to create a new project and use an existing knowledge base with its domains to cleanse the data and will not repeat those steps here.  For help, refer back to March 30th blog entry.

The following figure shows the first step of the Cleanse process for my project.  You can see that the source table has 282 records in which it corrected one record.


When I click the Next button at the bottom of the Cleanse step, I can see a tabbed form in which each tab shows me records that DQS made suggested changes (usually as a result of the spell checker), new records that it never saw before, invalid records which have no match and no rule to fix the values, corrected records which were updated either with synonyms or rules like the one we created here, and correct values.

By clicking on the Corrected tab, I can see exactly what DQS has attempted to do.  Most importantly, I can choose to accept or reject the correction.  The radio buttons for these options appear to the far right in the top grid (not visible in this picture).


If I accept the corrections, I can then in the next step output the corrected data to an existing or a new SQL table or Excel file as shown in the following figure.  As shown before, I can save just the updated data or the updated data with additional information about the cleansing of any updated record.


So the value of a compound domain is to create rules that use two or more domains to make a decision about a change or may make a change to two or more domains based on a condition in one or more source fields.  It should be clear that defining compound domain rules is a complex task, more complex than merely defining a set of synonyms and a leading value for similar column values.  Furthermore, I would not recommend executing rules for compound domains until I have first cleaned up individual domain values as described in earlier weeks.

Cleansing data should be thought of as an iterative process.  Begin with the easiest data problems to identify and fix synonyms and work your way through more complex updates such as conditional replacements in subsequent passes.  Next time I will take a look at matching records using DQS.  Matching helps you remove duplicate records from a table.  While DQS will not force you to perform cleansing tasks before performing a de-duplication, it is highly recommended and I will try to show you why.

Oh, one final thing.  While I showed the cleansing process last week and this week with simple address data so that you can easily follow the techniques, remember that you may save a substantial amount of time validating addresses against national address databases that may be available on the Internet and the Azure Marketplace.

C’ya next time.

A Conversation in the Night

“Hey Sharon, she’s talking again!”

“Yeah right, Beth.  She hasn’t been able to even put together two words in a single sentence in over a month now and you want me to believe that she is talking.  You said the same thing the last two nights and both times when I got there I heard nothing.”

“I’m not making this up.  Come quick and hear for yourself.”

Sharon got out of her chair by the front desk.  She was the night duty nurse for the home.  She was responsible for 15 Alzheimer patients and the three night staff including herself.  Most nights were pretty quiet, but the last several nights, Beth kept reporting that one of the patients, Agnes, would talk in her sleep.  Talking in one’s sleep would not be such a big deal except that Agnes was in the final stages of Alzheimer disease and was no longer able to communicate at all with the staff. The last two nights when Beth claimed that Agnes was talking, Sharon walked down to the end of the hall where Agnes’s room was and all she heard was incoherent muttering.  Probably would be the same tonight, but she had to go down and check anyway for the record.

“All right Beth, I’m coming” said Sharon as she walked down the hall.  She thought she could hear someone talking and at first thought it might be one of the patients in one of the other rooms.  However, as she entered Agnes’ room, she could not believe her ears.  It really was Agnes.

“Yes Jake, I understand that it is was hard for you, but it was not your fault.  Stop blaming yourself.  There was nothing that you could have done to prevent my getting Alzheimer’s disease.  There still is no cure for it.  I know we both denied it at first.  We both thought that my memory lapses were just simply the result of growing old.  With so much more to remember year after year, it is amazing that we remember as much as we do.  But you cannot blame yourself.  It ruined your health and see where it got you?  I do remember when that nice couple stopped by with that little girl to tell me that you had died.  It tore me apart inside, but I could not express myself.  I had already reached the point in which I could not respond to people talking to me even though inside I desperately wanted to say something.”

A short pause occurred and then Agnes continued.

“Oh yeah.  You’re right.  I just didn’t recognize them.

There was a pause and both Sharon and Beth thought that was the end of it.  They had turned and were about to leave the room when Agnes continued.

“I know that I will never get better, Jake.  So I’ve been asking God to talk with our son and somehow get him to understand that it is time to let me go.  He cannot think selfishly and try to hold me here.  I feel his thoughts preventing me from leaving.”

Another pause of a few seconds.

“Yes, he did respond back and said that as much as he did not want to see me die, that if there was nothing to cure me and if this disease was ultimately fatal, that he was Ok if I passed.”

A short pause, then Agnes continued.

Yes I know how much it affected him when you died, but you died relatively suddenly.  He was not expecting it.  He knows that my end is near and he understands that you are waiting for me to cross over.”

There was another pause of about a minute.  By this time, both Sharon and Beth figured that they were only hearing one side of a conversation and that the pauses represented the times when the other person in the conversation was talking.  So they waited in the dark to listening for anything else that Agnes might say.  But she remained quiet.  So after about ten minutes, they left the room.  Sharon returned to the front desk and Beth continued her rounds checking on the other patients.

“Ok, Jake, they are gone now.   Thanks for telling me that they had come back into the room to listen again.”

“Agnes, my dear loving wife.  I’ve missed you for so long.  I could feel the pain of not having you by my side affecting my health, but I could not let it go.  Since I died, I’ve been by your side, watching over you and hoping for some improvement, but I received word today from Him that the end is close.  So I used all the power I could gather to communicate with you tonight, to help you cross over.”

“Jake, I waited because I want to talk to our son and tell him that everything will be alright.  It will soon just be him, his wife and his daughter.  I need to talk with him one last time to let him know how proud we are of what he has become.”

“You know that is not possible.  Spirits can talk with the souls of the living, but two living souls cannot talk like we are tonight.  There is no way to talk with him.”

“Do you know if He has told our son that the end is near?”

“Yes He has.  It happened a week ago in their time as he was driving home from work.  Our son was overcome with grief at the news and stopped for a few minutes by the side of the road.  But he finally told Him that he understands and accepts the fact that you need to leave.”

“ So everything is set?  I wish he were here.”

“With him living on the other side of the country, there is no way, especially since you do not appear to be terminally ill yet to alert these nurses.  That illness will occur in the next several weeks and would be extremely painful.  He was not told the nature of the illness that would soon consume your body if it were to be kept alive, but he understands that he has been given the chance to petition God to release your spirit prior to the onset of that disease.  He does not know how close the timing is and so will not be able to be by your side.  But don’t worry, I’ll be with you every second through the end.”

“Jake, I guess I’m ready then.  Please hold my hand.”

Back at the nurses station, the monitor for the bed in Agnes’ room started to beep indicating that the patient’s heart had stopped.  Sharon quickly glanced up and seeing the flat line on the monitor rushed from the desk down to Agnes’ room.  On the way, she called to Beth to follow her.  As they reached the room, they thought the saw a soft glow moving out toward the window.  It must have been a passing car.  They checked for a pulse and for any indication of breathing but found no sign.  They knew Agnes was gone.

To be continued…

Composite Domains in Data Quality Services – Part 1

In the last several installments, I have looked at using Data Quality Services to clean individual data columns.  However, sometimes you can have correct data for two or more columns individually but the combination is still wrong.  What do I mean by that?  Let me keep it simply by just referring to two common pieces of information in an address: City and Zip Code.  It is possible to have valid cities in the the City column along with valid zip codes in the ZipCode column, but the zip code and city values on any one record within the table may be correct.  So what can do to check for this problem?  Let’s begin by creating a few domains, the first being for zip code


In this example, I began by creating a new Knowledge Base.  Using the Knowledge Discovery option along with an existing table of data, I started by selecting the source column: ZIP from the ZIPCITY table.

Since this is a new Knowledge Base, I have no existing domain to select from the dropdown in the Domain column of the Mappings grid.  Instead, I clicked the Create a Domain icon immediately above the right side of the mappings grid.  I want to create a new domain called Local_Zip as shown in the above figure.  I did not change any of the other default values other than to turned off the Enable Speller option since that functionality does not apply to a zip code.

I then repeated this process to create a domain called Local_City to store the valid city names from my discovery database.  Since we have seen these steps in previous weeks, I will skip the details here.  If you need help defining new domains, refer back to the blogs from March 23, 2013 and March 30, 2013.

Next, I click on the Create a composite domain icon to the immediate right of the Create Domain icon as shown in the following figure.


I need to create a composite domain that consists of both the city and zip code fields.  In the resulting dialog, shown below, I named the composite domain: ZipCity.


Then from the Domains List that shows the existing domains defined in the current knowledge base, I select the domains I want to work with and move them over to the Domains in Composite Domain list.  The order of the domains here does not matter.


After clicking OK in this dialog, I now see a new domain in the Knowledge Base details panel.  You see the first two individual field domains: Local_City and Local_Zip followed by the ZipCity domain.  You may also notice that the icon to the left of the domain name is also different from the other domains indicating that this domain is a compound domain.  If the domain is not expanded to show the individual domains that compose it, you can click the small blue box to the left of the domain name with the plus sign to open the domain.


I could click the Next button at the bottom right of the screen to proceed to the Discover step.  On this page, I can click the Start button in the upper left to have DQS begin to examine my data source to populate my domains.


After a few seconds (or minutes depending on the table size), the grid at the bottom of the page refreshes to show me the results of the data imported from my data source.  As we saw in previous weeks, I can see the number of new values as well as the number of unique values for each of my domains.  In the figure below, you can see that my table of 49 records had 44 unique zip code values and only 12 unique city values.


When I proceed to the next page (Manage Domain Values), I can review the domain values discovered for each of the single column domains.  I can make corrections at this point to any misspellings or incorrect values.  On the left side of the screen, you see the list of available domains and the number of new values added to the domain.  You may have noticed that the composite domain, ZipCity, does not have a count of new values.  That is because the composite domain is not stored as a combined field.  Rather, it merely references the two domains: Local_City and Local_Zip.  It is not a new combined field.  That distinction will be important in understanding the next step.


The following figure shows the correction of a zip code that had an extra digit.  While simple in of itself, it shows how easily I can correct values beging added to the domain. Alternately, I could have simply flagged this record as bad data, but since I am building a domain, I chose to correct the value instead.


Next, I open a data cleansing project to clean up my table of city and zip code values.  Again I will skip some of the steps here for brevity since they have been covered in previous Saturday blogs on this site.  I begin by mapping the fields from the table I want to clean to my available domains.  However, I cannot map multiple source columns to a composite domain.  Instead, click the button: View/Select Composite Domains from the bottom of the screen that displays a dialog displaying the composite domains.  In my case, there is only one composite domain.  I make sure that I have selected the checkbox to the left of this domain because I want to ‘fix’ any problems with my cities and zip codes.


Clicking first OK to close the Composite Domain dialog and then the Next button to go to the Cleanse step, I click Start to run the cleansing process.  Depending on the amount of data in the table I am trying to clean, this could take from a few seconds to enough time to duck out to the kitchen to get a cup of coffee.


After the cleansing process finishes, I see only one record in my grid at the bottom of the Cleanse screen that lists the field City,Zip and shows the number of corrected and suggested values and I see that both have values of zero.  Does that mean that all of the data in the cleanse table was correct to begin with and where are the other domains?

Let me answer the second question first.  When I clicked the checkbox two figures ago to use the composite domains, this turns off the use of the individual domains.  Similarly, if I uncheck this box to not use the composite domains, the individual domains are used instead.


As to the first question, the absence of corrected and suggested values does not actually mean that all the combinations of cities and zip codes are correct.  At this point, all that DQS has done has to verify that the cities in the source table exist in the Local_Cities domain and that the zip codes found in the source table exist in the Local_Zip domain.  It does not have a way to associate a specific city with a specific zip code since city and zipcode are not really defined as a single field, but rather as a composite domain consisting of two independent domains.  So I may have a valid city name in my source table with a valid zipcode even though that zipcode is not valid for that city.  This is a problem known as cross-domain validation.   By itself, a composite domain does not solve cross-domain validation issues.  However, as we will see in part 2 next week, I can use the composite domain together with some addition rules to solve selected cross-domain validation issues.  When I first encountered this issue, I was surprised that composite domains did not solve this problem, but the more I thought about it and thought about the way the composite domain is defined, the more sense this restriction made.Then I started looking for alternate solutions.

Before I end this week, note that the completeness cell indicates that only 91% of the records are complete and 9% are missing.  In my case, this meant that some of the records in my source table did not have cities or zip codes or possibly both.  Completeness is another component of data quality and having incomplete records distorts your analysis.  Therefore, I will want to either ‘fix’ these records or remove them from the final analysis table.

Next time in part 2 of this this topic, I look at some ways to address the issue of cross domain validation through the use of rules.


Personally Identifiable Information – Beyond SSN

It seems that most people are now aware that collecting SSN information from people and storing it a computer system is a risk to that person’s identity protection.  However, the risk does not stop with SSN.  There are other types of information that you should carefully secure after it is collected and furthermore, you should only collect it if there is a real valid reason why it must be collected.  Collecting information just because you can or just because you might one day need to use it is not an excuse when doing so could result in that person suffering at some point from identity theft.

So what else should be protected?  What about email address.  It seems like everyone wants to collect email addresses these days.  You can hardly log into a site on the web without them asking for your email address.  Some people live their lives happily with only one or two email addresses and they use that email address for everything ranging from securing their phones or other mobile devices, accounts on-line, contact lists, etc.  Now even Windows 8 allows you to use a Hotmail (or Outlook) email address as your desktop login.  How can you protect your email address when you don’t know if the programmers are encrypting that information the same way they encrypt (or should encrypt) SSN values?  Simply go out and get a throwaway email address that you use for logging into web sites.  You can get extra email accounts from many places, but I recommend either Hotmail (Outlook) or Google because you get not only an email address from these two, but you also get a fair amount of online storage where you can store files and documents.  Furthermore, Hotmail (Outlook) also allows you to create alias accounts associated with the main account.  The real advantage of these alias accounts is that you can create rules to redirect all email to these accounts directly to the trash if you have the account only as an email address for sites that require one.  (It is a great way to avoid spam too!)

There are other things that should also be protected such as phone numbers and addresses.  Why do you need to include an address when you register with a web site?  It is not as if they are going to actually send you physical mail is it?  After all, that is one of the reasons why the post office will be stopping Saturday deliveries later this year.  Total mail volume is down.  (You thought not getting all of that junk mail was a good thing!) Junk mail paid for a major portion of the post office expenses.  But it is not just that.  People don’t send letters or cards anymore.  They use email message, instant messages, texting, and e-cards.  Many people don’t get physical magazines either because they read most magazines online and not have to guess what is missing from the torn pages of magazines that get shredded during the delivery process.  People also pay bills online rather than sending checks.  And the list goes on.  So why ask for an address that really isn’t needed, but could be used to locate where a person lives.   Imagine a fictitious criminal organization that buys customer information including email addresses from high-end online sales companies so they can target which communities and even homes are more likely to have valuable stuff that their contracted thugs can burglarize.

Phone numbers to some extent fall into that same category.  It is one thing perhaps for the company on whose site you registered to give you a call about their products or services.  However, it is quite another when they sell that information to information brokers to make a few extra bucks because you didn’t buy anything anyway (or even if you did).

Ok, so we are not going to solve all the privacy problems here.  However, I still would encourage those of you who develop databases or who manage databases to consider encrypting more than just the SSN and passwords of people from whom you collect personal information.  Also, please consider if you are collecting that information because it is something that you or your organization will actually act on or whether you are merely collecting it because you just threw it into the pot along with the kitchen sink.

By sharepointmike Posted in Finance

Removing Your Test DQS Projects

This week I have a short but important piece of information to share with you as you begin to explore Data Quality Services with some test data and test projects.  If you have not only been following along with my examples over the last couple of weeks, but have gone off to create your own, you may have several projects and domains that you now want to get rid of as you move more into some real life examples with your own data.  At first you may be confused as to how to remove your existing projects.  Let’s take a look at how you can clean out these tests.

In the following figure, you can see that I have a few projects that I previously created that now appear in the Data Quality Project panel under the Recent Data Quality Project group of the main DQS client screen.


Let’s assume that I want to remove the project test2.  If I trying right clicking on the project name or the project icon to the left of the name, nothing happens.  If I left click on the project name or project, the project opens taking me directly to the export screen as shown in the following figure.


So that is no help at all.  Rather, you need to click on the group heading: Open Data Quality Project to display the following grid which shows all of your current projects.


Now right click on the project you want to remove and select Delete from the menu.


When you return to the main DQS client page, you will see that the project test2 is no longer in the list of recent projects.


In a similar fashion, you can get rid of an entire knowledge base that you have been testing by click the Open Knowledge Base group header in the Knowledge Base Management panel of the main DQS client screen.


But what if you only wanted to get rid of one of the domains in one of the knowledge bases? First open the Knowledge Base that contains the domain that you want to delete.  Then on the right side notice the list of domains associated with the currently selected Knowledge Base.  You do not need to select the domain here.  In fact, there is not benefit to selecting the domain here.  Rather, immediately select Domain Management from the Select Activity list as shown in the following figure.


On the next screen, first select the domain that you want to delete and then click the Delete the Selected Domain icon as shown in the next figure.


On last point of interest for this week, you might want to select a project to delete after clicking Open Data Quality Projects group as mentioned earlier, but after right clicking on the project name, the option to delete the project is not available.  This occurs if you have been working on a project and close the project without completing it.  The project is then left in a locked state.  The DQS client does this so that you can return to the project later to complete your work where you left off.  If you intend to delete the project anyway, there is no need to complete it now.  Rather, select Unlock from the dropdown menu that appears when you right click on the project.  After unlocking it, you can right click on the project a second time and delete it.

That’s it for this week.  C’ya next time.

Capturing Customers at IKEA

You may not all have a local IKEA store so I will start with a little background.  IKEA is a store that sells ‘affordable’ home furniture and household items. Yes, a lot of the furniture requires ‘some’ assembly.  While it may not be as sturdy or durable (especially if you have parts left over when you are done) as furniture that costs several times as much, it fits the immediate needs of many growing families in which the furniture may not be expected or need to last forever as their children grow and their needs change.  However IKEA has been known for innovation.  First it was the flushless toilets.  Now new for 2013 it is the exitless restaurant.

Yes, most IKEA locations have a restaurant.  They serve a variety of items from breakfast to lunch to dinner.  No, it’s not fancy.  However, this is the same company that recently made news over the culinary innovation of introducting horsemeat in meatballs.  (Did you know that horsemeat has a lower fat content than regular beef?) On the other hand, restaurant service is quick because you pick up your food while walking through a cafeteria style line just like the old days of going to school.   That means there is not a lot of variety.  However, for the price, it is better than going to most fast food places.

Up until recently, it was possible to pop in to get a quick breakfast, lunch or dinner and leave without going through the rest of the store which quite frankly is laid out by the same mad scientist who makes mazes to test whether rats can find their cheese two times in a row.  Yes, sometimes, it is just about getting something to eat.  You could go up to the restaurant by escalator or elevator (for those who cannot take the escalator).  Previously, you could also come back down and leave the store by the same elevator (there is no escalator down which is rather inconvenient for those who may not be able to do stairs).

Not any more.  You cannot just get something to eat and leave.  Recently, they have covered over the elevator buttons so you have no way to leave the restaurant until the store opens.  (Oh, did I forgot to mention that the restaurant opens 30 minutes before the rest of the store.)  Even if you came for one of those wonderful plates of meatballs, you now have to walk through the maze of their store in an attempt to find the shortest path out of the building.  I think it is part of a foreign psychology experiment.  Now I know they want to take advantage of their captive audience, attracted to the store not by a block of cheese, but by a quick and inexpensive meal to force them to pass all of the merchandize on their struggle to find a way out of the store in the hope that some of them might say, ‘Hey, look at that!  I’ve always wanted one of those.’

But what if someone had to leave really quick.  Suppose there was a medical emergency and they had to get out the fastest way possible.  IKEA has blocked the exits. They don’t care if someone is possibly sick or maybe even dying.  They only want to make sure that you wind your way through the store aisles looking at all the merchandise.

Stores with restaurants in them are not all that unusual.  Many large department stores have had them.  So have some Five and Dime stores.  Granted, most of these in the past were only open during regular store hours, but I’ve never seen something like this before where if you wanted to go to the restaurant, that you had to walk through the store aisles to get back out again.  In similar cases where restaurants are located in enclosed malls, the restaurant often has an entrance/exit directly to the outside, not just to the interior of the mall.

While I’ve been making fun of these changes, the real issue on this April 1, 2013 is one of safety.  There are days when the restaurant is fairly full before the rest of the store opens.  Should there be an emergency requiring people to get out quickly, how would they get out with only an up escalator to try to go down on and an elevator in which the buttons have been covered?  Can an ambulance crew get that gurney up the escalator? Is this really safe?  Are there no rules to protect customers patronizing a place of business?  Do they not have an expectation that they will be safe?  If it is so important to have two exits to every room in public buildings, can businesses simply ignore this rule?  I feel sorry for those in wheelchairs or walkers who must wait for the store to open before they can leave.  Even mothers with strollers have to wait until the store opens if their child becomes fussy and they need to leave.  Perhaps a store has a right to do whatever they want in their own property, but should they be required to post a sign at the bottom of the escalator stating that there is no exit from the second floor until the store opens allowing people to make an informed choice?

All these questions to ponder for the first day of April and no answers to offer.