Governance For All

I was thinking about SharePoint governance today and its effect on the success or failure of SharePoint. Fundamentally, SharePoint governance is a document that forms an agreement between the people who support SharePoint and the rest of the organization. It then dawned on me that I could substitute SharePoint for any other system. Governance is an umbrella covering many things such as policies, rules, roles, and responsibilities. Within that umbrella, I like to think of these four areas as the four legs of a chair, the chair being the system. Failure of any one leg may not make the chair crash, but it will make it less stable. Failure of more than one leg can cause the chair to collapse.

The same is true of SharePoint governance. Without adequate policies, rules, roles, and responsibilities, your SharePoint implementation may be on the verge of collapse. The hidden problem at many organizations is that even if you initially create a sound governance policy, if it is not followed and enforced, it may be as bad as not having a governance policy at all. What may begin innocently enough as a minor exception to the rules or policies here and there can quickly slide down the slippery slope to total chaos. Stay with me a minute.

While governance may seem to be filled with a lot of rigid policies and rules, governance, whether for SharePoint or for any other major system in your organization, must be flexible. It must be almost an organic thing. By that I mean that it must be adaptable to the changing needs of the organization. A change in needs might be something major such as the addition of outward-facing web sites to your existing farm of collaboration sites, or it may be something as simple as a change in the font used for body content. The thing to remember is that all change sets a precedent. The question you should ask is whether that is a good precedent you want to set or a bad one.

Once policies and rules are established and made part of the governance, changes or exceptions to those policies need to be carefully vetted and then added to the governance so they do not open the door to other changes that may be less desirable. For example, if the initial governance called for Arial, exceptions might be made for similar fonts such as Calibri. However, as soon as you begin allowing Times New Roman or Impact, you are on the path to allowing Broadway, Playbill, and even Comic Sans. But governance can stop that by specifying that any font from a limited list of fonts can be used, but only one font per page or per subsite. The same can be said for colors, page layouts (themes), and other visual effects. In addition, the governance should define a method to request changes to policies and rules. Perhaps those requests get funneled through a single person, but more likely they should go to a committee that decides on any governance changes.

However, even if you have all the policies, rules, roles, and responsibilities well defined and you have a committee to review and approve or reject any changes to the governance, if you do not have upper management’s full support of the governance, your efforts are in vain. Like a dragon without fiery breath or at least large, sharp teeth, governance without enforcement has no bite; users will soon realize that they can get away with anything with no more consequence than your pitiful cry to never do that again. You must have management’s support to enforce the governance.

The problem is that maybe management just came back from a conference, or perhaps from a free seminar provided by an alternative vendor who hosted a lunch ‘n learn at the local steak restaurant. While the product demoed does basically the same thing as your current product, marketing demos are very good (or they should be) at highlighting the best features of a product, making it look like the newest and most desirable product on the market. Rarely does the demo go into how much time or skill was needed to set it up. Of course demos are meant to look flawless and so simple that your dog could do it with one paw tied behind its back. It is your job to be able to tell the difference between a well-crafted demo and a system that is truly easy to use before your organization starts spending money on something it really does not need.

But here is another big ugly secret. No system can be successful without governance and top management support. Individuals resist change. They will refuse to learn how to use a system. They will claim a system is too hard to learn or to use. They will blame their own mistakes on the system, on restrictive governance policies, or on the lack of training that fits their exact circumstance. Management must lead from the top down with strong reasons why the policies and rules will serve both the organization and its users better. Governance gives management a way to make that happen by formalizing the policies, rules, roles, and responsibilities of any major undertaking. Governance is leadership by getting out in front of everyone and saying, ‘Follow me and be successful.’

Well, I’m off to another SQL Saturday this weekend, this time in South Florida. How does this fit into today’s mini topic? At SQL Saturday there will be hundreds of individuals who, as DBAs or developers, use SQL Server and have policies and rules they follow when using it. They have specific roles defining what they can and cannot do. They have certain responsibilities to perform specific tasks using SQL Server. What makes SQL Server work or not work within their organizations is largely dependent on how well they embrace the governance around creating, using, and maintaining databases and the processes that access those databases within their organizations.

You see, it is not just about SharePoint or SQL Server or any other specific product. The need for governance applies across all products and activities in the organization, as does the need for management support of that activity. No organization would arbitrarily switch from using SQL Server to Hadoop to Informix to Ingres to Oracle to Postgres to Access or any other database system every couple of years. Rather, it would choose a product that best fits its needs and wring out all the value it can by ensuring that its staff is trained and supported in its use. Don’t expect a new product to somehow produce magical results. If anyone is responsible for magical results, it is an organization’s people. What does your governance say about how you use technology and your staff?

C’ya next time.

Data Warehouse Patterns

My friend Kevin Goff is doing a one-hour webinar on April 23rd for SQL PASS (the DBA Virtual Chapter) on Data Warehouse Patterns with the MS SQL/BI tools. To attend, go to: https://attendee.gotowebinar.com/register/2504947330621706498.

Abstract:

You’ve read some of the content of well-known Data Warehousing books – now what? How do you take the structures and disciplines inside such books as the Ralph Kimball series and implement them? This session will take several of the key concepts in Data Warehousing and demonstrate some implementations using SQL Server Integration Services and Analysis Services. We’ll look at items such as Type 2 Slowly Changing Dimensions, Conformed Dimensions, Dimension Outriggers, Role-Playing Relationships, Many-to-Many Bridge table relationships, Factless Fact tables, Handling of NULL foreign key values, and more.

Bio: Kevin S. Goff (http://www.kevinsgoff.net) is a Microsoft SQL Server MVP and a database architect/developer/speaker/author. He is a regular columnist for CoDe Magazine on .NET, SQL Server, and Business Intelligence topics. He speaks frequently at community events in the Mid-Atlantic region. He also hosts BDBI Radio (http://www.bdbiradio.com), a weekly webcast on SQL/BI topics.

Trapping Errors in DAX

Several weeks ago I talked about finding errors in expressions because of the way DAX uses the operators in an expression to automatically convert values to either strings or numbers. I also looked at using the BLANK() function to find and account for missing values in data. But then I got sidetracked by preparing to go to the SQL PASS Summit. So now I want to finish this group of entries by examining how to trap errors and then either correct them or at the very least provide a value that will not cause additional calculation problems.

The first function I want to look at is the IFERROR() function. This function has only two parameters. The first parameter is the expression to test. If the expression evaluates without an error, IFERROR() returns that calculated value. However, if the expression results in an error, the function returns the second parameter. How does this work? Suppose you have a simple expression such as the following for a calculated column named DiscountAmount representing the discount amount of the item:

= PURCHASEORDER[ListPrice] * PURCHASEORDER[DiscountPercent]

In most cases, this expression would evaluate correctly. Even if the DiscountPercent or ListPrice values were left blank, we learned when discussing blank data that DAX assumes a value of '0', which in this case could still be considered correct. However, what if one of the values in the DiscountPercent column was not numeric and not a blank? What if the value was a string, such as a DiscountPercent of 'NA'? Then DAX would register an error for the calculated column and not display any of the calculated values. Resolving this problem by visually examining the dataset may work when you have a small dataset with only a few records because you can scroll through the data and quickly find the string value that caused the problem. However, if you have millions of data rows, finding the cause of the error may be just a bit more difficult and time consuming. With DAX we can easily solve this problem by using the IFERROR() function in the expression as shown below:
= IFERROR( PURCHASEORDER[ListPrice] * PURCHASEORDER[DiscountPercent], 0)

In this case, the IFERROR() function tests the value of the expression, and if an error occurs because one or both values in the expression cannot be converted to numeric values, it traps the error and returns the value of the second parameter rather than the result of the expression in the first parameter. On the other hand, returning a value of ‘0’ may not be what you want. After all, a value of ‘0’ implies that no discount is applied. However, in this case, we really don’t know why the discount amount is not calculated, only that it results in an error. Therefore, you may want to return a blank value as in the following expression.

= IFERROR( PURCHASEORDER[ListPrice] * PURCHASEORDER[DiscountPercent], BLANK() )

Using the IFERROR() function works best if the tested expression is also the expression that you want to return when no error occurs. But what if the expression you want to test is different from the expression you want to return when the test passes? In this case you want to use the ISERROR() function to test for an error. The following expression tests for an error when multiplying the ListPrice by the DiscountPercent in the PURCHASEORDER table.

= ISERROR( PURCHASEORDER[ListPrice] * PURCHASEORDER[DiscountPercent] )

By itself, this expression just returns a value of TRUE or FALSE. To use this in an expression that returns something other than just the product of these two values, I will use the IF() function to define what value to return based on the Boolean result of the ISERROR() function. If ISERROR() returns a value of true indicating an error, the IF() expression returns the value of the second parameter. Otherwise, it returns the value in the third parameter.

= IF(ISERROR( PURCHASEORDER[ListPrice] * PURCHASEORDER[DiscountPercent] ), 0, PURCHASEORDER[ListPrice] * (1 - PURCHASEORDER[DiscountPercent]) * PURCHASEORDER[Quantity] )

Testing your expressions may seem much more complex than just entering the expressions you want to calculate. You may also think that performing these tests against every row of a large table must degrade performance, and you would be correct. In practical use, I do not include a lot of test functions such as those described here unless I encounter a problem with a calculated column or suspect that the data in the column may not be clean. If you cannot trust your data sources, then you really don’t have much choice but to test. Perhaps an alternate approach to adding IFERROR(), ISERROR(), or IF() functions to the calculations is to perform data cleansing first using the DQS tool that is now part of SQL Server. Of course, setting up a DQS knowledge base and then using it to find and correct data issues can also be time consuming. You may even want to consider tightening the validation rules in the application that captures the data so that errors are trapped as they are entered rather than identified and fixed later.

C’ya next time.

What Have I Been Up To?

Last week I was at the PASS Summit in Charlotte, North Carolina. It was my second time attending the summit and also my second time speaking there. I think they said that this was something like the 15th year for the summit. There is nothing like getting together with a couple thousand other professionals who have professional and/or personal interests in the same things you do. It is an opportunity to talk with colleagues that you may only communicate with through email or Twitter throughout the rest of the year. It is also a chance to reconnect with existing friends and to make new friends who share the same interests as you.

This year, the number of sessions dedicated to business intelligence and the Microsoft BI stack has grown. Part of that is that the BI stack itself has expanded, with everything from the tabular model in SSAS to augment the multidimensional model to Data Quality Services (DQS), Master Data Services (MDS), Power Pivot, Power View, Power Map, Power Query, and Power Q&A. (If you would like to try some of the latest members of the BI stack, go to: http://office.microsoft.com/en-us/excel/power-bi-FX104080667.aspx ). Of course, PASS also supports a separate BI conference that is scheduled for the spring. The growing interest in business intelligence was especially gratifying to me because of my personal interest in the entire stack of tools since the introduction of PowerPivot several years ago.

Personally, I’m looking forward to one more SQL Saturday this year, the Tampa SQL Saturday, BI Edition (http://www.sqlsaturday.com/248/eventhome.aspx ) which happens on November 9th.  I will also be speaking there so if any of you have an interest in BI topics, come out to Tampa in a little over 2 weeks and say, ‘Hi!’.  The BI edition of SQL Saturday was founded in Tampa several years ago by Jose Chinchilla and I’m glad to have been a part of each subsequent annual event.

Other events that are coming up are the monthly meetings of both Orlando SQL groups. The south group, MagicPASS (http://magicpass.sqlpass.org/ although I don’t think the web site has been updated), will be meeting in Celebration, FL on Wednesday, October 23rd, and the north group, OrlandoPASS (http://orlando.sqlpass.org/ ), will be meeting in Sanford, FL on Thursday, October 24th. If you are more into SharePoint, there is a meeting of the Tampa SharePoint group on Tuesday, October 22nd from 6:00 PM to 8:30 PM at the Microsoft Office, Tampa (https://www.eventbrite.com/event/8853808981/). It is a busy week as you can see.

The other news that you may have already heard is that CTP2 of SQL Server 2014 was released last week during the PASS Summit. I have not looked into 2014 yet so I don’t have a recommendation one way or the other. However, if you have processes that would benefit from improved performance, the new Hekaton engine in SQL 2014 may be able to provide some performance improvement with very little effort (read that as no program changes). During the Summit, the Thursday keynote was presented by David DeWitt, who tried to explain the technology behind how Hekaton achieves its performance gains. Most people left the keynote with their heads hurting from trying to understand the magic behind the technology. However, it seems the magic can be summarized as the elimination of latches and locks, which reduces contention when accessing data, combined with the efficiencies of the new column store method, which allows more data to be read into memory and reduces disk access for many operations. Of course, that may be too simple and is probably incomplete. After all, my head started to hurt as well.

Well, that’s it for this week.  I guess it was more of a summary of what I’ve been up to and why I did not have much time to post any blogs for the last two weeks.  I’m sure my text here does not even begin to do justice to the excitement of the summit,  so perhaps it would be better if I just left you to go out to the summit site and watch some of the interviews and keynote talks here: http://www.sqlpass.org/summit/2013/PASStv.aspx.

C’ya next time.

A Short Guide To Installing MDS 2012

Before I can talk about using MDS (Master Data Services), I feel a small responsibility to help you get MDS up and running. Although MDS has been available since SQL Server 2008 R2, the method of installation has changed from using a separate installer file (.msi) to being baked into the installation wizard of SQL Server itself. Fortunately, that change makes installation easier. However, I will focus on the SQL Server 2012 installation here.

If you did not install all Shared Features, you may want to reinsert your SQL installation disk and step through the screens until you reach the Feature Selection screen as shown in the following figure. If the Master Data Services option is already selected, that means that you previously installed MDS as part of your current SQL Server installation. If it is not selected, you must select this Shared Feature and then proceed through the rest of the installation screens to add the necessary bits to your server.

After MDS has been installed, you must configure it. To do this, open the Master Data Services Configuration Manager. You can find this application in the Microsoft SQL Server 2012 folder under the Master Data Services folder. Note that before you can run the MDS configuration, a check for prerequisites is made on the machine. MDS requires Windows PowerShell as well as Internet Information Services. In fact, if IIS is not installed, selecting MDS during the SQL install may not install MDS correctly and you will be prompted to reinstall it.

Assuming that your machine has all of the necessary prerequisites, you can proceed to the next step. The configuration manager consists of two pages of properties. The first page is the Database Configuration. Just like DQS (Data Quality Services), you must select or create an MDS database that MDS can use for its working area. To do this, you must define the database instance and the specific database to be used for Master Data Services. For now, we can leave the other System Settings at their default values.

The second page of the configuration manager prompts you to create the web site to be used with MDS (which is why you need IIS on the machine running MDS). You must also identify the database created on the first page of the configuration manager. Finally, you choose whether you want to integrate your MDS installation with Data Quality Services (DQS). This option only exists in SQL Server 2012 because DQS is new to SQL Server 2012.

At this point, you should be able to start up MDS. The main screen shown below adds a feature in 2012 that did not exist in the 2008 R2 version, the ability to use an Add-in for Microsoft Excel to help you load and administer your data.

WOW, it sure seems like Microsoft Excel is getting used throughout more and more of the Microsoft BI stack. First PowerPivot and Power View, then the tabular model in SSAS and now MDS. If you never took the time to learn Excel because you thought you were a SQL guru and did not need Excel, perhaps now is the time to brush up on your Excel knowledge.

Well, I’m going to stop here for this week so you can get MDS set up on your machine. Besides, it is Labor Day weekend and most of my United States readers probably have better things to do this weekend.

C’ya next time.

MDS: The New DQS? Not Really!

First, you need to know that the MDS I’m talking about here is Master Data Services, introduced in SQL Server 2008 R2. I’ve already talked about Data Quality Services (DQS), which I believe was only introduced in SQL Server 2012. So the natural questions are: what is MDS, why do I need it together with the newer DQS, and is DQS just a replacement for MDS so that we can forget about learning MDS now that we have DQS?

Previously I talked about DQS and showed how you can use it to create individual knowledge base domains that can then be used to clean data. Applying DQS on a regular basis to clean data going into a production database should improve your overall efficiency compared to manually cleaning the same data. This is because DQS remembers the corrections you made previously. You can also define rules for ‘fixing’ individual domains. Efficiency improves over time because the knowledge base understands how to handle an increasing number of problems without requiring manual intervention. However, at the same time, users have to be realistic. There will probably never come a time when absolutely all possible data issues have been addressed, allowing for fully automatic data cleansing.

You could think of DQS as the detergent that helps you clean bad data from a data source, getting rid of invalid and misspelled data. It can also help you reduce duplicate records within that individual data source. It can even output a data file of cleansed information with duplicate records removed or at least marked. However, DQS provides no assistance in merging the clean output data file with another data source. The ability to combine a freshly cleaned data source with a master data file is what Master Data Services can provide. To some extent, both products have the ability to cleanse and match data from an input data source. However, the rules in DQS are domain (column) centric while the rules in MDS are typically more record centric. Also, MDS, as the name implies, helps you import data from new sources into a common master table. Another way of saying this is that MDS might be the gatekeeper of your enterprise data system that controls what data can or cannot be added to your data warehouse.

Because both DQS and MDS provide the ability to define cleansing rules, users may jump to the conclusion that the two products are interchangeable. However, the purpose of the cleansing rules in each product needs to be clarified. The purpose of the cleansing rules in DQS is primarily to correct and validate values added to each domain. DQS rules can determine which values are allowed or not allowed in the domain. These rules are not record based, but column based.

On the other hand, the business rules created in MDS exist primarily to prevent invalid records from being added to a master data table. In other words, they keep invalid records from entering or updating the master set of records, which may sit in your enterprise data warehouse.

Based on this description, you should see that DQS and MDS are clearly different products. Furthermore, you may not always need to use both of them. If your purpose is merely to clean bad data from a data entry system and possibly remove duplicate records, you may be perfectly satisfied with using only DQS. Similarly, if your task is to merge clean data from multiple data sources, such as sales data from different countries, you may be more interested in the abilities of MDS to combine records based on business rules that prevent record duplication while allowing data to be merged. Of course, if the data sources that you want to merge have not first been cleaned, you may need to run them through a DQS project with a corresponding knowledge base to ‘fix’ any domain issues, such as different product codes for the same product in different countries or perhaps differences in the way customer address information is recorded.

In future weeks, I will take you on an exploration of MDS in between some other technical topics that I want to cover. Before I end today, I want to remind all my readers of two events coming up very soon where I can meet you.

The first is the SQL Saturday Orlando event coming up on September 14th. SQL Saturday events were started right here in Orlando back in November 2007 by Andy Warren. I’ve been fortunate to have been a speaker at every Orlando event. This year I will be presenting an introduction to DQS that I call ‘Putting Quality Back Into Your Data’.

The second big event coming up is the 2013 PASS Summit that this year is being held in Charlotte, North Carolina during the week of October 15-18.  This will be my second year speaking at the PASS Summit.  If you have never been to a PASS Summit, you are missing the opportunity to meet, talk with and listen to presentations from SQL professionals from around the world.  I feel very honored to be able to give two presentations this year, one on DQS and the other on using SSAS’s Tabular Model to build Cubes that you can manipulate as Excel PowerPivot models.

If you have the opportunity to get to either of these events, please stop by to say Hello.  I would love to meet you all.


C’ya next time.


Create a Matching Policy – 2

Last week I left you with the thought that an exact match can be either a prerequisite or just one of several domains used to match records. The purpose of using an exact match as a prerequisite is that the domain is so crucial to determining whether two or more records match that if the value of this domain is different between two records, there is no possibility for the records to match. Therefore, there is no need to check any of the other domains when this prerequisite domain is different. This may seem counter-intuitive at first, but we must remember that we are not looking for unique records here, but matching records, and eliminating records by using a prerequisite will speed up the matching process.

Matching07

While you can have more than one prerequisite domain, not every domain that you are trying to match on should be a prerequisite field.  In fact, DQS requires that you have at least one domain which is not a prerequisite.  Fortunately, fields like business name and business address should not be prerequisites because there is a high probability that these could be misspelled or formatted differently.  For example, Adani & Baez Computers Inc. is probably the same business as Adani and Baez Computers Inc. and the same as Adani & Baez Inc. even though they are not exact matches.  If we used the Business Name field as a prerequisite exact match, these three names would not be considered a match.  A similar argument could be made for Business Address and even Business City.

Therefore, we would add these additional domain elements to our matching rule. Each domain element other than any prerequisite domains must have a weight factor. The weight factor determines how important matching domain values are in determining whether two records are a match. Initially, you may have no real preference for how to weight the additional domain values. A good place to start, however, is to make all the domain weights equal. Therefore, if you have five additional domains, make each one 20%. Note that the total of the weighting factors must add up to 100%.

Matching08

After evaluating all the domains for each pair of records, DQS calculates a score between 0 and 100 to represent the chance that the two records are a match. A score of 100 would indicate an exact match across all the domains used in the rule, but most pairs of records will have a value less than 100. How much less than 100 still indicates a match? DQS uses a Minimum Matching Score to filter the potential matches. For reasons beyond the scope of this blog, DQS places a minimum score limit of 80% on matching records. While I can change the Minimum Matching Score to any value between 80 and 100, I cannot set it any lower than 80. This value can be found below the Rule Name and Description as seen in the following figure.

Another option you must set is whether a record can appear in only a single matching set of records or whether it can belong to more than one set of matching records. The first option is called non-overlapping clusters and the second option is called overlapping clusters. With non-overlapping clusters, each record appears in only one cluster or is considered a unique record. This is the option I would usually select if my goal was to eliminate matching records. However, using overlapping clusters might be a good way to analyze other relationships between records. For my analysis here, I will use non-overlapping clusters.

Matching10

Next, I click the Start button shown in the previous figure to begin the analysis. This process can take a substantial amount of time if I am working with a large database table. Therefore, I recommend creating a smaller representative table to test out my matching rule definition before running the final rule against my full table. After the matching process completes, a grid appears at the bottom of the screen. In this grid, which is filtered on the matching records, you can see how DQS has grouped similar records based on the matching rule. Each matching group is called a cluster and is given a cluster id that has the same value as the first record id added to the cluster.

Matching12

Clicking on the Profiler tab provides some interesting results. In the figure below, I can see that my test table had nearly 8000 records. In the grid to the right, I can see the number of new and unique values for each of my domains. New values are domain values that I may want to go back and add to the knowledge base domains if they truly are unique and not just bad data values. The Unique column tells me how many of the 8000 records had unique domain values. I can see that there are only 5129 unique business names. That means I should have nearly 3000 potential duplicate records based only on the Business Name. Looking at Business Address, there are 6683 unique addresses. This tells me that there are at least some business records with the same business name but different addresses. This is entirely possible if the business has more than one location and may not be an error. On the other hand, of 8000 records, there are only 224 unique city values, 32 unique states, and 465 unique zip codes.

Matching13

The low rate of unique values for these three fields tells me that their contribution in determining matching records is low. At the same time, it makes each of these domains a candidate for an exact match or even a prerequisite match. In fact, that is why I selected Business State to be an exact prerequisite match.

If I previously cleaned my data elements to remove inconsistent abbreviations and spelling problems, I would want to look at the fields with the most unique values as potentially more important when determining matching records. However, just because a domain has a high uniqueness value does not make it a good matching value. Take Business Phone as an example of a poor domain for determining matching records. Because a business could have more than one phone number at a given address, different phone numbers do not necessarily mean that the records are not a match. Even a different address may not be as important as having the same business name. Therefore, I use the uniqueness values as a guide, not an absolute, to determine how I might want to reassign weights to each domain in the rule instead of weighting them all equally.

Another thing to consider is to examine each cluster and determine why some of the matching scores are less than 100%. In the figure below, I have highlighted a record with only a 96% match score. To find out why this score is not 100%, I can right-click on the row to display the details.

Matching14

The Matching Score Details dialog shows the two records being compared at the top along with the overall matching score. In the second grid, I see the other domains with their contributing scores. In the figure below, I can see that the phone number score was only 79%. However, looking at the two phone numbers, I can see that the difference was most likely a key entry mistake. In any case, the 79% score on this domain is only worth 20% of the total matching score. Since the other domains are exact matches, I have:

0.2 * 100 + 0.2 * 100 + 0.2 * 100 + 0.2 * 100 + 0.2 * 79 = 95.8, which rounds to the 96% match score shown above
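
To sanity-check that arithmetic, here is a tiny sketch of the weighted-average idea in Python (my own illustration of the calculation described above, not output from DQS):

# Sketch: combine the per-domain scores using the rule's weights, which must total 100%.
weights = [0.2, 0.2, 0.2, 0.2, 0.2]   # five domains, 20% each
scores = [100, 100, 100, 100, 79]     # four exact matches plus the 79% phone number score
overall = sum(w * s for w, s in zip(weights, scores))
print(round(overall, 1))              # 95.8, which appears as the 96% score in the dialog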

But perhaps more importantly, this tells me that the phone number domain may not have been cleaned. If I do not have a reliable way to clean it without spending a lot of time or money, I may simply want to de-emphasize the importance of the Business Phone domain and rerun the matching rule.

Matching15

Finally, when I am done, I can proceed to the next page of the matching policy definition, where I will run the rule one last time.

Matching18

Clicking Finish will allow me to publish the matching rule back to the Knowledge Base as shown in the following dialog.

Matching19

After the publishing process completes, the final dialog informs me of its success as shown below.

Matching20

I now have a matching policy saved with the business address domains that I can use in a project with the full database table.

C’ya next time.

Create a Matching Policy – 1

No, this is not an on-line dating service. Rather, I am trying to analyze a table with duplicate records. These duplicate records can be just as bad for my pivot table analysis as data with bad values. Earlier this year I showed the basics of how to use Data Quality Services in SQL Server 2012 to create knowledge bases that could then be used to clean other data tables. However, cleaning bad data out of my analysis tables does not alone guarantee good results. Another potential problem is duplicate data.

Duplicate data can come about for a variety of reasons. Someone may load the same table more than once into a master table. Depending on the way indexes are defined in the master table, an error may not be thrown. How does that happen if I have a primary key? Perhaps my master table generates a unique surrogate key for each record added to the table rather than using the existing primary keys. This technique is often used when adding data from multiple data sources, because the primary keys that were unique in each source may no longer be unique across all of the combined data. Therefore, I cannot use the primary key from the individual data sources in the master table and instead generate a unique surrogate key.
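
To picture why the primary key does not protect you here, consider a quick sketch in Python (the business names and tables are made up for illustration and not tied to any particular system):

# Sketch: ids that were unique within each source collide when the sources are combined,
# so the master table hands out its own surrogate key instead of reusing the source keys.
source_a = [{'id': 1, 'name': 'Adani & Baez Computers Inc.'}, {'id': 2, 'name': 'Northside Hardware'}]
source_b = [{'id': 1, 'name': 'Lakeview Printing'}]   # same id as a record in source_a

master = []
for surrogate_key, row in enumerate(source_a + source_b, start=1):
    master.append({'key': surrogate_key, 'source_id': row['id'], 'name': row['name']})

for row in master:
    print(row)   # three rows, each with its own key, even though two rows share source_id 1

Load source_a a second time by mistake and the same loop happily hands out three more surrogate keys, which is exactly how duplicate records slip past the primary key.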

Another reason might be that I am pulling data from a legacy sales system that recorded the customer’s address with every purchase.  If any one customer made two or more purchases, their address data would appear multiple times.  In the new system, I want to maintain a central customer table to store the customer’s information once.  Therefore, I have to identify the duplicate customer addresses and save only one address per customer.

I might also have a table of businesses that have contributed to my charity. Over the years, I may have received donations from the same business multiple times. Therefore, there are multiple records in the contributions table, one for each year. Now I want to consolidate the businesses that I have received donations from so that they only appear once in the final table.

Let’s assume that I previously loaded a master business table and I have already gone through the DQS process to create a clean (or as clean as reasonably possible) knowledge base.  The next step is to open the DQS Client and click on the arrow to the right of my Knowledge Base (Business Master in this case) to open the popup menu shown in the next figure.

Matching01

The next step is to select a table to use to define and test a matching policy. This table can be the original table used by the knowledge discovery step to create the original knowledge base. Alternatively, it can be the table in which I want to search for duplicates. Even if I opt to use the table that contains the duplicates I want to fix, I often create a representative subset of the full table as a test table. I do this so that defining and testing the matching policy can execute faster. Just be careful when selecting a representative subset of data not to skew the analysis by accidentally selecting data that came only from a single data source and thus may not show the duplicates. For this reason, I would probably select all of the businesses whose names begin with a specific letter or two. I could also use business type (if that were a field in my data) to select only businesses of a specific type. Any way to narrow down the number of records will help the matching policy tests execute faster.

In the following screen, you see that I have selected the Business table from the Customer database in SQL Server. Note that DQS can also work with data from an Excel spreadsheet as the data source.

Matching02

Next, in the Mappings section of the screen, I need to select all the source columns that I want to use in my matching policy. Note that I must use at least one column and could use all of the columns in the source table. Realistically, however, a smaller subset of columns is used to define a matching policy than may be used during full data cleansing.

For each source column selected, I must select a Domain from my Knowledge base to use.  The way DQS uses the domain is to clean the selected column data against the knowledge base domain before attempting to match records.  By cleaning the data first, the matching process is more accurate.

Matching03

After selecting all the columns I want to use in the matching policy, I click Next in the lower right to go to the Matching Policy definition page.  Of course, no matching policy exists yet so I must click the Create a matching policy icon on the left side shown in the next figure to create a matching policy rule.  Note that there is only a single rule per matching policy.  After clicking the option to create a matching rule, the icon to add a matching rule will disable.

Matching05

By default, the Matching Rule gets the name Matching Rule 1.  However, you can change this name in the Rule Details section as shown below.  You can also provide a description for the matching rule.  While I usually add a description that defines my ‘strategy’ for the matching in this area, I will leave it blank for this illustration.  You can also select the minimum matching score for two records to be considered a match.  The default value of 80 is the minimum score that DQS allows.  However, you can set the score to a higher value.

On the right side of the figure, you can see that the Rule Editor is looking for at least one domain element for the rule.  You can add your first domain element by clicking the Add New Domain Element icon to the top right of this area.

Matching06

For this example, I chose to add Business State as my first domain element.  When defining matching rules, a domain can be either an exact match or a similar match.  An exact match, as implied by the name, must be spelled exactly the same in both records.  On the other hand, a similar match does not have to be spelled the same.  Microsoft uses an internal algorithm to assign a value from 0-100 to each set of matching values based on the similarity.  DQS uses this value to calculate the overall matching score.

One twist to the exact match is that it can be viewed two different ways.  You could say that an exact match is a prerequisite for the two records to be considered a match.  When Exact is used as a Prerequisite, if the two values are not exact, DQS does not even evaluate the other domains in the rule to see if they match because quite frankly, it does not matter.  On the other hand, if Exact is not used as a prerequisite, then the value for this domain comparison will be either 0 or 100, but DQS will continue to evaluate the other domains in the rule.  After adding all the domain match values together, if the weighted average is greater than the minimum matching score, the records can still be considered a match, even if the domain with the exact match requirement fails.
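
To make that last paragraph a bit more concrete, here is a small numeric sketch in Python (my own example, not DQS output). Assume five domains weighted 20% each, where the Exact domain is not a prerequisite and fails while the other four domains match perfectly:

# Sketch: an Exact (non-prerequisite) domain contributes either 0 or 100 to the weighted average.
weights = [0.2, 0.2, 0.2, 0.2, 0.2]   # five domains, 20% each
scores = [0, 100, 100, 100, 100]      # the exact-match domain failed, the other four matched
overall = sum(w * s for w, s in zip(weights, scores))
print(overall)                        # 80.0 - right at the default minimum matching score

Had that same domain been marked as a prerequisite, DQS would have stopped at the failed comparison and the pair could never be a match; without the prerequisite flag, the pair lands right at the 80% boundary, and a slightly higher minimum matching score would rule it out.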

Ok, I’m going to stop here for this week because that last paragraph may sound a bit confusing and I want to give you a few days to think about it so that it begins to make sense.  I’ll mention it again next week as we continue defining the matching policy and show how Exact can work two different ways.

C’ya next time.

Its Good To Be Regular!

It’s good to be a regular expression, that is. (What did you think I meant? 😉 ) Anyway, last week we talked about using patterns when defining domain rules. While pattern matching can solve some problems, it cannot solve all problems. For example, suppose you want a 4-character string that must begin with a letter between A and F. Pattern matching may help you look for letters in a string, but it cannot limit which characters are acceptable. (An ‘extreme’ case you may not have thought about since last week is that a character in a pattern can be a numeric digit or symbol, but a number in a pattern cannot be a non-numeric character.) Another good example is when I want the user to enter a hex code for a color. Hex codes range from 00 to FF.

The way to define a domain rule for these situations is not by using a pattern.  Rather, a regular expression lets you control the specific characters allowed at every position in a string and can be extremely flexible.  First, let’s look at some of the rules as they might apply to a specific example.

Let’s go back to that first case where I want a 4-character string that begins with a letter between A and F. I can begin the regular expression with the string: [A-F]. This would allow the string to begin with any character from ‘A’ through ‘F’. However, the strict definition of regular expressions would tell me that I really need to use [A-Fa-f] so that the user could enter either upper case or lower case letters. While that is true for application development using regular expressions to validate input, DQS treats the comparisons as case insensitive and so you can use [A-F], [a-f], or [A-Fa-f].

Note that the text within the closed bracket represents just one character position even though several characters may appear. If I wanted to validate against a non-sequential set of characters such as [ABEFMPST], that would be a valid way to ensure that the character in that position of the domain value is one of these eight letters.

If I want to allow most letters in the alphabet with the exception of only a few, I could specify the characters not allowed in the character position using an expression like [^IOQ].  This expression would allow any character except the letters ‘I’, ‘O’, or ‘Q’.  By itself, this would also allow numeric digits so I may want to use [^IOQ0-9] instead.

Everything I talked about so far only applies to the first character. In my example, I want the remaining three characters to be numbers. I could change my regular expression to: [A-F][0-9][0-9][0-9]. However, because the second through fourth characters are defined the same way, I can use the following expression to indicate that I want to use the same character definition for the next three characters: [A-F][0-9]{3}. The number 3 in the curly brackets indicates that the previous character expression should be repeated for three characters.

Interestingly enough, this regular expression would also find four consecutive characters in a larger string that began with a letter from A-F and was then followed by 3 digits. In fact, it would declare the following value to be valid: 45 Main St, Ste D104. You see, by itself, the regular expression is able to match characters anywhere within a string. If I want to force the expression to match string values that begin with a specific sequence, I must start the expression with the caret character as in: ^[A-F][0-9]{3}. With this string as the regular expression, the above address would not be considered a valid match.

What if I don’t know how many times a character definition needs to be repeated? I could use any of the following:

^[A-F][0-9]*    This allows for zero or more numbers after the letter

^[A-F][0-9]+    This allows for one or more numbers after the letter

^[A-F][0-9]?    This allows for zero or one number after the letter

^[A-F][0-9]{3,}   This allows for at least 3 numbers after the letter.

^[A-F][0-9]{3,6}   This allows for at least 3 but no more than 6 numbers after the letter.

So I might think that I could use ^[A-F][0-9]{3,3} to ensure that valid values begin with a letter followed by three and only three numbers. Unfortunately, it does not work like that, because nothing stops additional characters from following the three digits. Rather, there is another character that I can add to the end of an expression that says the string must end with the defined expression. That character is the dollar sign. Therefore, I could use ^[A-F][0-9]{3}$ to ensure that the string has only four characters and that those characters are a letter A-F followed by three digits.
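
If you want to experiment with these expressions before typing them into a DQS domain rule, any regular expression engine will do. Here is a quick sketch using Python's re module (my own test harness, not part of DQS) that shows the effect of the anchors:

import re

anchored = re.compile(r'^[A-F][0-9]{3}$')   # must start with A-F and end right after three digits
floating = re.compile(r'[A-F][0-9]{3}')     # matches the pattern anywhere in the string

print(bool(anchored.search('D104')))                  # True  - a letter A-F followed by three digits
print(bool(anchored.search('45 Main St, Ste D104')))  # False - extra text around the match
print(bool(floating.search('45 Main St, Ste D104')))  # True  - D104 is found inside the string

One caution: unlike DQS, most regular expression engines, including this one, are case sensitive by default, so ‘d104’ would not match here unless you add the lower case range or a case-insensitive flag.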

Let me say that what I have covered here is just the tip of the iceberg when it comes to regular expression capabilities.  There is much more that I can do with expressions.  However, my emphasis is to cover BI related topics such as PowerPivot, SSAS and DQS, not to go off on a multi-week tangent about regular expressions.  Therefore, I’m going to give you a few references to let you explore the richness of regular expressions on your own.

As I said, this has just been but a brief introduction into the world of regular expressions. There are many sites that will teach you how to build regular expressions.  A good place to start might be http://www.regular-expressions.info/tutorial.html.

In closing, you may want to download a free tool that will help you discover how regular expressions work. The tool is EditPad Pro 7, and it can be downloaded from http://www.editpadpro.com/download.html.

C’ya next time when I take a look at matching in DQS projects.

Mike

I Think I See A Pattern

Last week I mentioned a little about domain rules but other than using a few simple ones, I did not mention a lot of detail.  Well, just for the record, the following list gives you an idea of all of the different Boolean expression types I can build with a domain rule:

  • Length is equal to
  • Length is greater than or equal to
  • Length is less than or equal to
  • Value is equal to
  • Value is not equal to
  • Value contains
  • Value does not contain
  • Value begins with
  • Value ends with
  • Value is numeric
  • Value is date\time
  • Value is in
  • Value is not in
  • Value matches pattern
  • Value does not match pattern
  • Value contains pattern
  • Value does not contain pattern
  • Value matches regular expression
  • Value does not match a regular expression

Most of these are (or should be) self-explanatory.  However, this week I want to take a quick look at the pattern matching expressions.

First, these are not the same as building a regular expression; I will cover regular expressions in the future. Rather, a pattern is much simpler. In fact, there are only four basic rules to building a pattern. They are:

  • You can define a position in the string to represent a letter
  • You can define a position in the string to represent a number
  • You can hard-code a position in the string to be a specific special character such as a hyphen
  • You can define a position in the string to be an optional letter or number

For example, one of the patterns I might want to match is for a zip code in the United States:

PatternMatch01

(Note: I can use any numeric digit 0-9 to represent a number in any position of the string. Therefore, the pattern 00000 is the same as 12345, which is the same as 99999. They all merely represent a string of five digits.)

This pattern implies that the user must enter a string of 5 digits, not 4 and not 6.  Five digits and only five digits will be accepted.

This means that the following zip codes would not be accepted using this method:

1234

180922

1B7C4T

C4G3E1

The first example is only four digits and is short by one digit. The second example has 6 digits and has one too many. The remaining two examples include letters and numbers which, while valid in some countries, are not valid in the United States. To allow these I would need a pattern like:

PatternMatch02

However, some people may want to enter their zip plus 4 zip code.  To do this, I need to allow for the hyphen between the fifth and the sixth digit.  Therefore, I might use something like:

PatternMatch03

That pattern would be great if everyone entered their full zip plus 4 zip code.  However, not everyone knows what their full zip code is and while using the full zip code makes the sorting and delivery of mail more efficient, it is not yet required.  Unfortunately, if I specified this pattern for my zip code field domain rule, records with only the standard 5-digit zip code values would fail the rule.

To give the user the option of entering either their 5-digit zip code or their 9-digit zip code, I could do this:

PatternMatch04

Notice that the two separate clauses for the 5-digit zip code and the 9-digit zip code are joined with an ‘or’ connector. It should be obvious that an ‘and’ connector would not be appropriate in this case. Anyway, there is a simpler way to accomplish the same rule using a single clause by defining the last 5 characters (the hyphen and the 4 digits) as optional, as shown in the following figure.

PatternMatch05

As seen in this example, I can combine multiple character positions within parentheses to indicate that they are optional, and that includes special characters.
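
For those who want to try the same idea outside of DQS, the ‘5 digits, optionally followed by a hyphen and 4 more digits’ rule can be sketched with a regular expression (the more powerful cousin of patterns that I look at next time). This is only my own illustration in Python, not a DQS pattern:

import re

# Sketch only: five digits, with an optional hyphen-plus-four-digits suffix.
zip_rule = re.compile(r'^[0-9]{5}(-[0-9]{4})?$')

for value in ['12345', '12345-6789', '1234', '180922', '1B7C4T']:
    print(value, bool(zip_rule.match(value)))
# 12345 True, 12345-6789 True, 1234 False, 180922 False, 1B7C4T False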

While all of the above examples defined patterns for the entire string to match, I can also define patterns that the domain values must not match for the values to be valid.  Similarly, I can use pattern matching to match a portion of the string by using the value contains pattern option.  Note that values that begin or end with the matching pattern would be valid.  The pattern could even be in the middle of the string.   I can even evaluate strings to be valid if they do not contain a specific pattern anywhere within the string.

However simple pattern matching may be, it has its limitations. The biggest limitation is that I cannot restrict a position to a specific set of characters. For example, I cannot specify a 4-character string in which the first letter can only be ‘A’, ‘B’, ‘C’, or ‘D’. This type of comparison, and many more, requires the greater complexity of the regular expression clause, which I will look at next time.

Until then, c’ya!