New Research: Similars Attract! Really?

A recent article in the Los Angeles Times from a research group at Binghamton University Binghamton N.Y. reported substantial evidence that the company we keep in high school affects everything from our happiness, weight, grades, and other factors.  What I find interesting is that this conclusion should not come as a surprise to anyone.  Other than close neighbors that we may have grown up with, most people tend to associate with others having similar interests and goals in life.  In fact, the researchers did go so far as to say that most people have best friends with very similar grades and habits.  I’m thinking why would anyone think that kids would have as their best friend someone who they were jealous of or maybe despised or hated?

However, the interesting part was the conclusion that a person with grades lower than the average of a circle of friends they associated with tended to improve their grades over the course of the study time.  It also was true that the smartest members of the group tended to drop in their grades or class ranks slightly to bring them down to the group average.  Sort of makes me think that if you really want to succeed, you need to start hanging around with others who have succeeded.  But then you might also start to think that if you are already at the top of your class in whatever, that associating with a group that was shall we say not as gifted may not be in your best interests.

Where else have we seen this type of behavior?  Is it not better to buy the lowest price house in a good neighborhood than the highest price house in a poor neighborhood even if both houses cost the same amount?  Do you feel like you got more of a bargain by buying an item on sale at that high priced boutique than paying the regular price for the same item at a discount store even if both prices were exactly the same?

Ok, maybe I am getting a bit off topic here.  The point I want to make is that don’t we all tend to associate with others that are similar to us?  If we want to improve something about ourselves, don’t we tend to look for groups that have the trait(s) we want?  If we want to become healthier or lose weight, do we not associate ourselves with others who exercise more or eat healthier?  If we want to be better at playing a sport, do we not try to play with or even against others that we recognize as better than us in that sport?  Competition can lead to motivation, and motivation can lead to success.  So if we want to do better in school or even at work, shouldn’t we try to hang out with smart kids rather than mock them or try to shove them into their lockers?

This does not mean that family and environment are not factors in success.  They certainly set the stage for whether you can succeed or not.  Becoming the next great basketball star may not be possible if your parents won’t let you play sports.  Similarly, no matter how smart you are, you might not become the next Steve Jobs or Bill Gates if you don’t have a family environment that lets you pursue your dreams and instead requires you to go to work after high school to help support the family.  Rather, what the report does seem to say is that you can make incremental improvements to whatever trait you are trying to change by carefully selecting with whom you associate.

On the other hand, don’t be afraid to explore other friends and activities that may be on or beyond the fringe of what your norm is.  While they may not all lead you to the ultimate goal you want to achieve, they will help round out your personality and give you a better understanding of other people and the activities that are important to them.  How much has your social network reflected this concept?  Has the growth of on-line digital communities or social network sites changed this?  Probably not.  So if you want to improve something about yourself, find a social networking site that will challenge you to reach your goal.

BTW, This is my 200th post.  Two years ago I never thought I would get this far.

C’ya.

DQS – Cleaning a Data File With Your New DQS Domain

With my initial customer domain created, I am ready to start cleaning customer names in Excel files or SQL tables.  To begin, I open the DQS client home screen and select the option: New Data Quality Project from the middle group of options as shown in the following figure.

DQS028

 

For a new project, I need to name the project and optionally provide a description to explain what the project will accomplish.  I then specify the name of the Knowledge Base I want to use, in this case: Customers.  DQS can now show the available domains in the panel on the right side of the page as shown in the following figure.  For my activity, I select Cleaning because I want to use my selected domain to clean the values in my project source table.

DQS030

 

After clicking Next, I need to select my data source and then map the source column to an existing domain.  This screen is similar to the one used to discover the domain values that I covered two weeks ago.  As with data discovery, cleansing can work against either SQL Server data or an Excel spreadsheet.  In the case of an Excel spreadsheet, I must specify the location of the Excel file.  For a SQL Server source, I need to specify the database and the table.  Note that the database must be on the DQS Server itself.

Next, I need to define the column mappings beginning with the Source Column from my data source and associating it with a domain from my selected Knowledge Base.  I can do this with multiple source columns each mapping to its own knowledge base domain.  However, each domain can only be mapped to a single source column within a project, but you could have multiple projects to map a single domain to multiple source columns, one column per project.

 

DQS030b

 

Clicking Next to move to the next page of the project dialog, I can begin the data cleansing by clicking the Start button.

DQS031

After a relatively small amount of time, the Cleanse page displays the results of the processing including the number of records processed, the number of correct values from the source, the number of corrected values, the number of suggestions and the number of invalid records.  DQS uses an algorithm to determine how confident it is of the correction.  This means that it sometimes can correct values that do not appear in the domain, but are similar enough that DQS thinks it may know how to correct the value.  I can review the suggested changes as well as corrections on a subsequent screen.

DQS032

 

Clicking Next takes me to the Manage and View results screen.  On this screen, I can use the tabs in the right panel to see the suggested changes and decide whether to allow the change or to reject it.  I can also see the number of new values and what those values are.  (I may want to add the new values to my domain for future cleansing operations.)  I can also see grids of the invalid values as well as corrected values and even all of the correct records that did not need to be updated.

DQS033

Using each tab, I can review the changes made to the data and even substitute an updated Correct to value if necessary.  After reviewing the results, I click the Next button one last time to display the Export page.  On this page I begin by selecting the Destination.  Again like readying source data I only have two choices currently, SQL Server and Excel.  If I choose Excel as shown in the following figure, I need to specify the URL of the Excel file that DQS creates.  By selecting only the option button Data Only, DQS replaces the existing values in the selected Excel column with new cleansed values.  The option to export data and cleansing info adds columns to the output Excel spreadsheet to show before and after values as well as additional information about whether the data was already correct or corrected and the confidence level of any corrections.

Of course, clicking Export to complete the process creates the output Excel spreadsheet as shown in the following figure and the Finish button ends the project activity.

DQS034

So now you can create a basic cleansing domain and use it to clean data sources consisting of spreadsheets and SQL tables.  But DQS can do more than just cleanse data.  In future weeks, I will return to DQS to show more features that you can use to ‘Clean Up Your Data’ so that BI analysis, reports, cubes, and pivot tables are more accurate and concise.

 

C’ya next time.

It Is Just Too Hard … NOT!

I saw my first glimpse of Windows 8 over six months ago at a local SharePoint Saturday event and shortly after that downloaded and installed the beta on my laptop to start getting familiar with it.  I cannot tell you that there were no challenges.  The first thing I noticed was that the original beta version would not work with my wireless.  However, that was a known issue.  I also had trouble finding where some features were located such as the Control Panel and basic screen settings.  Fortunately I knew from the demo that the Start button was gone and that you could easily get to the desktop by clicking the Desktop tile on the new interface or moving the cursor to the lower left corner to display a popup image of the desktop.  I later found that the keyboard Windows button also toggled me to and from the desktop that has become my method of choice since then.  My biggest frustration though was attempting to shut down the machine that first night.  Indeed, it took me an embarrassing around of time to find the power off option (without just flipping the power switch), but I did learn where other things were in the process.

Would I say that the process was too hard?  No more so than using any other new device.  I am sure we have all gone through the process of learning our way around our new smart phone.  Some of us may have even repeated this exercise a few times using Blackberries, IOS, and Android devices.  Did we stop using the new phone because it was too hard or did we just focus on the learning task and before we knew it, we became proficient with our new device?  We also have learned how to program our coffee pots to have a hot cup of coffee ready for us when we wake up in the morning.  We also learn how to use the radio/CD player in our new car and even how to program our VCR/DVR boxes (or at least some of us no longer have a blinking 12:00).

True, there are some people who do not or will not adapt to change.  Back in the last century when I was doing mostly FoxPro work, I lost touch with several friends when Microsoft introduced an object oriented version of FoxPro, FoxPro 3.0.  They had trouble converting to an object approach to programming with its focus on properties, methods, and events.  But the rest of us adapted.  Some of us even had the foresight (or luck) to gain some object experience a few years earlier by playing with the early versions of Microsoft Access.  (Ok, not true object oriented, but object-like).

Another good example, and a more recent one, was when Microsoft Office switched from their earlier hierarchy menu structure for commands to use a ribbon interface that only surfaces the features that you need (or even make sense) when you need them.  A lot of people complained about that change too, but I suspect few if any today would willingly go back to the older menu command structure.

There are some people who say that software companies like Microsoft keep changing their software so we have to pay for upgrades.  More likely, the answer is that developing a good user interface is more the combination of experience learned from each iteration and need to get something out to the public rather than to keep waiting until they develop the absolute best software ever, an ideal that no one ever reaches.  Limitations of software and hardware as well as our own human limitations to inventiveness lends itself to an iterative approach for most things.

Another way to look at things is that as hardware and software capabilities increase making previous tasks easier, users demand more functionality that continues to push software design to its limits.  The first version of dBase after all consisted of nothing more than just a dot prompt, something that was not always obvious when looking at the screen.  We would never accept that today.  However, the transition from that simple dot prompt to today’s modern user interfaces was if nothing else a series of steps, not all of which were always well accepted (remember Microsoft Bob?)  Evolution takes time.  Occasionally a product that comes out that may be revolutionary, so different and so much better and more useful than anything before that it sets the world on fire.  Some might point to the iPad as an example.  However, even here there has been an evolution of tablets and other handheld devices including the Apple Newton that all contributed their pros and cons to what eventually became a resounding success.

And did anyone say the iPad was too hard to learn or too different from the standard Macintosh interface of its day?  Perhaps it is more of a question of people’s desire to learn something new and their perceived view of the benefits of that new technology.  In addition, some people like to explore the boundaries of their knowledge and challenge their creativity.  Some see learning new technology as ‘fun’.  In some ways, this desire to learn new things is like the desire of explorers of past centuries who wanted to see what was over the next mountain, beyond the desert’s barren landscape, or on the other side of that body of water that they could not see across.

Or maybe we just do not do a good job at training people in new technology.  Recently I spent a lot of time in hospitals with my wife.  The amount of technology in health care has exploded.  In fact, one nurse said that much of her training to become a registered nurse was spent learning how to work with all the technology.  And yes, these people do learn the technology, technology that we might say is ‘too hard’.  However, the training is there for them and so is the desire or goal of becoming a registered nurse.

So when you go out to buy Windows 8, what training is included?  Who shows you how to navigate around the new screens?  Unless you are really into technology, where is the incentive if all you want to do is to get out to Facebook or check your email?  The answer and problem is that you have to hunt for it on your own be searching the Internet or looking for local training classes.  So maybe the problem is not that the technology is too hard.  After all, you do not have to be an auto mechanic to learn how to drive.  Rather that the training is either hard to find, too expense, or does not seem relevant to why you need to use a computer in the first place.  Ultimately, I maintain that nothing is too hard with the proper training and the time to use that training to develop new knowledge and skills.  Think about that until next time.

C’ya.

DQS – Managing Your Domain Values

Last week I showed how to build a new domain for my knowledge base, but the domain contained a lot of similar values.  In some cases the customer name was ordered as first name then last name.  In others, the last name was first followed by the first name.  Some names consisted of only the person’s last name.  Others included a salutation like Mr., Ms., Miss, or Mrs.  A few names included commas and many had obvious misspellings.

When I first began, it was difficult to pick a version of a name that might be correct.  As you can see in the following figure, many of the names have multiple variations.  One way to get started is to designate one version as being correct (leaving the green check mark) and changing all the other versions to the Type: Error (a red ‘X’).  You can see and modify the value type by using the dropdown in the Type column.  Remember from last week, that when you designate a value as an error, you need to specify a value in the fourth column to tell DQS what to change the current domain value to when it is found.

I started reviewing the name list and was immediately faced with some uncertainty about whether the customer’s name was Cicily Coney or Coney Cicily.  Because alternate values can be spread throughout the domain that initially sorts the data by the original values, it is usually impossible to see all variations of a name at once.  I initially made a guess that the name of one customer was Coney Cicily.  To do this, I selected one of the variations and changed the Type to Error and then entered Coney Cicily in the Correct to column.  Since this value did not exist in the domain, DQS created a new domain record with this value.  However, also because this value did not exist before, the Frequency column shows a value of zero (0).  Often this is a hint that I may be on the wrong track because I would expect to find at least once occurrence of the correct customer name in the table or spreadsheet.  However, it is possible that the correct name was never entered previously.

In the figure below, you can see the new domain value for Coney Cicily has been added and you can also see two synonyms that have been updated to use this value as the Correct to value.

 DQS020

It is also possible to add a new domain value without going through the process of correcting an existing domain value.  Click the icon as highlighted in the following figure to add a new domain record to the end of the grid table and enter the new value in the first column.   In this figure, I added a domain value for the name: Amanda Patrick.  In both cases, new domain values are automatically assigned the Type: Correct as shown by the green check.

DQS021

If you make a mistake as I did with Coney Cicily, you can fix the mistake by right clicking on the row that defines the domain value and that is currently marked as correct.  Then select Delete from the popup menu.  By deleting the domain value marked as correct, DQS automatically deletes this value from the Correct to column of any synonym row in which it appears.  Only domain values defined as correct can appear in the Correct to column.  I can then add a new domain record for Cicily Coney and use this new value in the Correct to column for all the synonyms.

The process of manually identifying the correct values for the domain and then assigning one of the correct values to each of the other synonyms is time-consuming.  But when you are done, the domain list will group synonyms together.  I can then use the icons above the grid to expand and collapse the synonym list.  By collapsing the list to show only the unique domain values as shown in the following figure, I can also see using the Frequency column the number of times that each domain value actually appears in the source data.

DQS022

By clicking the expand icon, I display the synonyms associated with each domain value and again the Frequency column identifies the number of times that synonym appears in the source data.

DQS023

After I finish reviewing all domain values and either designate them as correct or if they are an synonym, identifying them as and error with a Correct to value I can click the Finish button in the bottom right corner of the dialog.

I can also save my intermediate corrections perhaps so I can go home for the night and continue tomorrow by clicking the Close button in the bottom right corner of the screen.  On the main screen of the DQS client in the Knowledge Base Management section, you will see the name of the knowledge base along with the step it is currently in within parentheses.  In addition, the icon to the left of the knowledge base will display a gold lock indicating that the knowledge base is currently locked for editing.  When you want to return to the finish managing the domain values, open the DQS client main screen and click on the link for the knowledge base and DQS returns you directly to the Manage Domain Values page to pick up where you left off.

DQS026

When I am done managing the domain values and click Finish, DSQ displays a dialog asking if I want to publish the data to the Knowledge Base.  Since I do, I click the Publish button.  When I click the Publish button, I see the dialog shown below telling me that the Knowledge base has been successfully updated.

DQS027

At this point, I can use the published domain to clean other data.  However, I can also re-open the domain to add new values from a new data source or even manually add more domain values.

Next time, I will show how to use this new domain to clean data from an Excel file.

C’ya later.

Knowing When It Is Time To Start Over

Perhaps you have been in IT for a long time or perhaps you are just getting started.  In either case, one of the hardest things to learn is knowing when to start over.  If you have been in the business for any reasonable amount of time, you probably had the misfortune to take over someone else’s project after they left the organization.  Maybe they left on their own.  Maybe they did not.  Maybe the application was written by a consultant who came in and just took the money and ran as soon as the last line of code was committed. Maybe the code has been all yours from the very beginning but due to a lack of planning, the software tends to meander around with a lot of excess code, obsolete routines, and poorly written code.  Maybe it is not your fault.  Maybe the client could not make up their mind about what they wanted and had you re-writing the same modules over and over with different changes.  Maybe the code was well written in a language that is no longer available or multiple versions behind the current release and even though you have the source code, there is no way to compile this old code using modern language compilers without substantial changes.

All of these scenarios and probably more should make you stop and at least consider whether it is a good time to abandon the old system and start anew.  This is often not an easy decision especially when management uses the argument that they cannot afford to abandon the old system because they stuck so much money into that system, that they would look foolish to just walk away from it.  However, they may look just as foolish if they continue to throw good money after a system that  may not be worth salvaging in the first place.  The point I want you to consider is that any decision should not be based on what has happened in the past, on how much money was spent on a system that is now floundering, or on personalities that may have been involved in that project.  The decision must be based solely on the current value of all future costs of each possible project path into the future.

Suppose you had a program that was costing you $12,000 a year to maintain this year, but the cost of contracted labor to support that program is expected to go up by 5% a year for the 10 years of the project’s expected life.  Suppose that instead of maintaining the existing application, you could totally rewrite the application for $100,000 and that it would be ready in 1 year.  You would still have to contract for support for the old program for the first year for $12,000.  However, your expected support cost for the new application would be only $2,000 per year with an expected annual increase in cost of only 2.5% because the new system is written in a very common program language for which there is a large number of developers available for hire.  Which project would you choose?

One way to determine the answer to this problem is to calculate the workflows for some amount of time.  The following table shows a project with an expected 10 year life.

 Project 1

 Project 2

Year

 Annual Cost  Annual Cost

1

 $     12,000.00  $   112,000.00

2

 $       12,600.00  $          2,000.00

3

 $     13,230.00  $        2,050.00

4

 $       13,891.50  $          2,101.25

5

 $     14,586.08  $        2,153.78

6

 $       15,315.38  $          2,207.63

7

 $     16,081.15  $        2,262.82

8

 $       16,885.21  $          2,319.39

9

 $     17,729.47  $        2,377.37

10

 $       18,615.94  $          2,436.81

You could just add up the cost for the first 10 years to see which has the lowest overall cost.  That would be ok if your money had no other potential for gain.  However, that probably is not true.  At the very least, you invest in some money markets or government guaranteed bonds to earn at least 3%.  Therefore what you really need to do is to discount those future costs using the opportunity cost of money (what you could earn from some other investment).   This is called calculating the present value of each annual cost value.  Summing these up over the life of the project results in the net present value (NPV) of the cash flow stream.

Now if we had a project that was making money, we would probably look which project option would result in the highest NPV.  However, because we are dealing with costs displayed as positive numbers, we want the project with the smallest NPV cost.  (Of course, to be proper, we probably should display all costs as negative values.  In that case, we still would want the largest value or the closest to zero when both projects have a negative NPV.)

The following table shows the addition of a Present Value column for each project along with a sum of all the present values at the bottom of the table.

 Project 1

 Project 2

Year

 Annual Cost  Present Value  Annual Cost  Present Value

1

 $     12,000.00  $          12,000.00  $   112,000.00  $       112,000.00

2

 $       12,600.00  $            12,233.01  $          2,000.00  $              1,941.75

3

 $     13,230.00  $          12,470.54  $        2,050.00  $            1,932.32

4

 $       13,891.50  $            12,712.69  $          2,101.25  $              1,922.94

5

 $     14,586.08  $          12,959.54  $        2,153.78  $            1,913.61

6

 $       15,315.38  $            13,211.18  $          2,207.63  $              1,904.32

7

 $     16,081.15  $          13,467.71  $        2,262.82  $            1,895.07

8

 $       16,885.21  $            13,729.22  $          2,319.39  $              1,885.87

9

 $     17,729.47  $          13,995.80  $        2,377.37  $            1,876.72

10

 $       18,615.94  $            14,267.57  $          2,436.81  $              1,867.61
 $       131,047.26  $       129,140.21

In this case, it actually will cost us less over a ten year period if we spend $100,000 today to rewrite the application and thus reduce the annual support costs than to continue to support the old system at its higher annual cost.

While this was a simple exercise to show you the basic concepts of how to determine when or if it is time to throw out an existing application and begin a new one, the concepts apply no matter now complex the cost structure may be.  The key is to not get distracted by the money spent in the past on an application.  That cost is not recoverable no matter what you do.  In economic terms, that is called a Sunk Cost.  Therefore, keep your eyes on what is important, the future and let only the future costs and benefits guide your decision.

Why is this the Tuesday rant?  Too many times I have read stories of companies and management who become so caught up in the money spent in the past for a project, that they refuse to consider the possibility that it may be time to cut their losses and try something else.  As a programmer, deciding to start over is almost like admitting defeat which is definitely not the nature of most programmers.  Do you know any cases where starting over can help you or your organization?

C’ya next time.

DQS – Building Your First Knowledge Base Domain

This week I want to take a look at how I can clean a column of names in an Excel spreadsheet using Data Quality Services (DQS).  The fact that I am going to clean an Excel column instead of a SQL Server table column has little effect on the procedure or steps other than defining the source and destination files.  I have a lot of ground to cover so let’s get started.

First, when I open the client component of DQS, I need to define specify the server where I installed the Data Quality Server.  Last week I explained how to create the DQS server.  Because I installed the server and client components on the same machine, my server name is: LOCALHOST.

DQS010

On the main page of the client tool, there are three panels.  I showed last week how to open the built-in domains included with DQS.  However, this week, I need to create a new domain with data from my Excel spreadsheet.  To do this, click the button: New Knowledge Base in the first panel.

DQS011

This action opens the dialog shown below.  You can have a separate knowledge base for each domain or you can have several domains in the same knowledge base.  Which way is best?  If you need multiple domains to validate different columns in the same database or even the same table, you may want to put these domains in a single common knowledge base.  On the other hand, if you have a domain that can be used in different tables perhaps in different databases, you may want to store those domains in a separate knowledge base.  The good news is the choice is yours, but for this week, it will not matter.  I am going to name my first knowledge base Customers because it will contain domains used to validate customer data.  Descriptions for your knowledge base are optional, but I do recommend them.

Since I am going to use the actual data from an existing spreadsheet to build my knowledge base domain, I select Knowledge Discovery from the Select Activity on the bottom right of the dialog before clicking Next.

DQS012

On the next screen of the dialog, I can select the Data Source I want to use to discover my domain values.  At this time, DQS only supports two data sources: SQL Server and Excel.  For today’s example, I will use an Excel file.  Next, I need to either enter the full filename of the Excel file or I can use the Browse button to find the file on my computer or a connected network drive.

Note: This version of DQS does not support using .xlsx format Excel files.  It only supports .xls format files.  If I need to begin from an .xlsx file, I can first load it into Excel and then save it back to a new .xls file which DQS can use.

After l select the Excel file, I need to create a domain that will hold my customer name values that I pull from the Excel file.  In the area immediately above the Mappings grid there are a series of icons.  The third icon from the left is the Create a domain icon as shown in the following figure.

DQS014

The Create Domain dialog requires a name for the domain.  The obvious name here is CustomerName.  I can also include a description of the domain that could include addition information about either the source of unique values for the domain and/or where the domain can be used.  The Data Type for the values in the domain is typically defined by the data source but it also determines where I can use the domain.  In this case, the following figure shows that I identify customer names with the string data type.

There are some additional parameters such as leading values and normalization of the string which are optional.  The Use Leading Values option allows you to define synonyms for a value and then replace the synonym with that value.  That is what I am going to do here so I check that checkbox.  The Normalize String option simply replaces special characters with spaces which should not matter in my case, but I will leave this checkbox checked anyway.

The Format Output To option allows you to force the capitalization to all uppercase, all lowercase, capitalization, or none.  Because I am dealing with names, I have chosen Capitalization.

Of course my language is English and I also have enabled the Spell checker .  However, I did not select the Disable Syntax Error Algorithms checkbox.  This option is useful when checking complex id values.

DQS015

After creating a domain to store my knowledge base values, I need to select the column from the Excel spreadsheet that I will use to fill that domain.  In my case, the name of that column is Customer Name.  Now with the mapping between my source column and my domain complete, I can click the Next button.  However, be aware that you can map more than one column to individual domains to clean each of the columns at one time.

DQS016a

The following figure shows the next screen in the dialog that starts the data discovery process by clicking the Start button.  At this point, the statistics information at the bottom of the screen will be empty.  However, after I click the Start button I will see the statistics being updated live as the data from the source is read into DQS.

DQS017a

The following image shows the results after finishing the data discovery task.  In the Source Statistics table and grid, you can see that 1341 records were read resulting in 1341 new values and 274 unique values.  Note that in this case, I am loading my Excel data into a new domain, so all the records are new values by definition.  However, if I were to come back later to this domain with a new Excel spreadsheet that included both new and old records, the number of read records would not be equal to the number of new records.  The fact that the number of unique records is far less than the number of new records indicates that many of the unique record values appear multiple times.

DQS018

Now that I have loaded my data, I can click Next to manage the domain values I loaded.  In the next figure, you can see a small sample of the data added to my new domain.  Each unique value in the domain appears once along with a count of the frequency of that value.  This grid also contains columns to identify the value type which as we will see can either be correct, error, or invalid and a column for a corrected value.

First let me just say that the difference between an error value and an invalid value is that an error value, represented by a red ‘X’, can be corrected and should have a corrected value in the fourth column.  An invalid value, represented by a yellow triangle, is a value for which there is no correction.  A good example of an invalid value in my names table would be a record in which the customer name was a number.  Unless I have a table that associates a customer id with a customer and somehow this information appears in my source data instead, a number value typically cannot be converted to a customer and is therefore invalid.

DQS019

Next time I will cover more details about how to fine-tune my domain so that I can use it with other data sources to ‘fix’ invalid names.

C’ya later.

February is History Month

February is designated as Black History month.  Isn’t that just a bit prejudice?  No, I am not saying the issue is as simple as being black or white.  What I am asking though is what about all the other groups that make up our society?  What about American Indian History Month? Asian History Month? Latino History Month? Jewish History Month? Arab History Month? …

I could go on, but I hope you get the point.  It should not be about any one racial, ethnic or religious group because they all have contributed to our shared history.  Some contributions have been positive and some have been negative.  However they are all important.  We should look at history as a large social experiment that we can learn from and improve upon.  John Lennon said, ‘Life is what happens to you while you’re busy making other plans.’  Isn’t that history?

Not all of history is pretty. In fact, much of it is pretty ugly.  People sometimes have a tendency to tune out when things get bad.  This often allows things to get even worse.  I think that is largely true of today’s general political apathy because too many think there is nothing they can do to fix the problems in this country much less the world.  However ignoring problems do not make them go away.  More likely, it encourages people with ‘evil’ intentions or self-centered narcissism to seize control and fashion history to meet their needs and goals, not necessarily what is best for the general population, individual interest groups or much less you and I.  Studying all history helps you see trends, both good and bad and helps you better prepare to guide the future rather than to let life just happen to you.  Your plans may not always work, but at least you tried.  That takes effort and it can be hard.  John Lennon also said, ‘Living is easy with eyes closed, misunderstanding all you see.’  The moral is, don’t walk around with your eyes closed.

I’m recommending that we all open our eyes to what is happening around us.  Study history to see what it can tell you about trends and patterns you see repeating today.  Listen to news from a variety of sources because each one is biased in their own way.  That includes news from outside of your country.  I personally recommend the BBC as one outside view of things happening in United States, but there are others.  Keep in mind that today’s current events are tomorrow’s history.

I am sitting here looking at a painting on the other side of the room.  Every color in that painting contributes to the overall beauty of the picture.  The same is true of every race, culture, religious group, etc. in the picture of the human race.  Therefore, let’s make February a true history month, not just about one race, as great as their contributions have been, but a multi-cultural history month.

C’ya next time.

Putting Quality Into Your Data – Part 2

Part 1 can be found at: https://sharepointmike.wordpress.com/2013/01/26/

If you previously installed SQL Server 2012 on a test machine but did not specifically select Data Quality Services (DQS) (the server component) from the Database Engine Services group or the Data Quality Client Integration Services and SQL Server Data Tools from the Shared Features group, you need to go back to your install disk to include these three components on the feature selection page of the 2012 Installation wizard.  You probably should also include Management Tools – Complete.  These can be installed on different machines, but I am going to assume that your test environment, like mine, is all on the same machine.

Before you can run Data Quality Client, you need to run the installer.  You should find the installer in the Microsoft SQL Server 2012 group of your Start menu in the Data Quality Services folder.  The installer name is cleverly named Data Quality Server Installer.  The installer will run in a Command window.  It first prompts for a password for the Database Master Key.  This password must be at least 8 characters long and contain a mix of upper and lower case characters, numbers and special characters.  After entering the password, you must re-enter it to confirm it.  Upon completion of the install, you should see the message: Press any key to continue.

Next you have to define Data Quality Service Roles for users through SQL Server Management Studio.  For test purposes, you at least need a role for yourself.  In the Security folder of the database server create a new login or use your existing one if present.  You must use Windows authentication for this login account.  Next select User Mapping from the Login Properties dialog.  Select the database DQS_MAIN and set the role to dqs.administrator.  Then close the dialog.

You can now make your data available for DQS Operations. You can do this by copying your data to the DQS_STAGIONG_DATA database.  To do this you need an account with db_datawriter, db_datareader, and db_ddladmin rights on this database.  However, you can also simply create a copy of your database in the same server instance as the Data Quality Server which in this case is my test server.  Because I am running both Data Quality Server and Data Quality Client on the same machine, I do not have to worry about installing and configuring TCP/IP Network Protocol to remotely access the Data Quality Server.

I can now start running the Data Quality Client from the Microsoft SQL Server 2012 group of the Start menu selecting localhost as my server.  After connecting to the Data Quality Server, I will see the Data Quality Client home screen shown below.  Now the fun can begin.

DQS001

Obviously, the Data Quality Client has three major sections; Knowledge Base Manager, Data Quality Projects, and Administration.  These are arranged left to right in the order that most people use them.  In other words, to begin you must have a knowledge base against which you will evaluate your data.  Out-of-the-box, SQL Server 2012 includes a sample knowledge base that includes eight domains which can be used to correct and validate data.  These domains include the following:

Country/Region – This domain includes various abbreviations for the countries of the world including the international three-digit country identifier also known as the ISO 3166-1 numeric code.

DQS002

DQS003

It also includes the 2-character ISO 3166-1 alpha-2 code and the 3-character ISO 3166 alpha-3 code.  It each of these cases, it translates the value to the full country name.

Country/Region (three-letter leading) – This domain corrects data to the ISO 3166-1 alpha-3 code.

Country/Region (two-letter leading) – This domain corrects data to the ISO 3166-1 alpha-2 code.

US Counties – This domain lists all counties in United States (including parishes, municipalities, boroughs, etc.)

US Last Name – This domain consists of over 151,000 last names which occurred 100 times or more in the 2000 census.

US Places – This domain includes the names of places within the 50 states of United States, District of Columbia, and Puerto Rico pulled from the 2010 census.

US – States – This domain allows you to convert state abbreviations to display the full state name.

US – State (2-letter leading) – This domain allows you to convert state names to display the two-character state abbreviation.

You can also create new knowledge databases, a topic I will cover at a later time.

The Data Quality Project section allows you to create and manage your data quality projects.  This is where you will go to clean your data and perform matching.  DQS will also make suggestions in this section to correct your data.  Specifically, Data Cleansing uses rules to analyze the data and make changes to the data.  Data Matching can be used to group similar records to help you identify and remove duplicate records.

The Administration section is where you can monitor your data quality activities, configure the server, and reference data providers.

That’s all for this week, Until next time, set up your Data Quality Server and Data Quality Client so we can begin using and creating knowledge databases and set up our first data quality project.

C’ya next time.