As the Pendulum Swings

When I first started with computers, they were huge monsters sitting in a closed-off, air-conditioned room which no one other than the chosen ones could enter.  You wrote programs on punch cards or paper tape and submitted your program deck through a window to a clerk, sort of like going to a bank.  You then waited, sometimes for hours, before a servant of the machine brought out a stack of paper with the results of your program.  You always feared that your paper stack would be thin because that often meant that your program had failed and you would have to go back through the card deck and find out where your logic went wrong.  You could spend a whole day, sometimes several days, in the dungeon room where they kept the card punch machines.  The noise of the machines punching out your cards was deafening.  However, we felt that we were doing something important, that we were powerful, that we were able to control the output of the machine by our commands.

Time passed and the technology evolved.  First there were CRT or green-screen terminals on which you could type and edit your program.  There were no graphics, no windows, no fancy editors.  On the other hand, we no longer had to fear dropping a deck of a couple hundred punch cards and having to re-sort them before we could continue working.  Life was great.  How could it get any better?

When the first microcomputers were introduced, they were more for the electronic hobbyist than for the serious application developer.  Using toggle switches to enter a program would never catch on for the average person.  But again the technology evolved and soon we had personal computers with a whopping 4K of memory that could be used to store and run programs.  Programs were stored on cassette tapes.  You know, those things you could buy albums on that looked like mini reel-to-reel tape decks.  This made it possible not only to store your program creation, but also to make copies on multiple cassettes and share or sell your creation to others with the same computer model.  But for most companies, these machines were still toys and not meant for serious work.

Over time, the toys became more powerful and soon software appeared to help you create and print documents.  This was a magnificent improvement over a standard typewriter.  If you needed to make a change to a document, you could simply go into the text, edit the document and reprint it.  This sure beat having to retype the entire document.  I remember our department secretary resisted the change at first because no one was going to take away her trusted typewriter.  However, our department head had other ideas.  Instead of dictating correspondence or writing it out longhand for her to enter, he gave her a floppy disk (yeah, that is what they were called because unlike a hard disk, they were kind of floppy) with the document.  Then after she printed it, he would purposefully make changes, forcing her to learn how to make those changes on the desktop computer.  Within a few weeks she saw the light and asked to have her old typewriter removed and the computer placed in the center of her desk.

Many years have passed and decentralized personal computers slowly replaced most of the mainframe computers and their slightly smaller cousins, the minicomputers.  A lot of people resisted the change, but the momentum was too strong.  As applications became more powerful and the need to share documents and data became more of a concern, network computing came into existence as a way to link all these separate personal computers together electronically and allow the sharing of their information.  This probably marked the first full pass of the pendulum: from large centralized computing to small personal computing and back toward a more collective approach.

Over the subsequent years, networking became more robust and the use of specialized servers to hold documents and databases, and even to facilitate communication, grew.  It was soon not uncommon to see companies with hundreds of networked computers with centralized file stores called file shares and centralized data called databases.  But early networking only worked within a company's walls.  There was no sharing with the outside world.  Yet the need to share data with customers and suppliers forced a new wave of innovation to solve this problem.  Modems and remote access to other machines soon grew in use, but it was not enough.

One day a new technology started to be talked about.  It was a way for companies to share information without dedicated connections to others.  It used something called the Internet, a communication backbone that anyone could, in theory, access to send information to others.  Some early companies provided services that gave people instant access to news and stock information, and even let them communicate with others having a similar connection to the company.  While they were successful for several years, the Internet was bigger than they were and soon took on a life of its own.

Today access to the Internet seems like an inalienable right.  Some younger children have never known a day when there was no Internet or, for that matter, hundreds of TV stations not only on cable, but directly available through that Internet.  At the same time, the servers at companies that hold documents and data proliferated until they started to take almost as much room and require almost as much infrastructure in terms of power and cooling as the original large mainframe computers.  In fact, many of these servers are more powerful than the mainframe computers of just a few decades ago.

Now the change is to move everything to the cloud.  The cloud has cast a magical spell on some people, like the Pied Piper.  They think the cloud is limitless and that they will never have to worry about their data; someone else will.  There are all different types of clouds with a variety of different services to support their customers, but the main thrust is to move data out of the corporation to a 'trusted' box somewhere else so that the company no longer has to support the infrastructure.  Furthermore, with the push for hand-held mobile devices, the emphasis is to place the bulk of the computing on these servers as well and merely use the hand-held devices to display the data.

This sounds to me like we are heading back to the days of big centralized computing.  But it will only last so long.  One of the major concerns that could start the pendulum swinging back the other way again is security.  With data from not just one company, but potentially hundreds of companies in a single site, the temptation to hack and steal that data becomes irresistible to some.  It may only take a couple of major breaches until people begin questioning the wisdom of these centralized cloud services.  It was only the other week that the IRS 'accidentally' released thousands of Social Security numbers on the Internet.  Oops!  When the government starts collecting all of the health care data for the entire nation in its databases, what makes you think for a second that your health data will be any better protected?  It will make the Sarbanes–Oxley Act of 2002 look like a joke.  Besides, concentration of that much data, just like the concentration of power, is not a wise thing.  Someone will always be out there looking for a way to exploit that concentration.

So will the pendulum start to swing back again to personal devices?  Will tomorrow's hand-held devices, whatever they look like, be more powerful than today's servers?  Will it be common for individuals to walk around with terabytes of information on their personal devices?  Will the computing capability of these devices, aided by voice control and artificial intelligence to create new program solutions, dominate the next pendulum swing?  Or will personal computing devices go away entirely and be replaced with publicly available interfaces that can access your information from anywhere and be able to perform research and develop applications that anyone can create by simple voice requests?

What do you think?  C’ya next time.

Create a Matching Policy – 1

No, this is not an on-line dating service.  Rather, I am trying to analyze a table with duplicate records.  These duplicate records can be just as bad for my pivot table analysis as data with bad values.  Earlier this year I showed the basics of how to use Data Quality Services in SQL Server 2012 to create Knowledge Bases that could then be used to clean other data tables.  However, cleaning bad data out of my analysis tables does not by itself guarantee good results.  Another potential problem is duplicate data.

Duplicate data can come about for a variety of reasons.  Someone may load the same table more than once into a master table.  Depending on the way indexes are defined in the master table, an error may not be thrown.  How does that happen if I have a primary key?  Perhaps my master table generates a unique surrogate key for each record added to the table rather than using the existing primary keys.  This technique is often used when adding data from multiple data sources in which the unique primary key in those data sources may no longer be unique across all of the combined data.  Therefore, I cannot use the primary key from the individual data sources in the master table and instead generate a unique surrogate key.
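
To see why a surrogate key lets duplicates slip in, here is a minimal T-SQL sketch.  The table and column names (dbo.BusinessMaster, BusinessName, BusinessAddress) are hypothetical, not taken from any real system; the point is simply that a duplicate check has to group on the business columns, because the surrogate key is unique even on reloaded rows:

```sql
-- Hypothetical master table: an IDENTITY surrogate key means reloading the
-- same source rows creates new keys instead of a primary key violation.
SELECT  BusinessName,
        BusinessAddress,
        COUNT(*) AS CopiesLoaded
FROM    dbo.BusinessMaster
GROUP BY BusinessName, BusinessAddress
HAVING  COUNT(*) > 1;   -- any row returned is a suspected duplicate
```

Of course, a query like this only catches rows that are spelled exactly the same.  Near duplicates such as 'Acme Inc' and 'Acme, Inc.' sail right past a GROUP BY, and that is where a DQS matching policy earns its keep.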

Another reason might be that I am pulling data from a legacy sales system that recorded the customer’s address with every purchase.  If any one customer made two or more purchases, their address data would appear multiple times.  In the new system, I want to maintain a central customer table to store the customer’s information once.  Therefore, I have to identify the duplicate customer addresses and save only one address per customer.

I might also have a table of businesses who have contributed to my charity.  Over the years, I may have received donations from the same business multiple times.  Therefore, there are multiple records in the contributions table, one for each year.  Now I want to consolidate the businesses that I have received donations from so that they appear only once in the final table.
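
When the duplicate rows really are identical, spelled the same in every column, a plain query can do that consolidation; the hard cases are the ones DQS is for.  Here is a sketch of the simple case, assuming a hypothetical dbo.Contributions table with the donor's name and address repeated on every donation row (the table and column names are mine, not from a real schema):

```sql
-- Keep one representative row per donor.  ROW_NUMBER() picks an arbitrary
-- "first" copy when the donor columns are spelled identically.
WITH RankedDonors AS
(
    SELECT  DonorName,
            DonorAddress,
            DonorCity,
            DonorState,
            ROW_NUMBER() OVER (PARTITION BY DonorName, DonorAddress,
                                            DonorCity, DonorState
                               ORDER BY DonationDate) AS RowNum
    FROM    dbo.Contributions
)
SELECT  DonorName, DonorAddress, DonorCity, DonorState
FROM    RankedDonors
WHERE   RowNum = 1;
```

The moment 'Acme Inc.' and 'ACME, Incorporated' show up as two separate rows, a query like this happily keeps both, which is exactly the situation the matching policy below is meant to handle.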

Let’s assume that I previously loaded a master business table and I have already gone through the DQS process to create a clean (or as clean as reasonably possible) knowledge base.  The next step is to open the DQS Client and click on the arrow to the right of my Knowledge Base (Business Master in this case) to open the popup menu shown in the next figure.

[Figure: Matching01]

The next step is to select a table to use to define and test a matching policy.  This table can be the original table used by the knowledge discovery step to create the original knowledge base.  Alternately, it can be the table in which I want to search for duplicates.  Even if I opt to use the table that contains the duplicates that I want to fix, I often create a representative subset of the full table as a test table.  I do this so that defining and testing the matching policy can execute faster.  Just be careful when selecting a representative subset of data not to skew the analysis by accidentally selecting data that came only from a single data source and thus may not show the duplicates.  For this reason, I would probably select all of the businesses whose names begin with a specific letter or two.  I could also use business type (if that were a field in my data) to select only businesses of a specific type.  Any way to narrow down the number of records will help testing the matching policy execute faster.
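
As an example of that kind of subset, here is one way to build a test table in T-SQL.  The Business table in the Customer database matches the screen shot below, but the BusinessName column and the Business_MatchTest target table are assumptions on my part:

```sql
-- Build a smaller test table for tuning the matching policy: every business
-- whose name starts with A or B, regardless of which source it came from.
SELECT  *
INTO    Customer.dbo.Business_MatchTest   -- hypothetical test table
FROM    Customer.dbo.Business
WHERE   BusinessName LIKE 'A%'
   OR   BusinessName LIKE 'B%';
```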

In the following screen, you see that I have selected the Business table from the Customer database in SQL Server.  Note that DQS can also work with data from an Excel spreadsheet as the data source.

[Figure: Matching02]

Next, in the Mappings section of the screen, I need to select all the source columns that I want to use in my matching policy.  Note that I must use at least one column and could use all of the columns in the source table.  Realistically, however, a smaller subset of columns is used to define a matching policy than may be used during full data cleansing.

For each source column selected, I must select a Domain from my Knowledge Base to use.  DQS uses the domain to clean the selected column data against the knowledge base before attempting to match records.  By cleaning the data first, the matching process is more accurate.

[Figure: Matching03]

After selecting all the columns I want to use in the matching policy, I click Next in the lower right to go to the Matching Policy definition page.  Of course, no matching policy exists yet, so I must click the Create a matching policy icon on the left side, shown in the next figure, to create a matching policy rule.  Note that there is only a single rule per matching policy.  After clicking the option to create a matching rule, the icon to add a matching rule is disabled.

[Figure: Matching05]

By default, the Matching Rule gets the name Matching Rule 1.  However, you can change this name in the Rule Details section as shown below.  You can also provide a description for the matching rule.  While I usually add a description that defines my ‘strategy’ for the matching in this area, I will leave it blank for this illustration.  You can also select the minimum matching score for two records to be considered a match.  The default value of 80 is the minimum score that DQS allows.  However, you can set the score to a higher value.

On the right side of the figure, you can see that the Rule Editor is looking for at least one domain element for the rule.  You can add your first domain element by clicking the Add New Domain Element icon to the top right of this area.

[Figure: Matching06]

For this example, I chose to add Business State as my first domain element.  When defining matching rules, a domain can be either an exact match or a similar match.  An exact match, as implied by the name, must be spelled exactly the same in both records.  On the other hand, a similar match does not have to be spelled the same.  Microsoft uses an internal algorithm to assign a value from 0 to 100 to each pair of matching values based on their similarity.  DQS uses this value to calculate the overall matching score.

One twist to the exact match is that it can be viewed in two different ways.  You could say that an exact match is a prerequisite for the two records to be considered a match.  When Exact is used as a prerequisite, if the two values are not exact, DQS does not even evaluate the other domains in the rule to see if they match because, quite frankly, it does not matter.  On the other hand, if Exact is not used as a prerequisite, then the value for this domain comparison will be either 0 or 100, but DQS will continue to evaluate the other domains in the rule.  After adding all of the weighted domain values together, if the weighted average is greater than the minimum matching score, the records can still be considered a match, even if the domain with the exact match requirement fails.
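
To make that concrete, here is a toy calculation.  The domains, their weights, and the similarity values are all made up for illustration (DQS computes the similarity scores internally); the sketch only shows how the weighted arithmetic can still clear the threshold when a non-prerequisite Exact domain scores 0:

```sql
-- Assumed rule: Business State = Exact (not a prerequisite), weight 10
--               Business Name  = Similar,                    weight 60
--               Business City  = Similar,                    weight 30
DECLARE @StateScore int = 0;    -- 'FL' vs 'GA': exact comparison fails, so 0
DECLARE @NameScore  int = 95;   -- 'Acme Inc' vs 'Acme, Inc.': very similar
DECLARE @CityScore  int = 90;   -- 'Orlando' vs 'Orlando, FL': fairly similar

SELECT (@StateScore * 10 + @NameScore * 60 + @CityScore * 30) / 100.0
       AS MatchingScore;        -- 84.0: at or above 80, so still a match

-- Had Business State been flagged as a Prerequisite instead, DQS would have
-- stopped at the failed exact comparison and never scored Name or City.
```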

Ok, I'm going to stop here for this week because that discussion of the Exact match may sound a bit confusing and I want to give you a few days to think about it so that it begins to make sense.  I'll mention it again next week as we continue defining the matching policy and show how Exact can work in two different ways.

C’ya next time.

Getting Back Into the Groove

I first want to apologize to my regular readers for being 'absent' for much of the last two months or so.  The last six months have been some of the hardest, yet some of the happiest (for very different reasons), of my life.  The hard part, as those who know the situation are aware, took up most of the first half of the year, but the happiest part has been seeing my daughter move from a residency program at a local VA hospital to being a full staff member as of July.  Her success after going through all the same hard times as me gives me hope.  So for the last month we have been spending time looking for a new apartment for her, looking for furniture, packing what she wanted to move over for now, carrying stuff over in cars every night, unpacking it into her new place, and even experiencing the 'joy' of building IKEA furniture.

She didn’t move far.  Just across town.  Far enough to have her own life, yet close enough for visits.  I don’t know if she truly realizes how proud I am of her accomplishments, but I am.  I don’t know if we were just lucky or if Sue and I (mostly Sue probably) did something right in raising her, but for the life of me I cannot figure out what that may have been.  So don’t ask for advice.

Anyway, I suppose it is time to start over and to start writing again.  After all, the house is empty now except for my cockatiel and before her cheeps start to make sense, I think it is time I actually start typing real words.  So, over the past week I’ve picked up paper and a pen and started to write down several ideas and I’m sure some of them will eventually develop into blog entries.  For today however, I just want to leave you with some random thoughts that will probably never make it to a full blog entry.   (By the way, the bird is sitting on my shoulder right now watching everything I type so I have to be careful what I say.)

I've noticed in the newspaper lately that a lot more motorcycle accident reports mention that the injured riders were not wearing helmets.  Did you ever wonder why the government is so concerned about whether we automobile drivers have a seat belt on while we are driving, and will even fine us several hundred dollars if they stop us without one, yet motorcycle riders don't have to wear any head protection?  Seems odd.  Maybe it is a Darwin thing and we just want to thin out the population of those whose heads are too thick to be injured in an accident where they lose control.

I’ve also noticed that every major storm, hurricane, tornado, dry spell, heat spell, etc. has been linked by the media to global warming.  (Some people even think that Sharknado was a real documentary and was caused by global warming.  Some of these same people also ride motorcycles without helmets.) They act as if extreme weather never occurred before they discovered global warming.  While there may be some truth to the connection, their level of conviction that they are right and that everyone who does not agree with them is wrong or perhaps stupid seems to put people off who might otherwise at least consider the possible connection.

Currently the big story is the Royal Baby in England.  I will grant you that for those people living in England or originally from England, that is probably a very important story.  But really, this baby (I don’t think he has been given a name yet or at least I have not heard it) is only something like third in line for the throne.  On the other hand, that is probably a lot closer than you or I will ever be.  Anyway, Cheers! to England.  At least they have some good news in their media for a few days making it worth watching the BBC.

I guess Earth missed getting hit by an asteroid the other day.  It was supposed to be between 200 and 400 feet long.  Let's say something around the size of a football field.  Unless it totally broke up or burned up during entry into our atmosphere, I suspect it would have made quite a dent in your car if it fell onto it.  The amazing thing was that it was only discovered a few days before its closest approach.  Yet we are being told by astronomers and the government that they have mapped over 90% of the Earth-crossing asteroids and would know well in advance of any potential problems.  No wonder they didn't notice Clark Kent's spaceship during that meteor storm.  I guess those few extra percent can really make a difference, especially if the asteroid targets your local corn field.

Finally, I leave you with this thought.  I've noticed that average employees who leave a company to go into consulting become experts overnight in whatever field they are talking about, as long as they charge more than $200 per hour and travel at least 500 miles to the client site.  Similarly, I've seen consultants get hired by a company for their expertise in some technical area and, overnight in the new job, become dumber than a door nail (whatever that is), with opinions not worth a wooden nickel.  I guess it just goes to show that knowledge in any area is fleeting.

This Saturday, I plan to pick back up with a technical article on Data Quality Services where I left off months ago showing how to create a matching policy to find duplicate records.

Till then, c’ya.

Is the System Broke or Can It Be Fixed?

It strikes me as funny how similar IT can be to national issues.  Sure, they are different in concept, but both have many problems that are similar.  Take, for example, any application that has been around for more than a few years.  It probably has been patched and added to several times as the needs and wants of the organization have changed.  Over time these patches make following the code increasingly difficult.  The term 'spaghetti code' began as a way to visualize how complex the flow paths have become after repeated changes, making it hard to follow any one path from the beginning to the end.  Eventually a point is reached at which the application needs to be rewritten to straighten out the code (the spaghetti), clear out unused code, simplify the data flow, and generally make the system more user and developer friendly.  So how does this apply to national issues?

Recently the IRS has been caught in several scandals.  A few weeks ago you probably heard that the IRS targeted conservative organizations.  While it is true that tax-exempt status, which is only supposed to be used for educational and social welfare groups, may have been misapplied to some of the rapidly growing number of 501(c)(4) groups formed primarily to promote political points of view, it seems that the keywords used to identify these groups definitely leaned toward conservative groups more often than not.

That was bad enough.  However, today I woke up to reports on MSN that the IRS 'accidentally' published thousands of Social Security numbers online.  If I were conspiracy-theory minded, I might ask whether there was any pattern to the numbers that were 'leaked'.  But let's not go there for now.  The fact that this was allowed to happen is bad enough, no matter what else, if anything, may have been behind it.

Some other news items related to the IRS include the accusation, also on MSN, that the IRS sent $46 million to 23,994 'unauthorized' aliens at 1 address in Atlanta.  (Would it have made a difference if they were 'authorized' aliens?)  It would seem that sending that many refunds to one address, unauthorized alien or not, would raise a flag for someone.  Supposedly this 'error' was found by the Treasury Inspector General for Tax Administration.

It was also reported on MSN that the IRS handed out $70M in bonuses.  Now, it did not say specifically who got those bonuses, but it appears to be related to some type of union contract with the IRS.  Now if this were phrased as 'cost of living' increases, I don't suppose anyone would raise an eyebrow, but bonuses?  I work for the government too (state, not federal), and we had not gotten a raise in over 4 years, much less a bonus, until this year, and that raise does not even cover the increase in our insurance premiums and other cost of living expenses.  So how do IRS employees get bonuses, especially when federal government agencies have been directed by the administration to cut discretionary spending?

In the final report for today from MSN, the IRS appears to enjoy rather liberal travel expenses.  It has been reported that they spent $50 million on hotel suites, dance classes (I suppose this is to allow the auditors to dance around the letter of the law) and baseball games.  We have not been able to go to any job-related conferences.  In fact, last year when I spoke at the SQL PASS Summit conference, I had to pay all of my own expenses (those not covered by the conference) even though I was a speaker at that conference.

Ok, $160 million is not going to fix the federal deficit in and of itself, but come on: $46 million here, another $70 million there, $50 million somewhere else, and a couple of million we haven't found yet, and before you know it, we'd be talking about some real money.  Can the system be fixed or should it be replaced with either the Flat Tax or Fair Tax alternative?  Coming from a computer programming background, we learn very early on that sometimes it just makes more sense to scrap the current system and create a new one because the old system has too many patches, changes, leftover dead code, etc.  Sound a little like the tax code?  Sure, transitions are painful to some, but are we better off with a new system rather than trying to add more patches on top of an already overly patched system?

Please consider that this is not a question of eliminating all taxes (although that may come later).  Rather, it raises the question of whether the system is really broken and needs to be 'rewritten'.

C'ya next time.

Governance: To Be or Not To Be

It has been a long time since I’ve written any posts about SharePoint so I want to take this opportunity this week to ask you a single question about your SharePoint site.  Do you have a governance plan in effect that has been approved and backed by upper management?

Governance is one of those tricky terms that can mean different things to different people and unless you get everyone in the room to agree with your definition at least as long as the meeting lasts, you probably won’t get your point across.  For example, some people in the room might think that governance only relates to project decisions.  Perhaps this is the result of books like IT Governance by Peter Weill and Jeanne W. Ross.  A good book, but it focuses on how to make decisions that will ultimately lead to appropriate management and use of IT, not on how to implement SharePoint or any other tool.  It looks at who within the organization should make decisions, how they should make decisions and how to monitor the results of those decisions.

Not that those things are not important to management, but to the people in the trenches, especially the SharePoint trenches, governance is more like the topics covered in the 'Governance Guide for Office SharePoint Server' from Microsoft or any of these books:

  • Practical SharePoint 2013 Governance by Steve Goodyear
  • The SharePoint Governance Manifesto (http://bit.ly/SPGovManifesto)
  • Essential SharePoint 2010: Overview, Governance, and Planning by Scott Jamison
  • Microsoft SharePoint 2013: Planning for Adoption and Governance by Geoff Evelyn

While these book references are just a sampling, they suggest that governance, to the staff responsible for maintaining your corporate SharePoint sites, is more concerned with topics such as:

  • The design of site templates to provide a consistent user experience
  • Quotas to keep SharePoint from becoming a trash dump of every file that ever existed.
  • Locks to control who has rights to add, modify, or delete content, or even to view it.
  • Workflows to approve changes to pages and documents, to automate forms, and to create simple data collection applications rather than using a programming language.
  • Who can create sites and who can delete them
  • A system of records management to catalog files stored within SharePoint and to remove aged files when they are no longer needed
  • Content types to define what type of data can be stored in SharePoint
  • Content approval to determine what actually gets saved
  • Versioning to allow tracking of changes and the ability to roll back changes when necessary
  • Content appearance such as font families, styles, and sizes, colors, page layouts and other physical attributes of the pages.

Ultimately, I suppose the definition of governance requires some governance.

But even if you manage to create a governance document with all of the above rules and guidelines, your next challenge will be how to implement that plan and how to get all the people involved to follow the rules.  The more people you have contributing to the content of your site, the more difficult this challenge becomes.  That is, unless everyone in the organization knows that the SharePoint Governance document has been approved by the very top of your corporate management.

But even that may not be enough unless there is a way to ensure compliance with the governance.  Rules that are created, but not enforced are merely suggestions.  It will not take long until the common look and feel that you originally planned for is lost and chaos fills the gap.

Unfortunately, many SharePoint projects fail when governance is treated as a platitude or a wicked problem (http://bit.ly/WickedProblem).  Governance can fail when SharePoint is so huge that no one wants to be responsible for all of it, and perhaps no one has an interest in being responsible for all of it anyway.  That is because most organizations are filled with people whose divergent thinking typically centers around their core responsibilities.  This problem can only be reined in by a central core governance committee that has the power to create the governance document and enforce it.

There are also some people who feel that SharePoint should not just be treated like another tool that the IT department has brought in-house and then thrust onto everyone.  Rather, SharePoint should be looked at as a 'Change Project' that will change the way people work in an organization, presumably to become more efficient and productive by providing:

  • Internal and external web sites that are easier to navigate than the previous sites.
  • Collaboration platforms to increase communication between team members, letting them exchange ideas and providing a group knowledge base for teams and projects
  • Document repositories that can be searched allowing users to quickly find information without having to search through folders within folders within folders.

While I cannot teach you everything you need to know about SharePoint governance in a single blog post, I tried to provide you with at least a few references to get you started finding out why SharePoint without governance at your organization may be the reason that SharePoint is wobbling more than a tightrope walker crossing the Grand Canyon in a light breeze.

C’ya next time.