Saturday, August 18, 2012

Data Quality - Running in the Badlands

I like to run a little. Actually, I like to run a lot. Back in the days when my work had me travelling most weeks, one of the upsides was the opportunity to regularly find new running routes in new cities. Most times this resulted in an enjoyable run and the chance for a bit of sightseeing. However, on occasion things didn't go quite so well. Several times I've felt very uncomfortable, and on one occasion I had genuine fears for my safety.

Three such times stand out for their similarities, although they occurred in different cities around the world - Las Vegas, Milan and Philadelphia. On each occasion I was speaking at a conference and so had a little time to kill, but not enough to run very far from my hotel. All of these cities promote themselves as tourist or conference destinations, and to the casual observer they fit this brief well - safe, busy, functional and (in two of the three cases) prosperous. At first glance it would seem that nothing is amiss and that the "service" they offer is well provided; they are "fit for purpose", if you like. However, in each city, less than six blocks from my hotel I found myself running through seriously rundown areas, areas with (in one case) chain-mesh fences and burning cars, or areas where I was harassed and threatened.

Now, I admit that 99% or more of tourists and other visitors to these cities will probably never see the areas I did, nor find themselves in a threatening situation. It strikes me that there are parallels with data quality in many business systems: the places where everyone goes - the areas that support the most common transactions and queries - have had the most obvious data quality issues sorted out. In most cases the people using these systems have few problems stemming from data quality issues, so over time the perception develops that data quality is not an issue. Even when the odd data quality problem does crop up, it's often dismissed as an outlier, or the pain of it is forgotten before it can be used as a driver for a data quality improvement initiative. Much like the cities I visited, all seems to be working smoothly as it should.

However, on my problem runs I definitely didn't get the result I was after - I didn't enjoy the sightseeing, and the training effect I had planned for didn't eventuate, as I either had to cut my run short or ran (away) at a faster pace than I had planned. Without doubt there was at least a loss of effectiveness and probably a failure to reach the desired outcome. The same occurs in our systems when data quality issues lurk just out of plain sight. When users venture into those areas, be it with ad-hoc requests or with infrequently run analytics, the outcome they seek may well be missed entirely or hampered by inefficiencies. These issues have knock-on effects too - perhaps someone reading this blog won't run next time they attend a conference in one of those cities, or may even choose another conference entirely, in much the same way users may choose not to wade through an analytic process in a certain area of data again, preferring instead to rely on gut feel or their own sets of numbers from desktop spreadsheets.

So should we as data quality or data governance practitioners act to rectify these less mainstream data quality issues? The answer is probably "it depends". For me, it's a cost-benefit question: if the impact of the problem is bigger than the cost to fix it, then there could be a strong business case to be built for a fix. The bigger issue, though, is not knowing that the problems exist in the first place, so perhaps it is incumbent on us to attempt to understand where these problems lie so that we, and our stakeholders, can best choose where to focus data quality improvement efforts. Interestingly, this very style of effort is actually in play at the hotel I stayed at in Philadelphia. Seeing me walk back into the lobby in my running gear, the hotel concierge asked how my run was and then offered me a pre-prepared map showing running routes and a very clearly marked border indicating the outer limits of where it was safe to run. I only wish they had advertised the existence of this data steward before I'd gone out running!

1 comment:

  1. You mention the biggest problem is: “not knowing the problem exists”. I wonder if the business understood the cost impact of the "problem" better would they be more focused on creating (=investing in) an environment/culture where people - if finding themselves down the wrong route - are inclined to mention this in a way that rectifies the problem for others? Having mechanisms, tools, or better still people (enter: data steward) to alert colleagues of the route to take? Is it a chicken and egg thing - which comes first? A cost benefit analysis or the data steward? Curious to know how to gauge the "cost of the problem" when it's not transparent (at least to me ... and I confess to knowing nothing about data quality) how to collect the financial impacts?