Sunday, February 5, 2012

The Data Quality Accord - Initial Thoughts


Poor data quality hurts the business: it drives up costs and introduces unwanted risk. The cost is tangible (if you measure it) and hits the bottom line. The risk, however, is much harder to quantify in terms of its effect on the business. With data growing faster than the business itself [rubinworldwide], this problem is not likely to go away anytime soon.

The costs of poor data quality include the cost of reacting to the presence of bad data, whether by providing ad-hoc workarounds, taking responsibility for the after-effects, or by fixing the problem (at source, of course). These costs can be measured in terms of delays, lost opportunities, the time and money needed to resolve the consequences, and the cost of implementing a remedy.

The risk aspect of poor data quality, which is the real subject of this entry, may involve reputational, operational, regulatory and legal risk. Risk is commonly defined as the product of probability and outcome: the higher the impact, the more you would like to control and reduce the chances that something bad might happen. Ideally, the business makes an informed decision as to how much of a certain risk it wants to undertake, so that it operates within managed boundaries that still leave room for opportunity and flexibility. Take a food market stall, for example. The merchant knows, more or less, how much produce he might sell. If he offers too little, some customers will end up disappointed, but he will not waste any product. On the other hand, if he brings too much, he is likely to be left with unwanted produce. The merchant will therefore use his experience (amongst other means) to decide how much produce to offer.
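
To make that concrete, here is a small sketch of the merchant's choice as an expected-value calculation. All of the numbers are made up purely for illustration:

```python
# Illustrative sketch: the merchant's stocking decision as an expected-value problem.
# All numbers here are invented for the example.

demand_probabilities = {10: 0.2, 20: 0.5, 30: 0.3}  # chance of selling N crates on a day

price_per_crate = 8.0  # revenue per crate sold
cost_per_crate = 5.0   # cost per crate brought to the stall

def expected_profit(crates_brought: int) -> float:
    """Expected profit for a given stocking decision."""
    total = 0.0
    for demand, probability in demand_probabilities.items():
        sold = min(crates_brought, demand)
        total += probability * (sold * price_per_crate - crates_brought * cost_per_crate)
    return total

# The merchant picks the stocking level with the best expected outcome
best = max(demand_probabilities, key=expected_profit)
print(best, expected_profit(best))
```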

Financial institutions make similar decisions about the exposure they are willing to take towards certain clients and industries. The Basel Accord is one of the main tools used to evaluate the amount of financial risk an institution takes on and, more importantly, the level of safety it needs to employ. In practice, regulators prescribe the amount of capital these institutions need to set aside for a "rainy day". It becomes very complex, very quickly, but in a nutshell: the amount set aside is the product of the money likely to go unrecovered should borrowers fail to deliver on their obligations (default) and a function of the parameters that affect the chances of this risk materializing. In other words: impact × probability.
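
To make the "impact × probability" idea concrete, here is a deliberately simplified sketch along the lines of the Basel expected-loss calculation (probability of default × loss given default × exposure at default). The figures are invented and, as said, the real accord is far more involved:

```python
# Simplified illustration of the Basel-style expected-loss idea (impact x probability).
# The figures below are invented; the actual accord is far more involved.

def expected_loss(pd: float, lgd: float, ead: float) -> float:
    """Expected loss = probability of default x loss given default x exposure at default."""
    return pd * lgd * ead

# A borrower with a 2% chance of default, 45% of the exposure lost if they do default,
# and 1,000,000 outstanding:
print(expected_loss(pd=0.02, lgd=0.45, ead=1_000_000))  # 9000.0
```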

Here is the argument when it comes to data quality: it stands to reason that every company, not only financial ones, needs some sort of safety margin against the impact of poor data quality. It is true that the Basel Accord looks at certain aspects of data quality, but the risk is not limited to the financial industry. Think, for example, about the impact of poor-quality medical information and its possible consequences... I think enough said!

A Data Quality Accord would define the level of provisions a business needs to employ to protect itself against the (un)likely(?) outcomes of poor data. Borrowing this model of quantifying risk, there are two aspects to consider. Firstly, what is the potential impact of poor data quality and how well can the business respond to occurrences of poor data ("Loss Given DQ")? Secondly, how do we evaluate the probability of the risk materializing (represented by a probability function which we can refer to as "f")? The first can be addressed by analysing the various scenarios of what would happen if a machine is misaligned, a doctor is misinformed, or a business partner provides or receives poor data. The second dimension, quantifying the probability, is a much more complex and interesting problem.
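
As a rough sketch of the first aspect, scenario analysis could be rolled up into a single "Loss Given DQ" figure, for example as a weighted average of estimated scenario impacts. The scenarios, weights and amounts below are purely hypothetical:

```python
# Hypothetical sketch of rolling scenario analysis up into a "Loss Given DQ" figure.
# Scenario names, weights and impact estimates are invented for illustration only.

scenarios = [
    # (scenario, relative weight, estimated impact if poor data reaches it)
    ("machine misaligned by bad reference data", 0.2, 50_000),
    ("decision maker misinformed by a wrong report", 0.5, 120_000),
    ("partner provides or receives poor data", 0.3, 80_000),
]

# Loss Given DQ as the weighted average impact across the scenarios considered
loss_given_dq = sum(weight * impact for _, weight, impact in scenarios)
print(f"Loss Given DQ estimate: {loss_given_dq:,.0f}")
```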

What affects the chances that poor data will bring negative outcomes to the business? Taking this question on a logical journey, here are some thoughts:

1. Impact from poor data requires poor data to be present (trivial). So the first parameter in the probability function is how likely it is that the data is poor (let "P" represent the presence of poor data).
2. The presence of poor data does not necessarily mean it will lead to undesired outcomes. The second parameter is therefore the level of protection the business has in place to deflect the effects of poor data quality (let "DG" represent this level of protection. Why DG? Because it reflects the level of successfully applied Data Governance and the implied data management functions).
3. These aspects need to be weighed against the number of instances and the frequency at which the poor data might be used (let "data life" represent the places where the data might be used and might lead to a negative impact on the business).

Therefore I propose that Data Quality Capital be defined as:

DQ Capital = (Loss Given DQ) × f(DG, P, data life)
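
To show how the pieces might fit together, here is a minimal sketch of this formula. The shape of f and every input value are assumptions I am making purely for illustration (a naive f that grows with P and data life and shrinks as DG improves); they are not part of the proposal itself:

```python
# Minimal sketch of the proposed DQ Capital formula. The shape of f and every
# input value below are illustrative assumptions, not a prescribed model.

def f(dg: float, p: float, data_life: int) -> float:
    """A naive probability function: grows with the presence of poor data (p, 0..1)
    and the number of places the data is used (data_life), and shrinks as the level
    of applied data governance (dg, 0..1) improves."""
    per_use_chance = p * (1.0 - dg)
    # Chance that at least one of the data_life uses is affected
    return 1.0 - (1.0 - per_use_chance) ** data_life

def dq_capital(loss_given_dq: float, dg: float, p: float, data_life: int) -> float:
    """DQ Capital = (Loss Given DQ) x f(DG, P, data life)."""
    return loss_given_dq * f(dg, p, data_life)

# Example with invented figures: a 94,000 Loss Given DQ estimate, 5% poor data,
# 70% effective governance, and data reused in 12 places.
print(f"{dq_capital(94_000, dg=0.7, p=0.05, data_life=12):,.0f}")
```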