Approaches to Improving Data Quality
- Why is data quality a problem?
- How to measure and improve data quality
Approaches to improving data quality
There are several approaches we can take to improve data quality. These include:
- Validate
- Constrain
- Reconcile
- Document
- Certify
- Track
- Visualise
- Have a golden source
- Agree ownership
- Automate
- Adjust
Let’s look at each of these in turn.
Note: we won't cover data protection (anonymise, secure, audit/log) here.
Visualise
A good visualisation often reveals unanticipated data issues that pre-defined rules don't catch.
With modern tools, exploratory visualisation is quick and easy.
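As a minimal sketch, a quick exploratory plot in Python (the file and column names here are hypothetical) can expose spikes, gaps and outliers that no pre-defined rule would have caught:

```python
# A minimal sketch of exploratory visualisation; file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("trades.csv", parse_dates=["trade_date"])

# A simple scatter of amounts over time: look for spikes, gaps and clusters
plt.scatter(df["trade_date"], df["amount"], alpha=0.3, s=10)
plt.title("Trade amounts over time")
plt.xlabel("trade date")
plt.ylabel("amount")
plt.show()
```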
Validate
Check against business rules, e.g.:
- Within an expected range of values?
- Any missing values?
- Arriving within a certain time?
Inspect individual rows and totals - both counts and amounts.
Flag failures with exception reporting and alerts.
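A minimal sketch of rule-based validation with pandas; the file name, columns, thresholds and cut-off time are hypothetical:

```python
# A minimal sketch of rule-based validation; names, thresholds and cut-off are hypothetical.
import pandas as pd

df = pd.read_csv("trades.csv", parse_dates=["arrival_time"])

rules = {
    "amount out of range":   ~df["amount"].between(0, 1_000_000),
    "missing counterparty":  df["counterparty"].isna(),
    "arrived after cut-off": df["arrival_time"].dt.hour >= 18,
}

# Exception report: one row per failing record, labelled with the broken rule
exceptions = pd.concat(df[mask].assign(rule=name) for name, mask in rules.items())

print(f"{len(df)} rows checked, {len(exceptions)} exceptions")
print(exceptions["rule"].value_counts())   # counts per rule; could also trigger an alert
```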
Certify
- Allow people to publish datasets
- Certify those datasets that meet quality standards
- Certified datasets have a tick mark in the list
Agree ownership
- A particular group (or person) has responsibility for maintaining the quality of a dataset
- Example from finance: market data, instrument static data
- Part of the CDO function? Data steward role?
- Many companies promote culture of “Data is our greatest asset”
- Some datasets need to be independently verified, e.g. prices of obscure financial instruments
Constrain / Enforce
- Apply rules that only allow valid data to enter the ‘data warehouse’
- Relational databases do this well:
- Data types
- No missing values
- Uniqueness constraints
- Lists of allowed values, e.g. direction must be one of North, South, East, West
- Referential integrity between tables
Often worth flagging a ‘missing’ value as:
- Unexpectedly absent
- Not applicable for this product / category / class
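A minimal sketch using SQLite from Python to illustrate these constraints; the tables and columns are hypothetical:

```python
# A minimal sketch of database constraints; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite only enforces foreign keys when this is on

conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE shipments (
        shipment_id INTEGER PRIMARY KEY,                  -- uniqueness
        product_id  INTEGER NOT NULL                      -- no missing values
                    REFERENCES products(product_id),      -- referential integrity
        quantity    INTEGER NOT NULL CHECK (quantity > 0),
        direction   TEXT CHECK (direction IN ('North', 'South', 'East', 'West'))  -- allowed values
    )
""")

# Invalid rows are rejected rather than silently loaded
try:
    conn.execute("INSERT INTO shipments VALUES (1, 99, -5, 'Up')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```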
Reconcile
- Compare current results against a previous day, a benchmark, or the source (raw) datasets
- Compare counts and amounts
- Set a tolerance e.g. 2% for counts, 10% for amounts
- Significant differences are called breaks
- Have a process to find and fix the root cause of each break
Often an automated daily process with sign-offs
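A minimal sketch of such a daily reconciliation in pandas; the file names and columns are hypothetical, and the tolerances are those mentioned above:

```python
# A minimal sketch of a daily reconciliation; file and column names are hypothetical.
import pandas as pd

def summarise(df):
    # Counts and amounts per category
    return df.groupby("category").agg(count=("trade_id", "size"),
                                      amount=("amount", "sum"))

today = summarise(pd.read_csv("positions_today.csv"))
prev  = summarise(pd.read_csv("positions_yesterday.csv"))

recon = today.join(prev, lsuffix="_today", rsuffix="_prev")
recon["count_diff_pct"]  = (recon["count_today"]  / recon["count_prev"]  - 1).abs() * 100
recon["amount_diff_pct"] = (recon["amount_today"] / recon["amount_prev"] - 1).abs() * 100

# A break is any category outside tolerance: 2% for counts, 10% for amounts
breaks = recon[(recon["count_diff_pct"] > 2) | (recon["amount_diff_pct"] > 10)]
print(breaks)   # feeds the find-and-fix / sign-off process
```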
Adjust
Adjustments are a fact of life.
Ensure a properly followed and documented process, for example:
- Two independent people sign off
- Adjustments must have an expiry date
- The reason for the adjustment must be stated
A process to fix the underlying causes must be in place.
Make adjustment comments available on the final dashboard.
Results based on adjusted data should be differentiated, e.g. shown in a different colour.
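As a minimal sketch, these rules can be enforced in code; the class and field names are hypothetical:

```python
# A minimal sketch of an adjustment record enforcing the rules above; names are hypothetical.
from dataclasses import dataclass
from datetime import date

@dataclass
class Adjustment:
    target: str                     # the figure being adjusted
    old_value: float
    new_value: float
    reason: str                     # the reason for the adjustment must be stated
    approvers: tuple[str, str]      # two independent sign-offs
    expiry: date                    # adjustments must have an expiry date

    def __post_init__(self):
        if not self.reason.strip():
            raise ValueError("a reason must be stated")
        if len(set(self.approvers)) < 2:
            raise ValueError("two independent approvers are required")
        if self.expiry <= date.today():
            raise ValueError("the expiry date must be in the future")
```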
Document
Describe the meaning and semantics of the data, and the process followed to transform it.
Open datasets are often supplied with metadata.
'Reproducible research' approach – provide the code used to transform the data, e.g. as a 'notebook', so others can repeat and extend the analysis.
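A minimal sketch of recording metadata alongside a transformed dataset; the file names and fields are hypothetical:

```python
# A minimal sketch of writing metadata alongside the output; all names are hypothetical.
import json
from datetime import date

metadata = {
    "dataset": "monthly_sales_clean.csv",
    "description": "Monthly sales, deduplicated and converted to GBP",
    "source": "raw_sales.csv",
    "transform_code": "clean_sales.ipynb",   # the notebook that produced it
    "produced_on": date.today().isoformat(),
    "columns": {
        "month": "first day of the month (ISO date)",
        "region": "sales region code",
        "amount_gbp": "total sales in GBP",
    },
}

with open("monthly_sales_clean.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```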
Track (apply data lineage)
Assign a unique tag to each item of source data that accompanies it through the data journey.
Aggregating data loses the tags, but aggregation is often the last stage of the journey.
Drill-through capabilities on visualisations are good for listing all the tags behind a suspicious aggregated value.
Often useful in the reconciliation process.
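A minimal sketch of lineage tagging with pandas; the file, columns and the 'suspicious' region are hypothetical:

```python
# A minimal sketch of lineage tagging; file, columns and region are hypothetical.
import pandas as pd

df = pd.read_csv("transactions.csv")
df["tag"] = "TX" + df.index.astype(str)          # unique tag per source row

agg = df.groupby("region").agg(
    total_amount=("amount", "sum"),
    source_tags=("tag", list),                   # lineage survives the aggregation
)

# Drill through: list the source rows behind a suspicious aggregated value
# ("North" is just an example region)
suspicious_tags = agg.loc["North", "source_tags"]
print(df[df["tag"].isin(suspicious_tags)])
```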
Golden Source – Single Version of the Truth
- Certain data architectures have a single instance of every value or calculation
- In theory, no possibility of getting conflicting values
- Often known as ‘cubes’
- Cons: Can become large and complex for the business user to navigate.
Automate
- Theory: manual processes are error-prone (and worse) and need to be removed from any data journey
- Similar approach adopted by large organisations: replace all end-user computing (EUC) with strategic systems
- Citizen data scientists may argue otherwise
BCBS 239: Data Aggregation & Reporting
- Governance
- Aggregation
  - Integrity & accuracy
  - Drill down
- Reporting
  - Comprehensive
  - Accurate
  - Clear
  - Timely
  - Actionable
  - Right audience