Approaches to Improving Data Quality
- Why is data quality a problem?
- How to measure and improve data quality
Approaches to improving data quality
There are several approaches we can take to improve data quality. These include:
- Validate
- Constrain
- Reconcile
- Document
- Certify
- Track
- Visualise
- Have a golden source
- Agree ownership
- Automate
- Adjust
Let’s look at each of these in turn.
Note: we won't cover data protection (anonymise, secure, audit/log) here.
Visualise
A good visualisation often reveals unanticipated data issues that pre-defined rules don't catch.
With modern tools, exploratory visualisation is quick and easy.
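As a minimal sketch, a quick exploratory plot in Python (the file and column names here are hypothetical) can expose spikes, gaps and outliers that no pre-defined rule would have caught:

```python
# A minimal sketch of exploratory visualisation; file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("trades.csv", parse_dates=["trade_date"])

# A simple scatter of amounts over time: look for spikes, gaps and clusters
plt.scatter(df["trade_date"], df["amount"], alpha=0.3, s=10)
plt.title("Trade amounts over time")
plt.xlabel("trade date")
plt.ylabel("amount")
plt.show()
```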
Validate
Check against business rules, e.g.:
- Within an expected range of values?
- Any missing values?
- Arriving within a certain time?
Inspect individual rows and totals - both counts and amounts.
Flag failures with exception reporting and alerts.
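A minimal sketch of rule-based validation with pandas; the file name, columns, thresholds and cut-off time are hypothetical:

```python
# A minimal sketch of rule-based validation; names, thresholds and cut-off are hypothetical.
import pandas as pd

df = pd.read_csv("trades.csv", parse_dates=["arrival_time"])

rules = {
    "amount out of range":   ~df["amount"].between(0, 1_000_000),
    "missing counterparty":  df["counterparty"].isna(),
    "arrived after cut-off": df["arrival_time"].dt.hour >= 18,
}

# Exception report: one row per failing record, labelled with the broken rule
exceptions = pd.concat(df[mask].assign(rule=name) for name, mask in rules.items())

print(f"{len(df)} rows checked, {len(exceptions)} exceptions")
print(exceptions["rule"].value_counts())   # counts per rule; could also trigger an alert
```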
Certify
- Allow people to publish datasets
- Certify those datasets that meet quality standards
- Certified datasets have a tick mark in the list
Agree ownership
- A particular group (or person) has responsibility for maintaining the quality of a dataset
- Example from finance: market data, instrument static data
- Part of the CDO function? Data steward role?
- Many companies promote culture of “Data is our greatest asset”
- Some datasets need to be independently verified, e.g. prices of obscure financial instruments
Constrain / Enforce
- Apply rules that only allow valid data to enter the ‘data warehouse’
- Relational databases do this well:
- Data types
- No missing values
- Uniqueness constraints
- Lists of allowed values, e.g. direction must be one of North, South, East, West
- Referential integrity between tables
Often worth flagging a ‘missing’ value as:
- Unexpectedly absent
- Not applicable for this product / category / class
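A minimal sketch using SQLite from Python to illustrate these constraints; the tables and columns are hypothetical:

```python
# A minimal sketch of database constraints; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite only enforces foreign keys when this is on

conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE shipments (
        shipment_id INTEGER PRIMARY KEY,                  -- uniqueness
        product_id  INTEGER NOT NULL                      -- no missing values
                    REFERENCES products(product_id),      -- referential integrity
        quantity    INTEGER NOT NULL CHECK (quantity > 0),
        direction   TEXT CHECK (direction IN ('North', 'South', 'East', 'West'))  -- allowed values
    )
""")

# Invalid rows are rejected rather than silently loaded
try:
    conn.execute("INSERT INTO shipments VALUES (1, 99, -5, 'Up')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```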
Reconcile
- Compare current results against a previous day, a benchmark, or the source (raw) datasets
- Compare counts and amounts
- Set a tolerance e.g. 2% for counts, 10% for amounts
- Significant differences are called breaks
- Have a process to find and fix the root cause of each break
Often an automated daily process with sign-offs
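A minimal sketch of such a daily reconciliation in pandas; the file names and columns are hypothetical, and the tolerances are those mentioned above:

```python
# A minimal sketch of a daily reconciliation; file and column names are hypothetical.
import pandas as pd

def summarise(df):
    # Counts and amounts per category
    return df.groupby("category").agg(count=("trade_id", "size"),
                                      amount=("amount", "sum"))

today = summarise(pd.read_csv("positions_today.csv"))
prev  = summarise(pd.read_csv("positions_yesterday.csv"))

recon = today.join(prev, lsuffix="_today", rsuffix="_prev")
recon["count_diff_pct"]  = (recon["count_today"]  / recon["count_prev"]  - 1).abs() * 100
recon["amount_diff_pct"] = (recon["amount_today"] / recon["amount_prev"] - 1).abs() * 100

# A break is any category outside tolerance: 2% for counts, 10% for amounts
breaks = recon[(recon["count_diff_pct"] > 2) | (recon["amount_diff_pct"] > 10)]
print(breaks)   # feeds the find-and-fix / sign-off process
```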
Adjust
Adjustments are a fact of life.
Ensure a properly followed and documented process, for example:
- Two independent people sign off
- Adjustments must have an expiry date
- The reason for the adjustment must be stated
A process to fix the underlying causes must be in place.
Make adjustment comments available on the final dashboard.
Results based on adjusted data should be differentiated, e.g. shown in a different colour.
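As a minimal sketch, these rules can be enforced in code; the class and field names are hypothetical:

```python
# A minimal sketch of an adjustment record enforcing the rules above; names are hypothetical.
from dataclasses import dataclass
from datetime import date

@dataclass
class Adjustment:
    target: str                     # the figure being adjusted
    old_value: float
    new_value: float
    reason: str                     # the reason for the adjustment must be stated
    approvers: tuple[str, str]      # two independent sign-offs
    expiry: date                    # adjustments must have an expiry date

    def __post_init__(self):
        if not self.reason.strip():
            raise ValueError("a reason must be stated")
        if len(set(self.approvers)) < 2:
            raise ValueError("two independent approvers are required")
        if self.expiry <= date.today():
            raise ValueError("the expiry date must be in the future")
```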
Document
Describe the meaning and semantics of the data, and the process followed to transform it.
Open datasets are often supplied with metadata.
'Reproducible research' approach – provide the code used to transform the data, e.g. as a 'notebook', so others can repeat and extend the analysis.
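A minimal sketch of recording metadata alongside a transformed dataset; the file names and fields are hypothetical:

```python
# A minimal sketch of writing metadata alongside the output; all names are hypothetical.
import json
from datetime import date

metadata = {
    "dataset": "monthly_sales_clean.csv",
    "description": "Monthly sales, deduplicated and converted to GBP",
    "source": "raw_sales.csv",
    "transform_code": "clean_sales.ipynb",   # the notebook that produced it
    "produced_on": date.today().isoformat(),
    "columns": {
        "month": "first day of the month (ISO date)",
        "region": "sales region code",
        "amount_gbp": "total sales in GBP",
    },
}

with open("monthly_sales_clean.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```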
Track (apply data lineage)
Assign a unique tag to each item of source data that accompanies it through the data journey.
Aggregating data loses the tags, but aggregation is often the last stage of the journey.
Drill-through capabilities on visualisations are good for listing all the tags behind a suspicious aggregated value.
Often useful in the reconciliation process.
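A minimal sketch of lineage tagging with pandas; the file, columns and the 'suspicious' region are hypothetical:

```python
# A minimal sketch of lineage tagging; file, columns and region are hypothetical.
import pandas as pd

df = pd.read_csv("transactions.csv")
df["tag"] = "TX" + df.index.astype(str)          # unique tag per source row

agg = df.groupby("region").agg(
    total_amount=("amount", "sum"),
    source_tags=("tag", list),                   # lineage survives the aggregation
)

# Drill through: list the source rows behind a suspicious aggregated value
# ("North" is just an example region)
suspicious_tags = agg.loc["North", "source_tags"]
print(df[df["tag"].isin(suspicious_tags)])
```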
Golden Source – Single Version of the Truth
- Certain data architectures have a single instance of every value or calculation
- In theory, no possibility of getting conflicting values
- Often known as ‘cubes’
- Cons: Can become large and complex for the business user to navigate.
Automate
- Theory: manual processes are error-prone (and worse) and need to be removed from any data journey
- Similar approach adopted by large organisations: replace all end-user computing (EUC) with strategic systems
- Citizen data scientists may argue otherwise
BCBS 239: Data Aggregation & Reporting
- Governance
- Aggregation
  - Integrity & accuracy
  - Drill down
- Reporting
  - Comprehensive
  - Accurate
  - Clear
  - Timely
  - Actionable
  - Right audience