Data Quality Discussion
We may receive a large dataset, often with many columns, perhaps in an Excel spreadsheet. Here are some suggestions for improving data quality in two stages: inspect the data and make a plan, then clean the data.
Inspect the data
First, inspect the data and ask the question: is this reasonable? Here is a brief checklist:
- What is the size of the table? How many rows and columns?
- Is the data really tabular?
- Are there any error values?
- Are there any missing values? If so, why?
- Are there any outliers? Are they genuine, or errors?
- Are there any inconsistencies or duplicates where we would not expect them?
There is a detailed discussion of these at the bottom of the page.
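As a minimal sketch of how these checks might look in Python with pandas (the file name is a placeholder):

```python
import pandas as pd

df = pd.read_excel("dataset.xlsx")    # placeholder path

print(df.shape)               # how many rows and columns?
print(df.dtypes)              # is each column the type we expect?
print(df.isna().sum())        # missing values per column
print(df.describe())          # min, max, mean: a quick outlier check
print(df.duplicated().sum())  # unexpected duplicate rows
```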
Clean the data
When cleaning data, always ensure that you, or someone else, can repeat or reproduce the series of steps. You must, for example:
- create a copy of the original dataset
- never modify any original column; instead, create a new (calculated) column
This is in the spirit of the NHS’s Reproducible Analytical Pipelines (RAP) strategy.
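One way to keep the steps reproducible, in the RAP spirit, is to script every cleaning step inside a single function that always starts from an untouched copy. This is only a sketch; the file path and the 'weight_kg' column are hypothetical.

```python
import pandas as pd

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Every cleaning step lives here, so the whole pipeline can be re-run."""
    df = raw.copy()                      # never mutate the original dataset
    # Create a new (calculated) column rather than overwriting an original;
    # 'weight_kg' is a hypothetical column name.
    df["weight_g"] = df["weight_kg"] * 1000
    return df

raw = pd.read_excel("dataset.xlsx")      # placeholder path
clean_df = clean(raw)
```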
Essential: first, format the data as an Excel table.
Here are some suggested approaches; a pandas sketch follows the list:
- shorten and rename the columns; move any long text into a column description in the metadata (many tools support this)
- remove any junk rows
- fix any error cells, e.g. #N/A or #VALUE!
- unpivot the data to go from wide to long (tidy) format
- create new calculated columns
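Most of these steps have direct pandas equivalents. The sketch below assumes hypothetical column names; unpivoting is shown further down the page.

```python
import pandas as pd
import numpy as np

df = pd.read_excel("dataset.xlsx")    # placeholder path

# Shorten and rename columns; keep the long originals in separate metadata.
df = df.rename(columns={"Total Annual Expenditure (GBP)": "spend_gbp"})

# Remove junk rows, e.g. rows with no value in a hypothetical 'id' column.
df = df[df["id"].notna()]

# Fix error cells: treat Excel error strings as missing values.
df = df.replace(["#N/A", "#VALUE!", "#REF!"], np.nan)

# Create a new calculated column instead of overwriting an original one.
df["spend_k"] = df["spend_gbp"] / 1000
```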
We may need to split the table into two (or more) if it is an “everything and the kitchen sink” table with many columns containing duplicated values. For example, one table containing both patient details and a list of appointments needs to be split into two tables, patient and appointment, related by the unique patient identifier.
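A sketch of that patient/appointment split, assuming hypothetical column names:

```python
import pandas as pd

# One wide table mixing patient details with appointment rows.
wide = pd.read_excel("appointments.xlsx")    # placeholder path

# Patient table: one row per patient, details de-duplicated.
patient = wide[["patient_id", "name", "date_of_birth"]].drop_duplicates()

# Appointment table: keeps only patient_id as the link back to patient.
appointment = wide[["patient_id", "appointment_date", "clinic"]]
```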
Here are more details about the checklist points.
Is the data really tabular?
- Are the data in the rows consistent from row to row?
- If there are any calculated columns, is the same calculation used for every row?
- Does every cell have one and only one value?
- Are there any ‘junk’ rows, i.e. rows that do not represent an item of data (e.g. subtotals, extra headers, comment lines, footers)?
- Are two different types of data crammed into one table?
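What counts as a junk row is dataset-specific, but as a sketch (the 'description' column and the 'Total' marker are assumptions):

```python
import pandas as pd

df = pd.read_excel("dataset.xlsx")    # placeholder path

# Rows where every value is missing are often blank separators or footers.
df = df[~df.isna().all(axis=1)]

# Subtotal rows sometimes announce themselves in a text column.
df = df[~df["description"].str.contains("Total", na=False)]
```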
Look at the data column by column and ask these questions:
- What is the data type of each column?
- Are the column names short and meaningful?
- Do date columns contain only dates?
- Do numeric columns only contain numbers?
- For categorical (text) columns, what is the distribution of the values, and the cardinality?
- For numerical columns, what is the min, max and average?
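These column-by-column questions translate naturally into a pandas loop; the file name is a placeholder.

```python
import pandas as pd

df = pd.read_excel("dataset.xlsx")    # placeholder path

for col in df.columns:
    print(col, df[col].dtype)
    if pd.api.types.is_numeric_dtype(df[col]):
        print(df[col].agg(["min", "max", "mean"]))
    else:
        print(df[col].nunique(), "distinct values")    # cardinality
        print(df[col].value_counts().head())           # distribution
```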
Are there any error cells, i.e. cells containing #N/A or other error values?
Are there any missing values? If so:
- Is it reasonable to have missing values in this column?
- Do the missing values represent “not applicable” or “data not collected” or something else?
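When reading a file, pandas treats '#N/A' as missing by default; other Excel error strings have to be listed explicitly. A sketch:

```python
import pandas as pd

# '#N/A' is in pandas' default NA list; add other error strings as needed.
df = pd.read_excel("dataset.xlsx",
                   na_values=["#VALUE!", "#REF!", "#DIV/0!"])

# Percentage of missing values per column: is missingness reasonable here?
print(df.isna().mean().mul(100).round(1))
```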
Is the data really tidy? Or is the same variable spread across several columns?
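For example, if one column per year holds the same measure, pd.melt unpivots the table from wide to long; the column names below are hypothetical.

```python
import pandas as pd

wide = pd.DataFrame({
    "region": ["North", "South"],
    "2023": [10, 20],
    "2024": [12, 25],
})

# The year columns all hold the same variable, so gather them into one.
tidy = wide.melt(id_vars="region", var_name="year", value_name="count")
```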
Are there any suspicious values, for example:
- possible misspellings
- outliers in numeric columns
- values that are much more frequent than we might reasonably expect?
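One way to surface both kinds of suspicious value in pandas, assuming hypothetical 'region' and 'amount' columns (the 1.5 × IQR rule is a common convention, not the only one):

```python
import pandas as pd

df = pd.read_excel("dataset.xlsx")    # placeholder path

# Possible misspellings show up as rare variants in the tail of value_counts.
print(df["region"].value_counts())

# Flag numeric values outside 1.5 * IQR as candidate outliers.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
```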
Is the data denormalised? Can we split it into a fact table with fewer columns and one or more dimension tables?
- Look for columns whose names suggest dimensional attributes, e.g. NHS patient ID, product, region. Product classifications, people or entities (actors), and geographical or date columns are common examples.
- Are these values repeated? That would suggest the column belongs in a dimension table.
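A quick, rough test for dimension candidates is the ratio of distinct values to rows; columns where the ratio is near zero repeat heavily. A sketch:

```python
import pandas as pd

df = pd.read_excel("dataset.xlsx")    # placeholder path

# Low distinct-to-row ratios suggest dimension columns; near 1 means a key.
ratio = df.nunique() / len(df)
print(ratio.sort_values())
```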