Data Training Courses


How to clean messy data

Often we receive a large dataset, possibly in an Excel spreadsheet, with many columns. Here are some suggestions for checking and cleaning the data.

First, inspect the data and ask the question: is this reasonable?

What is the size of the table? How many rows and columns?
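In Python with pandas, the table's size is one attribute away. The DataFrame below is a small in-memory stand-in for a spreadsheet; real data would come from something like pd.read_excel.

```python
import pandas as pd

# Small in-memory stand-in for a spreadsheet (real data would come
# from pd.read_excel("...") or pd.read_csv("...")).
df = pd.DataFrame({
    "Region": ["North", "South", "North"],
    "Sales":  [120, 95, 130],
})

# shape gives (number of rows, number of columns) in one call
rows, cols = df.shape
print(f"{rows} rows x {cols} columns")
```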

Is the data really tabular?

  1. Are the data in the rows consistent from row to row?
  2. If there are any calculated columns, is the same calculation used for every row?
  3. Does every cell have one and only one value?
  4. Are there any ‘junk’ rows, i.e. rows that do not represent an item of data, e.g. subtotals, extra headers, comment lines, or footers?
  5. Are two different types of data crammed into one table?
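Check 4 above, spotting junk rows, can be sketched in pandas. This is a minimal example, assuming junk rows are either fully blank or carry a subtotal marker in a label column (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Apples", "Pears", "Subtotal", "Bananas", None],
    "Qty":     [10, 5, 15, 7, None],
})

# Flag likely junk rows: subtotal/footer markers, or rows that are entirely blank.
junk = (
    df["Product"].astype(str).str.contains("total", case=False)
    | df.isna().all(axis=1)
)
clean = df[~junk].reset_index(drop=True)
print(clean)
```

Inspect the flagged rows before dropping them; a pattern match like this can catch legitimate data.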

Look at the data column by column and ask these questions:

Are there any error cells, i.e. cells containing #N/A, #VALUE!, or other Excel errors?
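After an import into pandas, Excel error cells sometimes survive as text. A sketch of scanning for them, assuming the errors arrived as strings:

```python
import pandas as pd

# Excel error markers that can survive as text after import.
EXCEL_ERRORS = {"#N/A", "#VALUE!", "#REF!", "#DIV/0!", "#NAME?", "#NULL!", "#NUM!"}

df = pd.DataFrame({
    "Price": [9.99, "#N/A", 4.50],
    "Ratio": [0.5, 0.7, "#DIV/0!"],
})

error_mask = df.isin(EXCEL_ERRORS)   # True wherever a cell holds an error string
print(error_mask.sum())              # count of error cells per column
```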

Are there any missing values? If so, decide whether to remove, impute, or flag them.
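Counting the missing values per column is a one-liner in pandas. The example below also shows one possible remedy, dropping rows that lack a required field; which remedy is right depends on the data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", None, "Cal"],
    "Age":  [34, 28, None],
})

missing_per_column = df.isna().sum()   # count of missing cells per column
print(missing_per_column)

# One remedy among several: drop rows missing a required field.
required = df.dropna(subset=["Name"])
```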

Is the data really tidy? Or is the same variable spread across several columns?
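When one variable is spread across several columns (years as column headers, for instance), pandas' melt gathers them back into a single column. A minimal sketch with made-up region/year data:

```python
import pandas as pd

# One variable (sales) spread across several year columns -- not tidy.
wide = pd.DataFrame({
    "Region": ["North", "South"],
    "2022":   [100, 80],
    "2023":   [110, 95],
})

# melt gathers the year columns into a (Year, Sales) pair per row.
tidy = wide.melt(id_vars="Region", var_name="Year", value_name="Sales")
print(tidy)
```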

Are there any suspicious values, for example negative quantities, impossible dates, or extreme outliers?
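A quick summary often surfaces suspicious values at a glance. The range limits below are assumptions chosen for this illustration; sensible limits depend on what the column actually measures.

```python
import pandas as pd

df = pd.DataFrame({"Qty": [3, 5, -2, 4, 900]})

# describe() surfaces impossible minima and implausible maxima at a glance.
print(df["Qty"].describe())

# Flag values outside a plausible range (limits are assumptions for this data).
suspicious = df[(df["Qty"] < 0) | (df["Qty"] > 100)]
print(suspicious)
```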

Is the data denormalised - can we split it into a fact table with fewer columns and one or more dimension tables?
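Splitting a denormalised table can be sketched in pandas: drop_duplicates builds the dimension table, and the fact table keeps only the key and the measures. The column names here are hypothetical.

```python
import pandas as pd

# Denormalised table: customer details repeated on every order row.
orders = pd.DataFrame({
    "OrderId":      [1, 2, 3],
    "CustomerId":   ["C1", "C2", "C1"],
    "CustomerName": ["Ann", "Bob", "Ann"],
    "Amount":       [50, 75, 20],
})

# Dimension table: one row per customer.
dim_customer = orders[["CustomerId", "CustomerName"]].drop_duplicates()

# Fact table: just the key and the measures.
fact_orders = orders[["OrderId", "CustomerId", "Amount"]]
```

The original table can always be rebuilt by joining the fact table back to the dimension on CustomerId.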

Act to clean the data

Sometimes a dataset does not have a numeric column that can be summed as the quantitative measure. Each row may represent, for example, an event or an employee, with no natural column to sum. In this case, create a new measure that counts the number of rows. For example, in Power BI’s DAX language:

Number of Items = COUNTROWS ( 'Fact' )
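The pandas equivalent of such a row-count measure is simply the length of the table (the table and column names below are illustrative):

```python
import pandas as pd

fact = pd.DataFrame({"FactId": [101, 102, 103], "Category": ["A", "B", "A"]})

# Equivalent of a row-count measure: count the rows of the fact table.
number_of_items = len(fact)
print(number_of_items)   # 3
```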