Most of the data that we analyse will be tabular, i.e. arranged in a table with rows and columns.
Different data professionals have different names for (nearly) the same thing and this may sometimes lead to confusion.
Before starting on our data modelling process, we consider the characteristics of the data. We normally analyse each variable (column) of the data in several ways.
We can describe the set of values in a column in many useful ways.
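As a rough illustration, a few pandas calls can surface most of these characteristics in one pass. This is a minimal sketch; the file name journeys.csv and the columns it implies are hypothetical, not part of the original example.

```python
import pandas as pd

# Hypothetical dataset - substitute your own file.
df = pd.read_csv("journeys.csv")

print(df.dtypes)        # datatype of each column
print(df.describe())    # summary statistics for numeric columns
print(df.nunique())     # cardinality (number of distinct values) per column
print(df.isna().sum())  # count of blank/NULL values per column
```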
The datatype determines the set of values that can be placed in the column. Common datatypes are
Categorical variables represent distinct categories or groups. A categorical variable can take only a small set of values, e.g. journey type or crime category.
We can split categorical variables into nominal and ordinal types. Nominal variables have no inherent order, for example,
Ordinal variables have a meaningful order, for example,
The x-axis of this chart is a categorical column.
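In pandas, the nominal/ordinal distinction above can be made explicit with the Categorical type; passing ordered=True records a meaningful order. The category values below are illustrative assumptions, not taken from the chart.

```python
import pandas as pd

# Nominal: no inherent order (illustrative values).
journey_type = pd.Categorical(["bus", "tube", "cycle", "bus"])

# Ordinal: a meaningful order (illustrative values).
satisfaction = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)

print(satisfaction.min(), satisfaction.max())  # order-aware comparisons work
```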
A hierarchy is a set of columns going from broad to narrow categories. There are often natural groupings or hierarchies in the data, such as:
Hierarchies often form the basis of a guided navigation path from summary to details.
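As a sketch of how a hierarchy supports that navigation, the hypothetical example below aggregates at a broad level and then drills down to a narrower one; the region, city and sales column names are assumptions.

```python
import pandas as pd

# Hypothetical data with a region -> city hierarchy.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "city":   ["Leeds", "York", "Bristol", "Bath"],
    "sales":  [100, 80, 120, 60],
})

print(df.groupby("region")["sales"].sum())            # summary level
print(df.groupby(["region", "city"])["sales"].sum())  # drill down to detail
```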
Numbers can be classified into discrete or continuous types. (Categorical and ordinal variables are always discrete.)
Discrete data consists of distinct, separate values, often counted in whole numbers, for example
Discrete data is typically counted.
Continuous data can take any value within a given range, so a continuous variable has an infinite set of possible ordered values. Continuous variables are typically measured rather than counted, for example
The source data for this column chart is the number of bike hires recorded by TfL (Transport for London) on each day of the last decade. This is effectively a continuous variable, since it can take any whole number between roughly 4,000 and 80,000. These values are binned into ranges, and the column height shows the number of days falling in each range. This allows us to see the distribution.
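The same binning can be reproduced in pandas with pd.cut. The sketch below uses randomly generated stand-in data rather than the real TfL figures; the 10,000-wide bands are an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the TfL data: one daily hire count per day for a decade.
rng = np.random.default_rng(0)
daily = pd.DataFrame({"hires": rng.integers(4_000, 80_000, size=3_652)})

# Bin the daily counts into 10,000-wide ranges and count the days in each range -
# these counts are the column heights in the chart.
bands = pd.cut(daily["hires"], bins=range(0, 90_000, 10_000))
print(bands.value_counts().sort_index())
```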
Determine whether zero holds a specific meaning in the context of the data. For example,
Consider whether a variable can have negative values. This is common in financial data (for example, profit and loss) but less common in measures like height or weight.
If the dataset includes dates (and it usually does), consider whether the specific time of day is relevant to the analysis.
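If the time of day is not needed, it is usually worth checking for it and stripping it early. A minimal sketch, assuming a hypothetical timestamp column:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-03-12 08:15", "2025-03-12 17:40"])
})

# Does any row carry a time component other than midnight?
print((df["timestamp"].dt.normalize() != df["timestamp"]).any())

# If the time of day is irrelevant, keep only the date part
# (this also reduces the column's cardinality).
df["date"] = df["timestamp"].dt.normalize()
```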
Often columns are constrained so that every row must have a value: no row can have a blank or NULL value. Ideally this is the case for most columns, since not having to consider blank values makes life simpler.
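A quick check confirms that a column we expect to be fully populated really has no blanks; the data and column names below are hypothetical.

```python
import pandas as pd

# Illustrative data: one row has a missing journey_type.
df = pd.DataFrame({
    "journey_id":   [1, 2, 3],
    "journey_type": ["bus", None, "cycle"],
})

print(df.isna().sum())  # count of blanks/NULLs per column

# Fail fast if a column we rely on contains blanks.
assert df["journey_id"].notna().all(), "journey_id contains blank values"
```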
Some columns must contain values that are unique. There can be no duplicate values.
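In pandas, a candidate key column can be checked for duplicates in a couple of lines; the journey_id column is a hypothetical example.

```python
import pandas as pd

# Illustrative data with a duplicated candidate key.
df = pd.DataFrame({"journey_id": [1, 2, 2, 3]})

print(df["journey_id"].is_unique)                   # False - duplicates exist
print(df[df["journey_id"].duplicated(keep=False)])  # show the offending rows
```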
The cardinality of a column is the number of distinct values in the column. Columns with few distinct values are low cardinality. Columns with many distinct values are high cardinality.
Cardinality is often important for the choice of visualisation. For example, we may not want a bar chart with more than 10 bars, so a column with higher cardinality would not be a good choice to place on the x-axis.
Examples of a low cardinality column could be
Examples of a high cardinality column could be
Cardinality is sometimes important when we design reports with certain software tools. For example, Power BI is more performant with low cardinality columns.
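Cardinality is also easy to measure directly. A sketch with made-up data: columns with few distinct values appear first in the sorted output.

```python
import pandas as pd

# Made-up data: day_of_week is low cardinality, journey_id is high cardinality.
df = pd.DataFrame({
    "journey_id":  range(1, 1001),
    "day_of_week": ["Mon", "Tue", "Wed", "Thu", "Fri"] * 200,
})

cardinality = df.nunique().sort_values()
print(cardinality)

# Low-cardinality columns are reasonable candidates for a bar chart axis.
print(cardinality[cardinality <= 10].index.tolist())
```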
Surrogate keys are synthetic values, often integers, used for the primary key. They are useful when there is no natural primary key and are often created by large IT systems.
For example, some systems use a date key in YYYYMMDD format, e.g. 20250312, rather than a date, e.g. 12-March-2025.
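Converting between the two representations is straightforward; a sketch assuming a column of integer YYYYMMDD keys:

```python
import pandas as pd

df = pd.DataFrame({"date_key": [20250312, 20250313]})

# Surrogate integer key -> real date ...
df["date"] = pd.to_datetime(df["date_key"].astype(str), format="%Y%m%d")

# ... and back again.
df["date_key_roundtrip"] = df["date"].dt.strftime("%Y%m%d").astype(int)
print(df)
```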
When creating a table, it is important to be able to articulate what the grain is. What does a single row in your table represent? Can you define this clearly?
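A useful sanity check is to group by the columns that are supposed to define the grain and confirm that no combination appears more than once; the table and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical fact table whose intended grain is one row per (date, station).
df = pd.DataFrame({
    "date":    ["2025-03-12", "2025-03-12", "2025-03-13"],
    "station": ["Waterloo", "Bank", "Waterloo"],
    "hires":   [120, 85, 140],
})

grain = ["date", "station"]
rows_per_grain = df.groupby(grain).size()
assert (rows_per_grain == 1).all(), "Grain violated: some (date, station) pairs repeat"
```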