Home | Power BI | Excel | Python | SQL | Visualising Data | Generative AI | Analysing Data Course - Home
Most of the data that we analyse will be tabular i.e., arranged in a table with rows and columns.
Different data professionals have different names for (nearly) the same thing and this may sometimes lead to confusion.
Before starting on our data modelling process, we consider the characteristics of the data. We normally analyse each variable (column) of the data in several ways
We can describe the set of values in a columns in many useful ways.
The datatype contains the values that can be placed in the column. Common datatypes are
Categorical variables represent distinct categories or groups. A categorical data can take a small set of values e.g. journey type, crime category.
We can split categorical variables into nominative and ordinal types. Nominal variables have no inherent order, for example,
Ordinal variables have a meaningful order, for example,
The x axis of this chart is categorical column
A hierarchy is a set of columns going from broad to narrow categories. There are often natural groupings or hierarchies in the data, such as:
Numbers can be classified into discrete or continuous types. (Categorical and ordinal variables are always discrete.)
Discrete data consists of distinct, separate values, often counted in whole numbers, for example
Discrete data is typically counted.
Continuous data can take any value within a given range. A continuous variable can take an infinite set of ordered values. Continuous variables are typically measured rather than counted, for example
Use a column chart (histogram) of the count of binned ranges to see the distribution.
Determine whether zero holds a specific meaning in the context of the data. For example,
Consider whether a variable can have negative values. This is common in financial data (for example, profit and loss) but less common in measures like height or weight.
If the dataset includes dates (and it usually does), consider whether the specific time of day is relevant to the analysis.
Often columns are constrained so that every row must have a value - no row can have a blank or NULL value. Ideally this is the case for most columns since it makes life simpler when we don’t have to consider blank values.
Hints:
Some columns must contain values that are unique. There can be no duplicate values.
The cardinality of a column is the number of distinct values in the column. Columns with few distinct values are low cardinality. Columns with many distinct values are high cardinality. Power BI is more performant with low cardinality columns.
Examples of a low cardinality column could be
Examples of a high cardinality column could be
Surrogate keys are synthetic values, often integer,s used for the primary key. These are useful when there is no natural primary key and often created by large IT systems.
For example, some systems use a date key, in a YYYYMMDD format e.g. 20250312 rather than a date e.g. 12-March-2025.
When creating a table, it is important to be able to articulate what the grain is. What does a single row in your table represent? Can you define this clearly?