Home | Power BI | Excel | Python | SQL | Generative AI | Visualising Data | Analysing Data
Almost everyone knows the story of the Titanic. In April 1912, this magnificent ship left Southampton on its maiden voyage to New York, but it never arrived. It hit an iceberg in the Atlantic and sank. There were over 2,000 people on board. Less than half survived.
A century later, this Titanic dataset is a classic case study for rookie data scientist to build a predictive model to determine who is likely to survive or perish (ignoring the fact that this is a matter of historical record). However, we will visualise the data with Power BI and see if we can gain some intuition and who did and did not survive and why. We know from the film that Kate Winslet survived but poor old Leo DiCaprio did not – is that an accurate reflection?
This dataset contains a partial list of the passengers on board including variables (columns) such as Name, Age, Sex, and Pclass i.e. whether they travelled 1st, 2nd or 3rd class.
We need to clean the data. Some variables have missing values. The names of the variables are cryptic and so are the values. We’ll start by making the variable names and values easier to understand.
There are two files - the same data either in:
Here is a description of the columns:
PassengerId
Survived
, 0 = Died, 1 = SurvivedPclass
, Passenger ClassName
Sex
Age
SibSp
, Number of siblings (brothers/sisters) and spouses in the family group travelling with the passenger (excluding the passenger)Parch
, Number of parents and children in the family group travelling with the passenger (excluding the passenger)Ticket
Fare
Cabin
Embarked
, S = Southampton, Q = Queenstown, C = CherbourgImport the data in the Passengers tab of the ‘Titanic Data’ workbook.
Remove the Ticket and Cabin columns (we don’t need them in this exercise).
Rename the Sex column to Gender.
Add a new column named Survival, with two values Died or Survived based on the value of the Survived column (0 and 1 respectively). Remove the Survived column – we no longer need it.
Add a new column PassengerClass with values First, Second and Third.
Change the datatype of both PassengerClass and Survival from Any to Text.
Add a custom column, FamilySize, with formula = [SibSp]+[Parch]+1. Change the datatype to whole number.
In the Embarked Column, replace any blank values with “S”. Then replace S, C and Q values with Southampton, Cherbourg and Queenstown respectively.
Create a new measure. Use the formula PassengerCount = COUNTROWS(‘Passengers’)
Build some charts to show the number of passengers by gender, class etc,
Build a bar or ribbon chart to show the age distribution of passengers.