Home | Power BI | Excel | Python | SQL | Generative AI | Visualising Data | Analysing Data
This accompanies a talk to introduce SQL to a wide audience. We start with why it is useful to learn SQL (at least conceptually).
We rend to think of SQL in relation to relational databases, for example:Microsoft’s SQL Server and lots of open source databases: Oracle, IBM’s DB2, Postgres, Maria, BiqQuery, Snowflake, MySQL …
But SQL is also used to retrieve for data lakes e.g. lots of large CSV files, via with a “SQL end-point”
And also used for to read data from large data warehouses e.g. the SparkSQL languages is based on SQL
The data that we are going to use in our examples is in two tables. The fictitious PatientStay table list patient stays in a few hospitals in London. Here are a few rows of that table
PatientId | AdmittedDate | DischargeDate | Hospital | Ward | Tariff | Ethnicity |
---|---|---|---|---|---|---|
1 | 2024-02-27 | 2024-03-06 | Kingston | Dermatology | 2 | White |
2 | 2024-02-27 | 2024-03-04 | Kingston | Opthamology | 8 | Black British |
3 | 2024-02-27 | 2024-03-08 | PRUH | Day Surgery | 4 | Asian |
4 | 2024-02-27 | 2024-02-29 | Oxleas | Dermatology | 8 | |
5 | 2024-02-27 | 2024-03-02 | Oxleas | Opthamology | 5 | White |
The DimHospital table lists hospitals in London.
Hospital | Hospital Type | Hospital Size | Reach |
---|---|---|---|
Kings College | Teaching | Medium | City |
Kingston | General | Small | County |
Oxleas | General | Medium | County |
PRUH | Teaching | Small | County |
Notice that both tables have a column named Hospital and that these columns have the same set (domain) of values e.g. Kingston, PRUH,… In the DimHospital table, the values of the Hospital column are unique - each row has a different value in this column.
SQL experts will complain (correctly) that I have used a business key as a primary key. I should have created a surrogate key.
Here is the structure of these two tables. Notice the line between the Hospital columns that show these are related. In fact it shows more that than this - it ensures that there will never be a patient in a hospital that is not in the DimHospital table.
SQL stands for Structured (English) Query Language
SQL goes hand in glove with the de-facto pattern of organising data:
Confusingly, different people call the same thing by different names.
Here is the database map (known as an entity relationship diagram) of a small but canonical database.
SQL was invented in the early 1970s. That’s a long time ago! In 1970
SQL become an ANSI standard in 1986 - this was a big deal. SQL now had formal rules which has led to widespread adoption (but the AI in ANSI stands for ?).
SQL became an ISO standard in 1989 (and the I in ISO stands for ?).
I could have given exactly this talk 10 years ago (and will probably give it again in 10 years time!)
It is not usual for computer languages to have such a long lifespan
SQL has many advantages, for example,
The best thing about SQL - you can stop relying on IT (and waiting for them) to generate those reports you need now!
You can automate tasks with SQL - reduce manual procedures.
Finally SQL skills are valued in finance - SQL looks good on your CV.
As finance logistics professionals or even as data analyst, we only need to know how use SQL to get data. This is the SELECT statement
There are other helpful things that people who actually build databases need to consider (but we don’t). SQL can do all of this:
The syntax is as follows.
SELECT
<column>
, <column>
, <column>
…
FROM <table>
WHERE <where-clause>
ORDER BY <column>;
Here is the simplest SQL statement and the first few rows of the data it returns.
SELECT
*
FROM dbo.PatientStay;
Patient ID | Admitted Date | Hospital | Ward | Tariff | Ethnicity | Discharge Date |
---|---|---|---|---|---|---|
1 | 2024-02-26 | Kingston | Dermatology | 2 | White | 2024-03-04 |
2 | 2024-02-26 | Kingston | Ophthalmology | 8 | Black British | 2024-02-27 |
3 | 2024-02-26 | PRUH | Day Surgery | 4 | Asian | 2024-02-28 |
Here is a more complex SQL statement and the data it returns.
SELECT
ps.PatientId
, ps.AdmittedDate
, ps.DischargeDate
, ps.Hospital
, ps.Ward
, DATEDIFF(DAY, ps.AdmittedDate, ps.DischargeDate) + 1 AS LengthOfStay
FROM
PatientStay ps
WHERE
ps.Hospital IN ( 'Oxleas', 'PRUH' )
AND ps.AdmittedDate BETWEEN '2024-02-01' AND '2024-02-28'
AND ps.Ward LIKE '%Surgery'
ORDER BY
ps.AdmittedDate DESC
, ps.PatientId DESC;
Patient ID | Admitted Date | Discharge Date | Hospital | Ward | Length of Stay |
---|---|---|---|---|---|
150 | 2024-02-27 | 2024-03-06 | Oxleas | Day Surgery | 9 |
130 | 2024-02-27 | 2024-03-05 | Oxleas | General Surgery | 8 |
6 | 2024-02-26 | 2024-02-28 | Oxleas | Day Surgery | 3 |
3 | 2024-02-26 | 2024-02-28 | PRUH | Day Surgery | 3 |
A GROUP BY clause summarises data at a higher level. Here is the syntax.
SELECT <column>
, <column>
, AGG(<column>)
…
FROM <table>
WHERE <where-clause>
GROUP BY <column>, <column>
HAVING <having-clause>
ORDER BY <column>, <column>;
The SQL statement below does the following.
SELECT
ps.Hospital
, COUNT(*) AS NumberOfAdmissions
, SUM(ps.Tariff) AS TotalTariff
FROM
PatientStay ps
GROUP BY
ps.Hospital
HAVING
COUNT(*) > 10
ORDER BY
NumberOfAdmissions DESC;
This is the result.
Hospital | Number of Admissions | Total Tariff |
---|---|---|
Oxleas | 15 | 96 |
Kings College | 13 | 83 |
Kingston | 12 | 70 |
That’s it (for a single table). We need only 5 operations on tabular data
SELECT
ps.PatientId
, ps.AdmittedDate
, h.HospitalType
, h.HospitalSize
FROM
PatientStay ps
JOIN DimHospital h ON
ps.Hospital = h.Hospital;
This SQL statement returns data that contains columns from both tables.
Patient ID | Admitted Date | Hospital Type | Hospital Size |
---|---|---|---|
1 | 2024-02-26 | General | Small |
2 | 2024-02-26 | General | Small |
3 | 2024-02-26 | Teaching | Small |
4 | 2024-02-26 | General | Medium |
We need a SQL editor to write and run SQL. Examples are:
SELECT ps.* FROM PatientStay ps;
AI is a useful, practical, everyday tool, that we can use now to help us write SQL and learn SQL.
Generative AI large language models (LLMs) such as ChatGPT, Copilot, Claude, Gemini have free editions and are available from a browser.
Pair programmers help us write better SQL from inside the SQL editor e.g. GitHub Copilot inside VSCode editor
The only way to learn is to practice. It’s best by far to practice your own data and SQL tools at work but if this is not possible, here are some alternatives