Data and AI Training

Home | Power BI | Excel | Python | SQL | Generative AI | Visualising Data | Analysing Data

Introduction to SQL

This accompanies a talk to introduce SQL to a wide audience. We start with why it is useful to learn SQL (at least conceptually).

SQL is the key to unlocking the world’s valuable data

We rend to think of SQL in relation to relational databases, for example:Microsoft’s SQL Server and lots of open source databases: Oracle, IBM’s DB2, Postgres, Maria, BiqQuery, Snowflake, MySQL …

But SQL is also used to retrieve for data lakes e.g. lots of large CSV files, via with a “SQL end-point”

And also used for to read data from large data warehouses e.g. the SparkSQL languages is based on SQL

Sample Data

The data that we are going to use in our examples is in two tables. The fictitious PatientStay table list patient stays in a few hospitals in London. Here are a few rows of that table

PatientId AdmittedDate DischargeDate Hospital Ward Tariff Ethnicity
1 2024-02-27 2024-03-06 Kingston Dermatology 2 White
2 2024-02-27 2024-03-04 Kingston Opthamology 8 Black British
3 2024-02-27 2024-03-08 PRUH Day Surgery 4 Asian
4 2024-02-27 2024-02-29 Oxleas Dermatology 8  
5 2024-02-27 2024-03-02 Oxleas Opthamology 5 White

The DimHospital table lists hospitals in London.

Hospital Hospital Type Hospital Size Reach
Kings College Teaching Medium City
Kingston General Small County
Oxleas General Medium County
PRUH Teaching Small County

Notice that both tables have a column named Hospital and that these columns have the same set (domain) of values e.g. Kingston, PRUH,… In the DimHospital table, the values of the Hospital column are unique - each row has a different value in this column.

SQL experts will complain (correctly) that I have used a business key as a primary key. I should have created a surrogate key.

Here is the structure of these two tables. Notice the line between the Hospital columns that show these are related. In fact it shows more that than this - it ensures that there will never be a patient in a hospital that is not in the DimHospital table.

patientstay-hospital-er-diagram

SQL and Database 101

SQL stands for Structured (English) Query Language

SQL goes hand in glove with the de-facto pattern of organising data:

Confusingly, different people call the same thing by different names.

Here is the database map (known as an entity relationship diagram) of a small but canonical database.

sql chinook tables

You are going to use SQL in your career

SQL was invented in the early 1970s. That’s a long time ago! In 1970

SQL become an ANSI standard in 1986 - this was a big deal. SQL now had formal rules which has led to widespread adoption (but the AI in ANSI stands for ?).

SQL became an ISO standard in 1989 (and the I in ISO stands for ?).

I could have given exactly this talk 10 years ago (and will probably give it again in 10 years time!)

It is not usual for computer languages to have such a long lifespan

Why you will like SQL

SQL has many advantages, for example,

The best thing about SQL - you can stop relying on IT (and waiting for them) to generate those reports you need now!

You can automate tasks with SQL - reduce manual procedures.

Finally SQL skills are valued in finance - SQL looks good on your CV.

The SQL SELECT statement

As finance logistics professionals or even as data analyst, we only need to know how use SQL to get data. This is the SELECT statement

There are other helpful things that people who actually build databases need to consider (but we don’t). SQL can do all of this:

The structure of a SQL SELECT statement

The syntax is as follows.

SELECT
    <column>
    , <column>
    , <column>

FROM <table>
WHERE <where-clause>
ORDER BY <column>;

Here is the simplest SQL statement and the first few rows of the data it returns.

SELECT 
       *
FROM dbo.PatientStay;
Patient ID Admitted Date Hospital Ward Tariff Ethnicity Discharge Date
1 2024-02-26 Kingston Dermatology 2 White 2024-03-04
2 2024-02-26 Kingston Ophthalmology 8 Black British 2024-02-27
3 2024-02-26 PRUH Day Surgery 4 Asian 2024-02-28

Here is a more complex SQL statement and the data it returns.

SELECT
	ps.PatientId
	, ps.AdmittedDate
	, ps.DischargeDate
	, ps.Hospital
	, ps.Ward
	, DATEDIFF(DAY, ps.AdmittedDate, ps.DischargeDate) + 1 AS LengthOfStay
FROM
	PatientStay ps
WHERE
	ps.Hospital IN ( 'Oxleas', 'PRUH' )
	AND ps.AdmittedDate BETWEEN '2024-02-01' AND '2024-02-28'
	AND ps.Ward LIKE '%Surgery'
ORDER BY
	ps.AdmittedDate DESC
	, ps.PatientId DESC;
Patient ID Admitted Date Discharge Date Hospital Ward Length of Stay
150 2024-02-27 2024-03-06 Oxleas Day Surgery 9
130 2024-02-27 2024-03-05 Oxleas General Surgery 8
6 2024-02-26 2024-02-28 Oxleas Day Surgery 3
3 2024-02-26 2024-02-28 PRUH Day Surgery 3

A SQL SELECT statement with a GROUP BY and aggregate functions

A GROUP BY clause summarises data at a higher level. Here is the syntax.

SELECT <column>
       , <column>
       , AGG(<column>)

FROM <table>
WHERE <where-clause>
GROUP BY <column>, <column>
HAVING <having-clause>
ORDER BY <column>, <column>;

The SQL statement below does the following.

SELECT
	ps.Hospital
	, COUNT(*) AS NumberOfAdmissions
	, SUM(ps.Tariff) AS TotalTariff
FROM
	PatientStay ps
GROUP BY
	ps.Hospital
HAVING
	COUNT(*) > 10
ORDER BY
	NumberOfAdmissions DESC;

This is the result.

Hospital Number of Admissions Total Tariff
Oxleas 15 96
Kings College 13 83
Kingston 12 70

We’ve now completed our SQL Foundation

That’s it (for a single table). We need only 5 operations on tabular data

Retrieve data across two tables into a single dataset with JOIN

SELECT
	ps.PatientId
	, ps.AdmittedDate
	, h.HospitalType
	, h.HospitalSize
FROM
	PatientStay ps
JOIN DimHospital h ON
	ps.Hospital = h.Hospital;

This SQL statement returns data that contains columns from both tables.

Patient ID Admitted Date Hospital Type Hospital Size
1 2024-02-26 General Small
2 2024-02-26 General Small
3 2024-02-26 Teaching Small
4 2024-02-26 General Medium

SQL editors

We need a SQL editor to write and run SQL. Examples are:

image

Some hints for writing SQL without tears

Use AI to solve SQL challenges

AI is a useful, practical, everyday tool, that we can use now to help us write SQL and learn SQL.

Generative AI large language models (LLMs) such as ChatGPT, Copilot, Claude, Gemini have free editions and are available from a browser.

Pair programmers help us write better SQL from inside the SQL editor e.g. GitHub Copilot inside VSCode editor

sql vscode copilot setup

Next steps in your SQL learning journey

The only way to learn is to practice. It’s best by far to practice your own data and SQL tools at work but if this is not possible, here are some alternatives