Data and AI Training

Home | Power BI | Excel | Python | SQL | Generative AI | Visualising Data | Analysing Data

An introduction to SQL

This session will introduce SQL , the language for retrieving (querying) data from a database and teach how to craft simple SQL statements to retrieve data then do some calculations and summaries. The presenter will demo the tools and editors to write and execute the SQL, and touch upon how generative AI tools are helping us to write SQL. The session aims to give you confidence to start using SQL in your day to day roles.

SQL stands for Structured (English) Query Language.

We start with why it is useful to learn SQL (at least conceptually).

SQL is the key to unlocking the world’s valuable data

We think of SQL in relation to relational databases, for example, Microsoft’s SQL Server and lots of other databases: Oracle, IBM’s DB2, Postgres, Maria, BiqQuery, Snowflake, MySQL and others. But SQL is also used to retrieve for data lakes e.g. lots of large CSV files, via with a “SQL end-point”. SQL is also used to read data from large data warehouses e.g. the SparkSQL language is based on SQL.

SQL and Database 101

SQL goes hand in glove with the de-facto pattern of organising data:

The database builder can constrain the data in the database, e.g., column datatypes, primary and foreign keys (which provide referential integrity), check constraints. Used properly, these contribute to enforcing good data quality.

For completeness, relational databases also contain:

We won’t talk about these today - perhaps in the next talk.

Sample Data

The data that we are going to use in our examples is in two tables. The fictitious PatientStay table lists patient stays in a few hospitals in London. Here are a few rows of that table

PatientId AdmittedDate DischargeDate Hospital Ward Tariff Ethnicity
1 2024-02-27 2024-03-06 Kingston Dermatology 2.4 White
2 2024-02-27 2024-03-04 Kingston Ophthalmology 8.2 Black British
3 2024-02-27 2024-03-08 PRUH Day Surgery 4.7 Asian
4 2024-02-27 2024-02-29 Oxleas Dermatology 8.3  
5 2024-02-27 2024-03-02 Oxleas Ophthalmology 5.1 White

The DimHospital table lists hospitals in London.

Hospital HospitalType HospitalSize Reach
Kings College Teaching Medium City
Kingston General Small County
Oxleas General Medium County
PRUH Teaching Small County

Notice that both tables have a column named Hospital and that these columns have the same set (domain) of values e.g. Kingston and PRUH. In the DimHospital table, the values of the Hospital column are unique - each row has a different value in this column.

SQL experts will complain (correctly) that I have used a business key as a primary key. I should have created a surrogate key.

Here is the structure of these two tables.

patientstay-hospital-er-diagram

The PatientID column in PatientStay and the Hospital column in DimHospital are in bold because they are the primary keys. The database will not allow duplicate or missing (NULL) values in these columns.

The icons to the left of the column names indicate the datatype of the column: these include number (int and float), date or text (varchar).

The line between the Hospital columns that show these two table have a relationship. The Hospital column in PatientStay is a foreign key that references the Hospital column, the primary key, in DimHospital. This ensures that there will never be a patient in a hospital that is not in the DimHospital table.

A good database administrator (DBA) will include primary and foreign keys in their database design. Check if they exist on your data.

Here is the database map (known as an entity relationship diagram) of a small but canonical database. The lines between the tables indicate that the DBA has set up primary and foreign key relationships between the tables.

sql chinook tables

You are going to use SQL in your career

SQL was invented in the early 1970s. That’s a long time ago! In 1970

SQL become an ANSI standard in 1986 - this was a big deal. SQL now had formal rules which has led to widespread adoption (but the AI in ANSI stands for ?).

SQL became an ISO standard in 1989 (and the I in ISO stands for ?).

I could have given exactly this talk 10 years ago (and will probably give it again in 10 years time!)

It is not usual for computer languages to have such a long lifespan

Why you will like SQL

SQL has many advantages, for example,

The best thing about SQL - you can stop relying on IT (and waiting for them) to generate those reports you need now!

You can automate tasks with SQL - reduce manual procedures.

Finally SQL skills are valued - SQL looks good on your CV.

The SQL SELECT statement

If we only need to know how use SQL to get data, all we need is the SELECT statement. The syntax is as follows.

SELECT
    <column>
    , <column>
    , <column>

FROM 
    <table>
WHERE 
    <where-clause>
ORDER BY 
    <column>
    , <column>;

Here is the simplest SQL statement and the first few rows of the data it returns. Note that we can write comments to explain why we are doing something. Single line comments start with a – and multi-line comments are surrounded with a /* and */.

--  The asterisk (pronounced 'star') means bring back all columns of the table
SELECT 
       * 
FROM dbo.PatientStay;
Patient ID AdmittedDate Hospital Ward Tariff Ethnicity DischargeDate
1 2024-02-26 Kingston Dermatology 2.2 White 2024-03-04
2 2024-02-26 Kingston Ophthalmology 8.3 Black British 2024-02-27
3 2024-02-26 PRUH Day Surgery 4.7 Asian 2024-02-28

Here is a more complex SQL statement and the data it returns.

SELECT
    ps.PatientId
    , ps.AdmittedDate
    , ps.DischargeDate
    , ps.Hospital
    , ps.Ward
    , DATEDIFF(DAY, ps.AdmittedDate, ps.DischargeDate) + 1 AS LengthOfStay
FROM
    PatientStay ps
WHERE
    ps.Hospital IN ( 'Oxleas', 'PRUH' )
    AND ps.AdmittedDate BETWEEN '2024-02-01' AND '2024-02-29'
    AND ps.Ward LIKE '%Surgery'
ORDER BY
    ps.AdmittedDate DESC
    , ps.PatientId DESC;
Patient ID AdmittedDate DischargeDate Hospital Ward LengthOfStay
150 2024-02-27 2024-03-06 Oxleas Day Surgery 9
130 2024-02-27 2024-03-05 Oxleas General Surgery 8
6 2024-02-26 2024-02-28 Oxleas Day Surgery 3
3 2024-02-26 2024-02-28 PRUH Day Surgery 3

A SQL SELECT statement with a GROUP BY and aggregate functions

A GROUP BY clause summarises data at a higher level. Here is the syntax.

SELECT <column>
       , <column>
       , AGG(<column>)

FROM <table>
WHERE <where-clause>
GROUP BY <column>, <column>
HAVING <having-clause>
ORDER BY <column>, <column>;

The SQL statement below does the following.

SELECT
    ps.Hospital
    , COUNT(*) AS NumberOfAdmissions
    , SUM(ps.Tariff) AS TotalTariff
FROM
    PatientStay ps
GROUP BY
    ps.Hospital
HAVING
    COUNT(*) > 10
ORDER BY
    NumberOfAdmissions DESC;

This is the result.

Hospital NumberOfAdmissions TotalTariff
Oxleas 15 96.1
Kings College 13 83.2
Kingston 12 70.9

That’s all we need to learn (for a single table)

We need only 5 operations on tabular data

Retrieve data across two tables into a single resultset with JOIN

SELECT
    ps.PatientId
    , ps.AdmittedDate
    , h.HospitalType
    , h.HospitalSize
FROM
    PatientStay ps
JOIN DimHospital h ON
    ps.Hospital = h.Hospital;

This SQL statement returns data that contains columns from both tables.

Patient ID Admitted Date Hospital Type Hospital Size
1 2024-02-26 General Small
2 2024-02-26 General Small
3 2024-02-26 Teaching Small
4 2024-02-26 General Medium

SQL editors

We need a SQL editor to write and run SQL. Examples are:

image

Some hints for writing SQL without tears

Use AI to solve SQL challenges

AI is a useful, practical, everyday tool, that we can use now to help us write SQL and learn SQL.

Generative AI large language models (LLMs) such as ChatGPT, Copilot, Claude, Gemini have free editions and are available from a browser.

Pair programmers help us write better SQL from inside the SQL editor e.g. GitHub Copilot inside VSCode editor

sql vscode copilot setup

Generative AI example

Write the SQL query based on the English language instruction.

Table PatientStay contains details about patient stays in hospital.
The columns are PatientId, AdmittedDate, DischargeDate, Hospital, Ward, Tariff and Ethnicity.
PatientId is the Primary Key, It is a whole number and must be unique.
AdmittedDate and DischargeDate are dates.
Hospital, Ward and Ethnicity are text fields, with a maximum length of 50 characters.
Tariff is a floating point number and represent the costs associated with that patient's stay in hospital.
All columns are mandatory, apart from DischargeDate and Ethnicity.

Here is an example prompt and response.

<example-prompt>

List patients admitted in February 2024 to a surgical ward (one ending with 'Surgery') in either the Oxleas or PRUH hospitals.
Show these details: patient id, admitted date, discharge date, hospital, ward.
Add another column LengthOfStay.
Sort by admitted date (latest first) then by Patient Id (highest first).

<example-prompt>

<example-response>

SELECT
    ps.PatientId
    , ps.AdmittedDate
    , ps.DischargeDate
    , ps.Hospital
    , ps.Ward
    , DATEDIFF(DAY, ps.AdmittedDate, ps.DischargeDate) AS LengthOfStay
FROM
    PatientStay ps
WHERE
    ps.Hospital IN ( 'Oxleas', 'PRUH' )
    AND ps.AdmittedDate BETWEEN '2024-02-01' AND '2024-02-29'
    AND ps.Ward LIKE '%Surgery'
ORDER BY
    ps.AdmittedDate DESC
    , ps.PatientId DESC;

</example-response>

As a custom GPT

Next steps in your SQL learning journey

The only way to learn is to practice. It’s best by far to practice your own data and SQL tools at work but if this is not possible, here are some alternatives

SQL database builders and data engineers have an interesting job

There are other helpful things that people who actually build databases need to consider (but we don’t). SQL can do all of this: