Using SQL windowing functions to calculate new and returning customers

Alberto Ferrari has provided a fascinating example of using DAX to solve a typical problem – given a transaction history, how do we determine the counts of new and returning customers on each day. He describes this in his DAX In Action video on the Project Botticelli web site.

There are some parallels between the way that the windowing functions in SQL and the DAX CALCULATE functions work. Both have the concept of the doing an aggregation for each row based on the values of that current row. I wondered what was the best way of achieving the same using SQL – and also if we could identify the status of each customer on each day.

Imagine that we are trading with three customers A, B and C over three days 1,2 and 3. On day 1, we trade with customers A and B and since we have not traded before by definition these are both new customers. On day 2 we trade with B and C; B is a returning customer and C is new. On day 3 we trade with A, B and D – A and B are returning customers and D is new.

Our trade table looks like this. (To keep it simple, I’ve omitted all other columns other than the two needed for such a hypothetical example.)
SELECT date_id, customer FROM dbo.trade

Trade Table

The SQL statement
SELECT date_id ,
customer ,
COUNT(*) OVER (PARTITION BY customer ORDER BY date_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS count_customers_to_date,
CASE WHEN COUNT(*) OVER (PARTITION BY customer ORDER BY date_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) = 1 THEN 'New' ELSE 'Returning' END AS customer_status
FROM dbo.trade
ORDER BY date_id, customer

limits the counts the number of rows with the same customer as the current row (“PARTITION BY customer”) but during the count orders the rows by date (“ORDER BY date_id”) then only counts up to the current date (“ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW”). New customers will always have a count of 1, returning customers will have a greater count. It generates these results.

It’s good to see how concise the SQL is but an interesting point how scalable this query would be given a table with seven million rows say rather than just seven.