Risk management reporting is moving from overnight processing to delivery of reports and results in near real-time. The results need to be available within a few minutes rather than a few hours of the data becoming available to the risk data warehouse. For example, a market risk middle office (MRMO) team may be under “T+1” deadline pressure to produce daily results and reports within a few minutes once the final adjustments for the day have been loaded into the database.
One step on the way to achieving this is the use of in-memory column-based tables and indexes. Most BI vendors are now incorporating this technology. My tests used Microsoft’s implementation – the columnstore index in the latest version of its SQL Server database. This is designed to make the typical SQL queries used in analytics run much, much faster.
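As a sketch of what this looks like in SQL Server (the table and column names here are hypothetical, not from my test database):

```sql
-- Hypothetical risk-results fact table. The nonclustered columnstore
-- index lists the columns that the analytic queries filter and
-- aggregate over.
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_risk_result_cs
ON dbo.risk_result (business_date, book_id, instrument_id, measure_id, value);
```

Note that in the 2012 release a table becomes read-only while a nonclustered columnstore index exists on it, so in practice the index is dropped and rebuilt around each day’s load.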
The tests were designed to see if these techniques can work on typical risk data sets and, if so, what performance improvements are possible. I set up a database with the typical structure of a market risk data warehouse and a reasonable amount of data – 70 million rows. This typically represents 2-3 days of data for an asset class, so on a data set of this size you can run reports comparing results from the latest business day with the day before.
I then created a set of SQL queries typically used in risk reports. I ran these against the data with and without the columnstore index applied. The results were impressive – over a suite of tests, the performance improvements were between 11 and 77 times. In the better cases, this translates into a 15 minute query becoming a 15 second one.
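The queries in the suite were aggregations of roughly this shape (again with hypothetical names; the dates stand in for the two most recent business days):

```sql
-- Typical risk-report query: total of a measure by book for the latest
-- business day and the day before, ready for a day-on-day comparison.
SELECT business_date, book_id, SUM(value) AS total_value
FROM   dbo.risk_result
WHERE  business_date IN ('20130501', '20130502')
GROUP BY business_date, book_id;
```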
These techniques also avoid the step of “reprocessing the cube”, saving valuable minutes there as well. Cubes are useful for rich exploratory analysis of the data, but we no longer need them for generating the basic results.
Alberto Ferrari has provided a fascinating example of using DAX to solve a typical problem – given a transaction history, how do we determine the counts of new and returning customers on each day. He describes this in his DAX In Action video on the Project Botticelli web site.
There are some parallels between the way that the windowing functions in SQL and the DAX CALCULATE function work. Both have the concept of performing an aggregation for each row based on the values of that current row. I wondered what the best way of achieving the same result in SQL was – and also whether we could identify the status of each customer on each day.
Imagine that we are trading with four customers A, B, C and D over three days 1, 2 and 3. On day 1, we trade with customers A and B; since we have not traded before, by definition these are both new customers. On day 2 we trade with B and C; B is a returning customer and C is new. On day 3 we trade with A, B and D – A and B are returning customers and D is new.
Our trade table looks like this. (To keep it simple, I’ve omitted all columns other than the two needed for this hypothetical example.)
SELECT date_id, customer FROM dbo.trade
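From the scenario described above, that query returns these seven rows:

```
date_id  customer
-------  --------
1        A
1        B
2        B
2        C
3        A
3        B
3        D
```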
The SQL statement
SELECT date_id ,
       customer ,
       COUNT(*) OVER (PARTITION BY customer ORDER BY date_id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS count_customers_to_date ,
       CASE WHEN COUNT(*) OVER (PARTITION BY customer ORDER BY date_id
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) = 1
            THEN 'New' ELSE 'Returning' END AS customer_status
FROM   dbo.trade
ORDER BY date_id, customer
counts the number of rows with the same customer as the current row (“PARTITION BY customer”), but during the count orders the rows by date (“ORDER BY date_id”) and then counts only up to and including the current row (“ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW”). New customers will always have a count of 1; returning customers will have a greater count. It generates these results.
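Working the window function through by hand on the seven example rows gives:

```
date_id  customer  count_customers_to_date  customer_status
-------  --------  -----------------------  ---------------
1        A         1                        New
1        B         1                        New
2        B         2                        Returning
2        C         1                        New
3        A         2                        Returning
3        B         3                        Returning
3        D         1                        New
```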
It’s good to see how concise the SQL is, but it is an interesting question how scalable this query would be given a table with, say, seven million rows rather than just seven.