Links for ‘A Month Of Predictive Analytics’

I gave a short talk titled ‘A Month Of Predictive Analytics’ to the first meeting of the wonderful Meetup Mashup group on 18th June 2015. Here are the resources mentioned in the talk.

My slide deck is here.

Recent News about AI and Machine Learning

Datasets

Tools

Books

  • Data Smart – John W Foreman
  • Practical Data Science With R – Nina Zumel and John Mount
  • Data Science For Business – Foster Provost and Tom Fawcett
  • Applied Predictive Analytics – Dean Abbott

Online Courses

My webinar for the PASS Business Analytics Virtual Chapter (BAVC) comparing tools for exploratory data analysis is here, together with previous webinars presented at PASS BAVC.

SQL Saturday Conference at Edinburgh ‘Learning R’ – Resources

Here are the links from my ‘Learning R through a typical BI task’ talk at the SQL Saturday Conference in Edinburgh on June 13th 2015.

The OneDrive folder contains

  • the slide deck
  • the R source code file for the Titanic demo
  • the CSV file containing the Titanic passenger list data used in the code above
  • the R source code file for the Iris demo

The books I mentioned in the Resources slide are

  • Data Smart – John W Foreman
  • Practical Data Science With R – Nina Zumel and John Mount

The course I mentioned in the Resources slide is

A presentation on Azure ML from Microsoft’s Andrew Fryer and Amy Nicholson, featuring a sample to predict flight delays, is on Amy’s OneDrive: http://1drv.ms/1B798bz.

If you have any questions about the slide deck or the demo resources, please get in touch.

PASS BAVC Webinar ‘Tools for EDA’ – Resources

I gave a talk on ‘Tools for Exploratory Data Analysis’ to the PASS Business Analytics Virtual Chapter (BAVC) on June 11th 2015. Here are the links to resources and references.

The talk is on YouTube here.

The OneDrive folder contains

  • a CSV file containing the Titanic passenger list data used in the example
  • the slide deck
  • the R source code file for the Titanic demo
  • the Excel 2013 workbook containing the Power Query and Power View sheets for the Titanic demo
  • a folder containing the RapidMiner repositories for the Titanic demo
  • the free Azure Machine Learning book (PDF)

The books I mentioned in the Resources slide are

  • Data Science For Business – Foster Provost and Tom Fawcett
  • Data Smart – John W Foreman
  • Applied Predictive Analytics – Dean Abbott
  • Power Query for Power BI and Excel – Chris Webb
  • Practical Data Science With R – Nina Zumel and John Mount

This ML workspace contains the Azure ML experiments. You can copy the experiments into your own ML workspace.

A presentation on Azure ML from Microsoft’s Andrew Fryer and Amy Nicholson, featuring a sample to predict flight delays, is on Amy’s OneDrive: http://1drv.ms/1B798bz.

If you have any questions about the slide deck or the demo resources, please get in touch. My contact details are mark.wilcock@zomalex.co.uk or on LinkedIn at https://uk.linkedin.com/in/zomalex.

PASS BA April 2015 “What If Scenarios” – Slide deck and Sample Code

I gave a presentation about “What If Scenarios” at the PASS Business Analytics Conference in April 2015. Here are the slide deck and the zip file containing the code samples.
The zip file contains:

  • the spreadsheet for the historical analysis of population and emissions. This includes the PowerPivot model and the Power View visualisations.
  • the R code to project emissions from 2010 through to 2050 under various scenarios
  • the “GC building blocks” spreadsheet that contains examples of the common Excel formulas used in the Global Calculator (SUMIFS, INDIRECT, INDEX and MATCH)

The Global Calculator can be found at globalcalculator.org, including the online tool, a downloadable Excel spreadsheet and lots of great documentation.

Slides and sample code from “What If In Excel & R” presentation

I gave a talk to the London Business Analytics Group on “Using R and Excel to Explore ‘What If’ Scenarios” recently.

The slide deck from the talk is here.

A zip file of the R sample code for the 2050 project example is here. This includes the R code and an RData object holding the 2010 emissions data, which is loaded by the code.
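
To give a flavour of the approach without reproducing the sample code, here is a minimal, illustrative R sketch of a scenario projection of this kind. The object names, growth rates and scenario names below are invented for the example; the actual code and data are in the zip file.

# Illustrative sketch only – object names, growth rates and scenario names
# are invented; the real sample code loads its baseline from an RData file, e.g.
# load("emissions2010.RData")

# Hypothetical stand-in for the 2010 baseline emissions (GtCO2e by region)
emissions2010 <- data.frame(
  region    = c("Europe", "Asia", "Americas"),
  emissions = c(4.2, 18.5, 8.1)
)

# Hypothetical annual growth rates for three what-if scenarios
scenario_growth <- c(BusinessAsUsual = 0.015, ModerateAction = 0, StrongAction = -0.02)

years <- 2010:2050

# Project total emissions year by year under each scenario
projections <- do.call(rbind, lapply(names(scenario_growth), function(s) {
  data.frame(
    scenario  = s,
    year      = years,
    emissions = sum(emissions2010$emissions) * (1 + scenario_growth[[s]])^(years - 2010)
  )
}))

head(projections)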

I showed the Global Calculator, a model of the world’s energy and food systems implemented as an Excel spreadsheet and also accessible through a web page. To download the spreadsheet or use the web application, start at the Global Calculator landing page.

The spreadsheet that explains and demonstrates the Excel formulae used in the calculator (SUMIFS, INDEX, MATCH and INDIRECT) is here.
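
For R users, the same patterns translate directly. The sketch below shows rough R equivalents of SUMIFS and INDEX/MATCH on an invented lever table; it is not taken from the calculator itself.

# Rough R equivalents of the Excel patterns, on an invented table of levers
levers <- data.frame(
  sector = c("Transport", "Transport", "Buildings", "Buildings"),
  lever  = c("L1", "L2", "L1", "L2"),
  value  = c(10, 20, 5, 8)
)

# SUMIFS equivalent: sum 'value' where sector is "Transport"
sum(levers$value[levers$sector == "Transport"])

# INDEX/MATCH equivalent: look up the value of lever "L2" in the Buildings sector
buildings <- levers[levers$sector == "Buildings", ]
buildings$value[match("L2", buildings$lever)]

# INDIRECT has no direct analogue; in R you would typically build the name of an
# object or column at run time and index with it, e.g. levers[["value"]]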

A layman’s guide to BCBS239

This is a draft for a talk I’m giving in a few months – comments welcome

In January 2013, the Basel Committee on Banking Supervision (BCBS) issued a document titled “Principles for effective risk data aggregation and risk reporting”. It is generally known as “BCBS 239”. The document sets out guidelines for effective reporting of risk to the board and senior management of banks.

So does BCBS 239 matter? Well, if a bank is a SIFI (systemically important financial institution), big enough that its failure could cause instability in the financial system, then it certainly matters, since such banks need to comply with the guidelines, some as early as 2016. But I would argue these regulations are important even if a firm is not a SIFI, simply because BCBS 239 describes best practice on this important topic and does so concisely and clearly. The document says that its objectives include enhanced risk management and decision making, and improved strategic planning – for the ultimate good of the firm and its shareholders.

The BCBS has the interests of taxpayers rather than shareholders in mind. Its motivation for issuing the document arises from the crisis of 2007, when some risk departments could not cope and it was impossible to get good risk information for proper decision making. In any future crisis, the regulators want good, timely risk information so that they can wind up the bank or perhaps organise a recovery.

The document defines 14 principles and groups them into four areas: governance, aggregation, reporting and supervisory review. The table below summarises these.

Area                            #    Summary
Governance and Infrastructure   1    Strong governance “in the round”
                                2    Risk IT must be able to cope in a crisis
Aggregation                     3    Achieve integrity and accuracy via automation
                                4    Completeness at group level but also drill down to entity
                                5    Timeliness – especially in a crisis
                                6    Adaptability – for changing circumstances
Reporting                       7    Accuracy through reconciliation and validation
                                8    Comprehensiveness – include all material risks
                                9    Clarity – to facilitate informed decision making
                                10   Frequency of reporting – appropriate to nature of risk
                                11   Distribution to right people (and only them)
Supervisory Review              12   Periodic review and spot tests
                                13   Supervisor’s power to require remedial action
                                14   Supervisor home/host co-operation

Governance and management controls and processes are the overriding preoccupation; these include controls, roles, ownership of data and independent validation. The document draws a rather arbitrary distinction between aggregation and reporting. Aggregation is the gathering of all the data together, which is a prerequisite for reporting – providing useful, relevant summaries to inform better decisions. Supervisory review describes the supervisors’ powers to carry out periodic assessments (including spot tests) and to require the banks to take remedial action if they fail these tests.

There are several themes that resonate through the document. There are many references to ensuring that risk reporting works well not only in normal circumstances but also in a crisis – clearly this reflects the experience of the 2007 crisis.

The board and senior management must be aware and in control of all aspects of risk reporting. They must define the information they need and must be aware of the weaknesses and omissions in the reports they receive. This holds true even if they delegate parts of the process to third parties. The buck stops with the board.

The document states the ‘what’ but not the ‘how’. There are very few hard and fast rules on how the guidelines should be implemented. The only occasions when the guidelines are explicit are to state that banks must be able to report certain risks intra-day in a crisis situation and to stipulate the minimum content for a risk report. But for the rest, well it depends on what’s appropriate in the firm’s situation.

The BCBS prefers automation of risk data processes. It has a strong suspicion of any manual process – these should be the temporary (and well documented and independently validated) exception rather than the rule.

“Forward-looking” reports are encouraged. These include stress-test and scenario-based reports.

The entire BCBS 239 document can be read from start to finish in an hour – it is less than 30 pages. For those in a hurry, I would suggest starting with Annex 1, which very clearly defines the terms – so you appreciate the BCBS’s distinction between key terms, e.g. the difference between accuracy and precision or between completeness and comprehensiveness. Annex 2 is a summary of the 14 principles. For most people this will be enough, but for more detail the initial sections cover the motivation behind these regulations, the context and then more background to the principles.

BCBS 239 reflects a traditional reporting situation – it assumes that senior management receives its risk data as canned reports, hence the distinction between aggregation and reporting. It assumes a world where the summary risk data is selected and communicated to the board by risk management and risk IT. It does not consider the emerging world of self-reliant, self-service reporting where risk managers have good tools to visually explore and analyse the data. This is probably a reflection of the current status quo in most banks.

In summary, BCBS 239 is a very useful document, and not only for those firms to which the regulations apply. It provides well thought-out, well drafted guidelines with implications for all processes involved in ensuring that the board of a firm can take well informed decisions about the management of its financial risk.

Analysis of London SW12 property price history

I live in a part of south London called Balham. It’s about six miles south of the central area and part of a postal district named SW12 (the 12th south-west district in London). It was famous a long time ago for being lampooned in a radio comedy as “Balham, Gateway to the South”, but things have changed. I’ve lived here for over 20 years, property prices have shot up, and I wanted to analyse this.

The UK Land Registry provides details of every one of the 18 million property transactions since 1995 in one rather large CSV file, which can be found here. This contains the price paid, the date of transfer, the postcode and the house number. An example of a postcode is SW12 0EN. A UK postcode narrows down to a set of about 20 properties, usually on the same street; together with the house number it uniquely identifies the property. There are over 600 postcodes in SW12 and over 1 million in the UK.

My friends at the Information Lab have kindly made publicly available a dataset that geocodes every postcode in the UK. The details are in this blog post. Geocoding ties a location, in this case a postcode, to a latitude/longitude so Tableau can map it. So now I can plot all the house purchases on a map.
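
The maps themselves were built in Tableau, but the data preparation can be sketched in R. The file names and column names below are assumptions for illustration, not the exact headers of the Land Registry or Information Lab files.

# Illustrative sketch – file names and column names are assumptions
# Price-paid data, assumed already trimmed to the columns of interest
prices <- read.csv("price_paid.csv", stringsAsFactors = FALSE)
# assumed columns: price, date, postcode, house_number

# Keep only the SW12 postal district
sw12 <- prices[grepl("^SW12 ", prices$postcode), ]

# Geocoded postcodes (assumed columns: postcode, latitude, longitude)
geo <- read.csv("uk_postcodes_geocoded.csv", stringsAsFactors = FALSE)

# Attach latitude/longitude to each sale so the points can be mapped
sw12_geo <- merge(sw12, geo, by = "postcode")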

This visualisation compares property sale prices in 1995 and 2013. The size of each circle represents the purchase price. Hover over a particular spot to see the sale details.

This motion chart shows the 14,000 property transactions since 1995, year by year. Running it through the years gives an impression of the activity, including the dip in 2008 due to the credit crisis.

I’m only interested in my neck of the woods but the Land Registry data obviously covers the whole country.

Obviously you can do a lot more with this data – for example, filter by price paid to look only at properties within a certain price band, or look at those properties that have been bought and sold several times in the last 18 years and use these to calculate average annual price increases, as sketched below.
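
As an illustration of the repeat-sales idea, here is a rough R sketch that pairs the first and last sale of each property and computes an average annual price increase. It continues the assumed sw12_geo data frame and column names from the sketch above.

# Rough sketch: average annual price increase from properties sold more than once
sw12_geo$property <- paste(sw12_geo$postcode, sw12_geo$house_number)
sw12_geo$year     <- as.integer(format(as.Date(sw12_geo$date), "%Y"))

repeat_sales <- do.call(rbind, lapply(split(sw12_geo, sw12_geo$property), function(p) {
  p <- p[order(p$year), ]
  n <- nrow(p)
  if (n < 2 || p$year[n] == p$year[1]) return(NULL)   # need two sales in different years
  data.frame(
    property = p$property[1],
    years    = p$year[n] - p$year[1],
    # compound annual growth rate between the first and last sale
    cagr     = (p$price[n] / p$price[1])^(1 / (p$year[n] - p$year[1])) - 1
  )
}))

mean(repeat_sales$cagr)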

Charting A Mashup Stress Scenario Using R

From the previous post we have our mashup stress scenario data in R, and now we’d like to plot it. We could of course save the R data to a database or file, import it into a data visualisation tool such as Tableau and chart it there. However, R has some powerful charting capabilities, and this article will demonstrate them.

Let’s create a lattice chart with each of the five scenarios in a different panel. In each panel, we’ll show the stress P&L by counterparty. Firstly, we need to summarise the data. Figure 1 shows the data frame.
[Figure 1]

Figure 2 shows the R code to sum the stress P&L and group by the two categories we want to see in our plot (counterparty and scenario) and to build the plot.
[Figure 2]

Figure 3 shows the resulting plot.
[Figure 3]
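
The figures are not reproduced here, so the following is a rough sketch of the kind of aggregation and lattice code described in Figure 2, rather than the code itself. The stress data frame, its column names and most of the scenario names are invented for the example.

library(lattice)

# Invented stand-in for the stress P&L data (one row per trade per scenario)
set.seed(1)
stress <- expand.grid(
  scenario     = c("Grexit", "Rally", "Scenario3", "Scenario4", "Mashup"),
  counterparty = c("CP A", "CP B", "CP C"),
  trade        = 1:10
)
stress$pnl <- rnorm(nrow(stress), mean = 0, sd = 1e6)

# Sum the stress P&L by counterparty and scenario
summary_df <- aggregate(pnl ~ counterparty + scenario, data = stress, FUN = sum)

# Lattice chart: one panel per scenario, stress P&L by counterparty in each panel
barchart(pnl ~ counterparty | scenario, data = summary_df,
         layout = c(5, 1),
         xlab = "Counterparty", ylab = "Stress P&L")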

We may like to see a different perspective on the data – perhaps as a clustered bar chart showing the stress P&L by scenario for each counterparty. Figure 4 shows the code to generate the plot and Figure 5 the result.

[Figure 4]

[Figure 5]
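
Again as a stand-in for the code in Figure 4, and reusing the summary_df built in the sketch above, a clustered bar chart can be produced with lattice’s groups argument:

# Clustered bar chart: stress P&L by scenario for each counterparty
barchart(pnl ~ counterparty, groups = scenario, data = summary_df,
         auto.key = list(columns = 3),
         xlab = "Counterparty", ylab = "Stress P&L")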

This article just scratches the surface of R’s plotting capabilities. It’s also possible to create box plots, line charts and scatter plots, to control the detail of each chart precisely, to lay out several charts on a single page and to produce charts in various formats, including PDF.

Calculating A Mashup Stress Scenario Using R

A previous article described the business case for creating a mashup scenario quickly. Often, this needs to be a self-service solution created by risk managers or business analysts themselves. One approach is to use R, an open-source language for statistical computing that has very good capabilities for transforming data. This article will use R to calculate the mashup scenario. The next article will use R to chart the results.

The first step is to load the stress data of the four original scenarios into an R data frame (R’s equivalent of a database table).  Figure 1 shows some of the stress data in an R data frame.
[Figure 1]

Now we can build the mashup scenario in a few steps.  The previous post described the business rules.  The first rule is to copy over trades from the Grexit scenario where either the issuer or counterparty is from Greece.  Figure 2 shows the single line of code necessary.
[Figure 2]

The next step is to add trades from the Rally scenario where the instrument is denominated in euros. However, as this is not the first step, we have to be careful not to copy over any trades already in the mashup scenario. Figure 3 shows the code.
[Figure 3]

The following steps to append the trades from the other two scenarios are similar. The final steps are just tidying up the scenario names and appending the mashup data set onto the original data set. Figure 4 shows the newly created mashup scenario rows.
[Figure 4]

Figure 5 shows the code in its entirety.  It shows how R can perform quite complicated operations very concisely. 
[Figure 5]
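
Since the figures are not reproduced here, the sketch below illustrates the kind of logic described above. The data frame, its column names and the scenario names other than Grexit and Rally are invented for the example; it is not the code shown in the figures.

# Illustrative sketch of the mashup logic – column names are assumptions
# Invented stand-in for the stress data of the four original scenarios
scenarios <- data.frame(
  scenario       = rep(c("Grexit", "Rally", "Scenario3", "Scenario4"), each = 3),
  trade_id       = rep(1:3, times = 4),
  issuer_country = rep(c("GR", "DE", "US"), times = 4),
  cpty_country   = rep(c("FR", "GR", "US"), times = 4),
  instrument_ccy = rep(c("EUR", "EUR", "USD"), times = 4),
  stress_pnl     = rnorm(12, mean = 0, sd = 1e6),
  stringsAsFactors = FALSE
)

# Rule 1: take trades from the Grexit scenario where the issuer or counterparty is Greek
mashup <- subset(scenarios, scenario == "Grexit" &
                            (issuer_country == "GR" | cpty_country == "GR"))

# Rule 2: add euro-denominated trades from the Rally scenario,
# skipping any trades already picked up by rule 1
rally <- subset(scenarios, scenario == "Rally" &
                           instrument_ccy == "EUR" &
                           !(trade_id %in% mashup$trade_id))
mashup <- rbind(mashup, rally)

# ... the trades from the other two scenarios are appended in the same way ...

# Tidy up the scenario name and append the mashup rows to the original data set
mashup$scenario <- "Mashup"
scenarios <- rbind(scenarios, mashup)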

Moving towards real-time risk reports using in-memory column-based structures

Risk management reporting is moving from overnight processing to delivery of reports and results in near real-time. The results need to be available within a few minutes rather than a few hours of the data becoming available to the risk data warehouse. For example, a market risk middle office (MRMO) team may be under “T+1” deadline pressure to produce daily results and reports within a few minutes once the final adjustments for the day have been loaded into the database.

One step on the way to achieving this is the use of in-memory, column-based tables and indexes. Most BI vendors are now incorporating this technology. My tests used Microsoft’s implementation – the columnstore index in the latest version of its SQL Server database. This is designed to make the typical SQL queries used in analytics run much, much faster.

The tests were designed to see whether these techniques work on typical risk data sets and, if so, what performance improvements are possible. I set up a database with the typical structure of a market risk data warehouse and a reasonable amount of data: 70 million rows. This typically represents 2–3 days of data for an asset class, so on a data set of this size you can run reports comparing results from the latest business day with the day before.

I then created a set of SQL queries typically used in risk reports. I ran these against the data with and without the columnstore index applied. The results were impressive – over a suite of tests, the performance improvements were between 11 and 77 times. In the better cases, this translates into a 15-minute query becoming a 15-second one.

These techniques also avoid the step of “reprocessing the cube”, so they save valuable minutes here too. Cubes are useful for rich exploratory analysis of the data, but we no longer need them for generating the basic results.