Using R With Small Data – Links

Here are the links and demos from my talk on ‘Using R With Small Data’ at the Infiniteconf in London on July 6th 2016.

This OneDrive folder contains the materials for the talk. The files are the slide deck (in PDF format) and the generated html from the R markdown document.

The European Banking Authority (EBA) 2016 Stress Test Results online map tool is here.

Data Insights ‘Visualising Financial Data’ Talk – Links

Here are the links and demos from my talk on ‘Visualising Financial Data Using Power BI and R’ at the Data Insights Summit.
The video recording on youtube is here.
The code for the demos are on my GitHub page .  They include
  • Text Mining of Corporate Responsibility Reports
  • Operational Risk And Controls Assessment (RCSA) Visualisations
  • Large Project Gantt Style Visualisations
  • Visualisations based on the Gapminder dataset
They include the Power BI Desktop files and the R Scripts.
The European Banking Authority (EBA) 2016 Stress Test Results are here.
The online map tool is here.  Hans Rosling’s famous YouTube video of ‘200 Years, 4 Minutes – The Joy of Stats’ is here.
The London Business Analytics Group meetup page is here.  This shows our upcoming events.  Recordings of a few of our previous talks are on the Skills Matter page here.

Links for ‘Banking On Machine Learning’ talk

I gave a talk to Microsoft Student Partners about incubating machine learning into a large organisation, in particular into a big bank.  The slide deck is here.  I mentioned several events and articles during the talk.  Here are the links.

Rohan Kopparapu (4th year UCL) and Microsoft Student Partner  wrote a blog post about the UCL Data Science Student Challenges.

The video recordings of the two recent talks  about building a data-driven culture in an organisation are  Transforming a Museum to be data-driven using R – Alice Daish and
From nothing to something, a local government journey- John Kelly.

The podcast of the BBC Radio 4 interview with Anthony Jenkins, ex-CEO of Barclays, and others about how fintech and ML are disrupting banking is available here.

I suggested some London-based meetups (LBAG, my one)

The Economist article on the change of culture at Microsoft and its focus on cloud and AI is here.

 Dr Andy Pardoe runs a comprehensive list of AI and ML resources at as well as a personal site.  He has been named a Top 30 AI influencer recently.

The abstract for the talk is below.

Banking On Machine Learning

Many of the audience will have recently attended a hackathon at UCL. Over a weekend participants applied Azure Machine Learning to financial datasets supplied by Bloomberg and derived some insightful results – a fantastic achievement. This was for many their first taste of a hackathon and data science.  However, for many organisations, getting to the stage where they are doing hard data science as part of a viable project is a long way into the journey of introducing and successfully taking advantage of ML.

This talk looks at the experience of starting ML at a few organisations then focuses on the challenges faced by big diversified banks.  We look at why banks are rushing to embrace ML and consider both the  opportunities and threats that ML poses.  We’ll consider the steps to introduce and incubate ML within the bank; through awareness, education, crowdsourcing, mobilisation (including hackathons) and finally implementation of ML as the new normal.

We’ll conclude that the successful introduction of ML into an organisation is about communication, culture, change, marketing, business analysis… and finally the hard data science of the sort that we saw in the hackathon.

Power BI March Update – Links

I gave the March round-up talk at the London Power BI User Group last night, before John Kelly’s inspiring talk, without the benefit of slide deck or demos due to the AV issues.  So here is my slide deck and links of the events and resources I mentioned during my talk. And I have posted  8 minute video to demo the Power BI desktop February and March updates on youtube, and also on OneDrive

The details of the SQLBI ‘Mastering DAX Workshop’ training in London on March 22-23 are here.

The PASS Business Analytics Marathon on March 29th features two Power BI talks from London PUG organisers; Prathy Kamasani at 6pm and David Moss at 7pm.  Register here.

The PBI motion scatter diagram of the Gartner Group Magic Quadrant of the last several years is here.

Now that we can theme our colours, the color brewer site is a very useful resource for selecting good palettes of categorical, sequential or diverging colours that forms best practice.

The Power BI team video of the March desktop update is here.

The data storytelling course by Albert Cairo is here.

The Economist produce wonderful charts on a daily basis in their Graphic Detail section of their website.

My London Business Analytics meetup group is having a talk on Wednesday next week (15th March).  The subject is transforming a museum to be data driven with R. Sign-up here.

BA Marathon: Using R To Clean And Transform Small Data – Links

This page contains the links and references for my talk “Using R To Clean And Transform Small Data” at the PASS BA Marathon on 14th December 2016.

The video recording and slide deck  are  here.

The Microsoft Professional Program Certificate in Data Science has two R courses.  These are  the Introduction To R For Data Science and Programming With R For Data Science.  I also recommend the DataCamp online R courses.

The London Business Analytics Group (LBAG) meetup page where you can see upcoming events is here and the videos of some of the past events are here.  The London Power BI User Group (PUG) meetup page is here.  Mark Butler’s blog of recent PUG and LBAG events is here.

Using R in Power BI Part (3) – Create a R Script Visual

The third way of using R in Power BI is to create an R Script visual.  This is especially useful when we want to create a visual such as small multiples that is not available (yet) in the suite of standard or custom PBI visuals.

 Image that we have a daily time-series for December 2015 for the profit and loss (P&L), and we want to show a trellis of bar charts; one chart  for each  region (Asia, EMEA and US) and one for a  Volcker flag (which is either In or Out). So we want to show six charts in all.   We also want to show a bar on each chart in a different colour for each of three clusters.   Given a data frame named dataset with date, Region, Volcker, Cluster and P&L columns, we can create a this in R using the ggplot2 library with the following code.

 ggplot R code

Once we have written and tested this code in RStudio, we can create a R Script Visual. The RScript visual provides  a dataframe (R’s equivalent of a table)  named dataset.  This contains  a column for each of the fields in the Values well – in this case Region, Cluster, VolckerFlag, Date and P&L.

 R Script Values Well

We can copy our R  snippet into the bottom of the R Script Visual code window. 

R script editor

 Once done we can see our chart in Power BI.

R visual


Using R in Power BI Part (2) – Execute A R Step in a Query

The second way of using R in Power BI is as a step in a query in the Query Editor. Imagine we have a Task query that has a TaskOwner column that contains a comma separated list of one or more owners – like the snapshot below.

Task Query

Our objective is to create a new query that contains the mapping of tasks to owners – this requires us to split the TaskOwner column and create a row for each task and owner.

Since we are creating an additional dataset, let’s start by duplicating the Task query and rename to TaskOwnerMap. We can now insert a R step (rightmost icon on the Transform ribbon below).

Run R Script button

This opens a dialog where we can paste our R code snippet.

run r script window

The code above uses the functions from the tidyr package. The separate function splits the TaskOwner columns into several columns, one for each owner. The gather function then unpivots these columns to create an attribute and value column. The  select chooses the columns required for our final dataset.  (It’s best to develop, debug and test in the RStudio or other development environment beforehand.)

This transforms the data as required – we’ve named our result dataframe and we expand this in the next step to see the columns.

QE Final TaskOwnerMap

This particular example (splitting a column) is more easily accomplished directly in the Query Editor than using R but I wanted to show a simple case to explain the mechanics of the process.

Using R in Power BI Part (1) – Load a R Script

There are three ways to use R in Power BI

  1. Load a R Script as a data source
  2. Add an R Script as a QE step – introduced in the July 2016 update
  3. Plot a R Visual

In this article we’ll look at the first of these.  The subsequent articles will look at the other two ways.  We’ll use as an example in a typical reporting and visualisation challenge taken from the area of report progress of task to meet certain regulatory requirements – in this instance the 11 principles for good aggregation and reporting of data issued in by the BCBS – details here if you are interested.

Imagine the source data looks like this.

Excel Compliance Task Data Source

This shows a row for each task with some properties (Key, Name, Start Date, End Date). It also shows which task are applicable for which regions (denoted by the x in the cell).  The set of regions is implicit in the column names.  It also shows the owners of each task – if there is more than one owner, these are in a comma separated-list. In a similar way, it also shows the principles covered by each task.  There are 11 principles and we know this list will only have integers between 1 and 11. We also have on a separate sheet the details of the principles

Excel Compliance Principle List

This is not a good structure for reporting and analysis.  We need to tease out the data into separate, related tables –

  • Task,
  • TaskPrincipleMap (to hold the many to many relationships between Tasks and Principles),
  • Region,
  • TaskRegionMap (again to hold the many to many relationships between Tasks and Regions)

and combine this with the Principle table to arrive at a data model that looks something like this

Target Data Map

The R script below uses the tidyr and dplyr packages to perform a sequence of data transformations to achieve exactly this.  These include projecting columns (select), filtering rows (filter), splitting a comma-separated column into individual values (spread) and unpivot (gather).  The result is a set of data frames (R’s equivalent of a table) that correspond to the tables required.

Compliance Data Transform R Code

We can paste this RScript into the “Load R Script“ window.

Get Data R Script


execute R script window

One of the nice features of this is that you can then choose exactly which dataframes created by the script to import – and that you can load several dataframes not just one.

Navigator Window

This then loads our data into the Query Editor below – each selected R data frame becomes a separate query.  (in the snapshot below, I have renamed the queries from R naming convention to more typical table / query names e.g. from  to TaskPrincipleMap)

QE view

In this case, we can do these operations directly in Power BI of course – but R has many advance statistical capabilities that are not in Power BI; decision tress, clustering, correlations.  Some of these are shown in the R Script Showcase here.

Visualising Operational Risks

Operational risks results differ from other types of financial risk results such as market or credit risk. The latter are defined in numerical ratio terms – examples include profit and loss, value-at-risk and exposure. We can add these measure together naturally.

However, operational risk results are usually expressed in ordinal terms. For example, an operational risk and controls self-assessment (RCSA) estimates the inherent risks, assesses the controls that are in place to avoid and mitigate these risks and then estimates the residual risks after these controls are applied. The effectiveness of each control is also estimated – typically on a scale from adequate, inadequate, vulnerable and worst-of-all not controlled.

There are a set of 7 industry standard well defined event type categories; internal fraud, external fraud, employment practices, clients & products, damage to physical assets, business disruption and execution & delivery. Each of these is broken down into sub-categories. For example, internal fraud splits into unauthorised trading, bribery and others. Furthermore, many parts of the bank submit a RCSA assessment and these may be organised into a hierarchy.

Risk Managers want to view these as heat maps. They would view initially at an aggregated level, for example, at event type category and across all regions. The aggregated values should be the “worst case” of the detail values so if a department has estimated low for misappropriation of assets, medium for mismarking and high for bribery then the internal fraud aggregated score would be high.

The snapshot below is a typical heatmap of RCSA scores, mocked up in Excel. The columns list the inherent risks, mitigating controls these residual risks. These are broken down by region. The rows list the standard operational risk event types

Mockup RCSA Heatmap

The data for this heatmap is a table – first few rows below.

Mockup RCSA Data

Let’s build this heatmap in Power BI Desktop. Our first problem is that these are ordinal results. L, M, H, VH values represent low, medium, high and very high scores for risks. Similarly, A, I, V, NC, values represent adequate, inadequate, vulnerable and not controlled for effectiveness of controls. So we add another dataset that helps us map these ordinal scores to a numeric weight to indicate the order of these scores. The actual values of those weights don’t matter, just the relative values.

Mockup RCSA Weights

We’ll merge these two datasets in the Query Editor to get a combined data set like this below

RCSA Query Editor Merged Data


We then create two measures

WorstCaseWeight = MAX(RCSAData[Weight])

This measure aggregates as worst-case manner as required.

WorstCaseScore = SWITCH([WorstCaseWeight], 10, “L”, 20, “M”, 30, “H”, 40, “VH”, 110, “A”,120, “I”, 130, “V”, 140, “NC”)

This measure maps our worst case weight into the appropriate ordinal score. Power BI does not allow a string field in the values well so we can’t add our score values directly but it does allow any measure – even if that measure returns a string.   This is very useful for us.

Now we can visualise our heatmap both at an aggregated and a detail level.

Mockup RCSA Heatmap in PBID - summary

Mockup RCSA Heatmap in PBID - detail


This gives us basically what we need with the exception of conditional formatting of the cells. We have a heat map without the “heat” which is unsatisfactory. What we need is to be able to map these categorical scores to colours and apply these colours to the cell background. This is not yet possible for the matrix visual, or indeed for any Power BI visual but I hope that this will be a features of upcoming update – or that someone will build a heat map custom visual.

Using the Colour Brewer With Conditional Formatting

The Power BI Desktop May Update introduced conditional formatting of numerical columns in tables. The snapshot below shows a table with two columns, one of which has had conditional formatting applied.

Table With Conditional Formatting

The conditional formatting dialog is a context menu on numeric columns in the values well

Conditiional Formatting Dialog

You can set the colours at the end points of the colour scale, the minimum or maximum values from a colour picker but also choose a customer RGB colour value (for example, #FF0000 representing pure red). There is a lot of choice but very little guidance on the best combination of colours. Fortunately the colour brewer website  can help here by suggesting a few good colour schemes.  This site was developed by Cynthia A. Brewer in the Geography department of Pennsylvania State University.

Colour Brewer Landing Page

The simplest approach is to take the RGB values of the first and last colours in a scheme and enter these as the custom colours for the minimum and maximum values.
For example, the PL column in the table above used # edf8fb as the colour for the minimum value and # 006d2c as the colour for the maximum value

Choosing a custom RGB colour

The conditional formatting dialog allows to choose a diverging colour scheme. This is useful for a range of values centred around zero where you want to emphasize large absolute values either positive or negative. For example, on a column of profit and loss values it is useful to highlight those periods with either large profits or losses.

Table With Diverging Formatting

This is the first appearance of conditional formatting and there are of course some improvements that suggest themselves. It would be helpful to be able to

  • To be able to set some defaults for personal, team or organisational colour schemes – currently every column has to be set up manually
  • To have a set of discrete colours mapped to ranges (as is provided by the colour brewer site)

In the longer run it would be very useful to have conditional formatting based on a categorical measure as well as a numeric range. For example, imagine a DAX calculation that returns either “Breach” or “OK” – we would like to highlight the breaches.  Of course we could fake this now by our DAX calculation return -1 or 1 for OK and Breach respectively.