Machine Learning with R – Barbara Fusinska

On one of the coldest nights of the year, 100 people ventured out to hear Barbara (Basia) Fusinska talk about Machine Learning with R at the London Business Analytics Group. If you missed it, Skills Matter, our hosts for the evening, have recorded the event here. But if you want a quick summary in a few minutes, read on…

Barbara started by introducing machine learning (ML), gave a brief overview of R and then discussed three examples: classifying handwritten digits, estimating values in a socio-economic dataset, and clustering crimes in Chicago.

ML is statistics on steroids. ML uses data to find a pattern, then uses that pattern (the model) to predict results from similar data. Barbara used the example of classifying film genres as either action or romance based on the number of kicks and kisses. Barbara described supervised and unsupervised learning. Unsupervised learning is the “wild, wild west”: there are no labelled answers to train the model against, and it is much more difficult to understand how effective the results are.
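As a toy illustration of that idea (the numbers below are invented purely for illustration, not from the talk), a few labelled films can train a simple classifier that then predicts the genre of a new film – here a k-nearest-neighbours sketch in R:

```r
# Toy sketch of the kicks-and-kisses example; counts are invented.
library(class)  # provides knn()

films <- data.frame(
  kicks  = c(20, 15, 18, 1, 2, 3),
  kisses = c(2, 1, 4, 12, 18, 9)
)
genre <- factor(c("action", "action", "action",
                  "romance", "romance", "romance"))

# Predict the genre of an unseen film from its three nearest neighbours
new_film <- data.frame(kicks = 4, kisses = 14)
knn(train = films, test = new_film, cl = genre, k = 3)
#> [1] romance
```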

Back to supervised learning: it’s important to choose good predicting factors – in the movie example, perhaps the title, actors or script would have been better predictors than the number of kicks and kisses. Then you must choose the algorithm, tune it, and finally make it useful and visible by getting it into production – a hard job, especially when data scientists and software developers seem to be different tribes. (In fact Richard Conway gave a talk about this very topic back in November – it is here.)

Because R takes both the good and the bad from the open source world, it can introduce a little messiness. It was built by statisticians, not software engineers, so, for example, there are often several different ways of achieving the same thing. It’s a great language for exploratory data analysis and data wrangling: R can build a model and make beautiful visualisations. It’s easy to get started – just download R and the free version of RStudio.

Barbara explained a few data science algorithms both in a mathematical and an intuitive sense: k-nearest neighbours, Naïve Bayes and logistic regression. We use certain metrics to tell us how well our models predict. Accuracy is the number of correct predictions divided by the total number of predictions. It’s not always the best measure – for example, when testing for a rare disease, predicting that nobody has the disease is highly accurate but not very useful. Other measures such as precision, sensitivity and specificity may be more appropriate.
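As a sketch of how those metrics fall out of a confusion matrix, here is the rare-disease scenario with invented counts (10 genuinely sick patients out of 1,000):

```r
# Invented counts: 10 true disease cases in 1,000 patients
actual    <- factor(c(rep("disease", 10), rep("healthy", 990)))
predicted <- factor(c(rep("disease", 6),  rep("healthy", 4),    # the 10 sick
                      rep("disease", 12), rep("healthy", 978)), # the 990 well
                    levels = levels(actual))

cm <- table(predicted, actual)
TP <- cm["disease", "disease"]; FP <- cm["disease", "healthy"]
FN <- cm["healthy", "disease"]; TN <- cm["healthy", "healthy"]

accuracy    <- (TP + TN) / sum(cm)  # 0.984
precision   <- TP / (TP + FP)       # how many flagged patients are really sick
sensitivity <- TP / (TP + FN)       # how many sick patients we actually find
specificity <- TN / (TN + FP)       # how many healthy patients we clear

# Predicting "healthy" for everyone would score 0.99 accuracy here,
# but a sensitivity of 0 -- Barbara's rare-disease point.
```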
Our first machine learning task was to classify images of handwritten digits such as this one below.

The images had already been converted into a CSV file, with a row of 64 numbers to represent each image and a 65th column for the actual value (as classified by a person). Barbara showed just a few lines of R code to train a k-nearest neighbours model on a dataset of about 4,000 examples, then tested that model against a test dataset of about 2,000 examples. She showed the confusion matrix below, which compares the actual values in the test dataset against the values predicted by the model. As you can see, the model is very accurate. It has a few problems with 8s, mistakenly predicting a 9 or a 3 in a few cases.
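The exact script is in Barbara’s repository (linked at the end); a minimal sketch of the approach, assuming the CSV layout described above and hypothetical file names, looks like this:

```r
library(class)  # provides knn()

# Hypothetical file names; columns 1-64 are pixels, column 65 the label
train <- read.csv("digits_train.csv")
test  <- read.csv("digits_test.csv")

predicted <- knn(train = train[, 1:64],
                 test  = test[, 1:64],
                 cl    = factor(train[, 65]),
                 k     = 5)

# Confusion matrix: actual labels vs. model predictions
table(actual = test[, 65], predicted = predicted)

# Overall accuracy
mean(predicted == test[, 65])
```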

R has many datasets available out of the box. One of these, the Prestige dataset (shipped with the car package), has socio-economic data. R can quickly show the relationships between the variables with the pairs() function.
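A minimal sketch of that exploratory step (you may need install.packages("car") first):

```r
library(car)  # makes the Prestige dataset available

prestige <- na.omit(Prestige)  # a few professions have no recorded job type

# Scatterplot matrix of every pair of variables, coloured by job type
pairs(prestige[, c("education", "income", "women", "prestige")],
      col = prestige$type)
```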

That helps us to decide the parameters for a linear regression model to predict the prestige value from the other variables – education, income, the percentage of women in a profession, and job type (blue collar, white collar or professional, indicated above by the colour of the dots). R can build a model in a single line of code – in fact, building statistical models is what R was built for. The model can also tell us which of the variables are most useful in predicting prestige – and it’s important to look at these. There is other statistical information to determine how well the model fits the data: R-squared, p-values, and diagnostic plots such as residuals, scale-location and QQ plots.
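A sketch of that modelling step – one line for the model itself, plus the diagnostics mentioned above (the exact formula from the talk may differ):

```r
library(car)
prestige <- na.omit(Prestige)

# The model itself is a single line
model <- lm(prestige ~ education + income + women + type, data = prestige)

summary(model)  # coefficients with p-values, R-squared
plot(model)     # residual, QQ and scale-location diagnostic plots
```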

Barbara described the k-means algorithm and used it for clustering crimes in Chicago. R does a great job of clustering this data and then visualising it on the map below – and does it in very few lines of code.
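A hedged sketch of the clustering step – the file name and the Latitude/Longitude column names are assumptions about the Chicago crimes data, and the choice of five clusters is arbitrary:

```r
# Assumed file and column names for the Chicago crimes data
crimes <- read.csv("chicago_crimes.csv")
coords <- na.omit(crimes[, c("Longitude", "Latitude")])

set.seed(42)  # k-means depends on random starting centres
clusters <- kmeans(coords, centers = 5, nstart = 20)

# Plot the points coloured by cluster, with an X at each cluster centre
plot(coords, col = clusters$cluster, pch = 20)
points(clusters$centers, col = 1:5, pch = 4, cex = 2)
```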

Barbara expressed some scepticism about the usefulness of this: firstly, it is a moot point how many clusters to choose, but more importantly, if you can visualise the data then why use machine learning to cluster it? We could easily have split it into clusters by drawing lines on the map by hand. We should think hard about our data and our objectives as we crunch our data, build our models and visualise our plots and maps in R.

Barbara blogs at barbarafusinska.com and tweets at @BasiaFusinska. The R scripts for the three examples are available at https://github.com/BasiaFusinska/RMachineLearning.