R vs Python: memory use

I've recently been trying to build some machine learning skills by working on prediction challenges from Kaggle. Picking up some R skills was also a goal. I was working on the San Francisco Crime project and planned to start by growing a Random Forest for the full training set, dropping some of the predictors (n=878050, p=5).

After debugging the code with a small sample of the data, I ran it on the full dataset and my laptop ran out of memory. I switched to the Google Cloud Compute instance I'm trialling: 2 CPUs, 13GB memory + 7GB of swap. Still not enough memory to build a single-tree Random Forest in R. Next I filtered the training data and tried to build a Random Forest for just one of the 10 Districts. Again the forest couldn't be built due to insufficient memory, even on the big Google instance.

Thinking this was getting ridiculous, I ported the code to Python to use Scikit-Learn. I ran the same single-tree random forest, with one forest per district held in memory at the same time, and the models built fine on my low-spec laptop while it shared memory with Chromium, OpenOffice, etc. I intentionally avoided optimising the Python code so it had no head-start on the R code. These are the two versions of the random forest:

analyse.R
analysis.py
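The Python side boiled down to something like this rough sketch (column names and the choice of predictors here are illustrative, not the exact analysis.py):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # One single-tree forest per district, all kept in memory at once.
    # Column names are illustrative - adjust to the actual Kaggle CSV.
    train = pd.read_csv('train.csv')
    train['DayOfWeek'] = train['DayOfWeek'].astype('category').cat.codes

    forests = {}
    for district, subset in train.groupby('PdDistrict'):
        X = subset[['DayOfWeek', 'X', 'Y']]                # a handful of predictors
        y = subset['Category']                             # the crime category to predict
        forest = RandomForestClassifier(n_estimators=1)    # a single-tree forest
        forests[district] = forest.fit(X, y)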

Big R commands

Run and walk away from the terminal with:
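    # e.g. launch the R script above, log its output and keep it running after logout
    nohup Rscript analyse.R > analyse.log 2>&1 &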

Set up more swap memory (7GB in this case):
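    # file name and size are just what I used here
    sudo fallocate -l 7G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile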

In this case the memory is instantly available but not on reboot, unless added to fstab with this line:
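    # assuming the swap file created above
    /swapfile    none    swap    sw    0    0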

Branching and merging in Git

A rough workflow for branching and merging (branch names below are just examples):
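    git checkout -b my_feature            # create and switch to a new branch
    # ... edit files, then commit as usual ...
    git add -A
    git commit -m "Implement my feature"
    git checkout master                   # switch back to the main branch
    git pull                              # make sure master is up to date
    git merge my_feature                  # merge the feature branch in
    git push
    git branch -d my_feature              # delete the merged branch locally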

Use BFG to clear secret Git history

Pasted those nuclear launch codes into a file and committed the damn thing to Git again? You can lose a bunch of time purging it with plain git commands, or download BFG and use these simple commands once you've pushed a clean version of the file (the repository URL and file name below are placeholders):
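    # secrets.txt is the offending file - a hypothetical name
    git clone --mirror https://github.com/youraccount/your_repo.git
    bfg --delete-files secrets.txt your_repo.git     # or: java -jar bfg.jar --delete-files ...
    cd your_repo.git
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
    git push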

Now delete your main repository and get a fresh clone from GitHub or wherever.

Split a GitHub folder into a separate repo

You have a repository on GitHub called "all" and within it a folder "folder_to_become_repo". You want to break this folder out into a separate repository in your GitHub account. Obviously, use your account name, not noelevans. Use these commands to make the change:
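    # Work on a fresh clone of the original "all" repository
    git clone https://github.com/noelevans/all.git
    cd all

    # Rewrite history so only the folder's commits remain, with it as the new root
    git filter-branch --prune-empty --subdirectory-filter folder_to_become_repo master

    # Point at a new, empty GitHub repository (create it in the web UI first)
    git remote set-url origin https://github.com/noelevans/folder_to_become_repo.git
    git push -u origin master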

You'll be prompted for your GitHub account name and password when you do the push. After this you have a new repo in GitHub containing all of the contents of the directory "folder_to_become_repo". The history has been migrated to the new repo too!

Now clone the repository in your normal workspace and it's ready to go!

Quick note: Statistical Learning terms

Cross-validation

A choice between k-fold cross-validation (k is normally 5 or 10) and leave-one-out cross-validation (LOOCV). K-fold is the more commonly chosen option.

Ridge Regression

A regularisation method for linear and other regression forms that limits the size of the parameters used to model the system: the more parameters a model has, the more unstable their estimated weights tend to be. The penalty shrinks coefficients towards zero and drives models to be simpler.

Lasso Model

An improvement on Ridge Regression whose penalty can drive some coefficients exactly to zero, minimising the number of predictors used to build the model - making it a sparse rather than a dense model.
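A small scikit-learn sketch tying the three terms together (synthetic data, illustrative penalty strengths):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso
    from sklearn.model_selection import cross_val_score

    # Synthetic data: 100 observations, 20 predictors, only 3 truly relevant
    rng = np.random.RandomState(0)
    X = rng.randn(100, 20)
    y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.randn(100) * 0.5

    for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
        scores = cross_val_score(model, X, y, cv=5)    # 5-fold cross-validation
        model.fit(X, y)
        n_used = np.sum(model.coef_ != 0)              # lasso zeroes out most coefficients
        print(type(model).__name__, scores.mean().round(3), n_used)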

Big Data Exchange 2014

On 26 November 2014, I attended the Big Data Exchange held at Skills Matter, London.  At the time, I was becoming more interested in using programming skills in more analytical / quantitative ways rather than the traditional software development I had been doing for the past 7.5 years. I had recently finished a Coursera module in Data Science and my last textbook on the field was John Foreman's (excellent) book, Data Smart.

I saw great talks at the conference, from NASA's Chief Data Scientist to Felienne Hermans on interesting Excel add-on tools, among others. The talks I thought were most interesting were the statistical methods described by Abigail Lebrecht of uSwitch.com and Andrew Clegg's presentation on clearer, more honest ways to visualise data.

Automating answers to big data traffic questions - Abigail Lebrecht

Lebrecht's talk stemmed from the commercial statistician's daily problem: making time for deeper investigations into user data when there are daily / sporadic requests from management to explain why User Visits and (Browser to Buyer) Conversions drop at a given time.

Firstly, a point was made about showing statistical significance: drops should not be seen in absolute terms; they should take into account the likelihood of error / random fluctuation which might account for the change.  uSwitch employ Change Point Detection to identify when metrics vary beyond expected tolerances.

Change point detection: tries to identify times when the probability distribution of a stochastic process or time series changes [Wikipedia]

Specifically, the uSwitch Analytics team use Bayesian Changepoint Detection, which yields probabilities that a change point in a time series is a systematic change rather than a random fluctuation: changes in the series below a threshold would indicate just noise.  The speaker said the threshold to indicate a statistically significant change was a probability > 0.4, as their team's rule of thumb.  The speaker recommended this paper which explains the technique: Bayesian Online Changepoint Detection.  The method is implemented in the bcp package of R, which returns a posterior.prob value: the likelihood of the change being significant.
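As a rough sketch of what that looks like with bcp (the daily_visits vector here is hypothetical):

    library(bcp)                           # Bayesian change point analysis
    fit <- bcp(daily_visits)               # daily_visits: numeric vector of the metric over time
    plot(fit)                              # posterior means and change point probabilities
    which(fit$posterior.prob > 0.4)        # points exceeding the team's rule-of-thumb threshold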

This detection method is used to identify changes in visits / customer conversion for specific products.  If significant changes do occur, uSwitch analyse Twitter sentiment on keywords associated with the specific product to find correlation with the sales data.  Lebrecht mentioned the use of Machine Learning techniques and then described the Natural Language Processing steps used to analyse Tweets.  It was explained that a Stop Word List is needed – a list of words that should not be analysed for sentiment because they are the subject of the Tweet.  E.g. in the Tweet "I think Matlab is an excellent language", the sentiment / occurrence count of "Matlab" is not relevant.  What does matter is whether Tweets about Matlab containing the word "excellent" occur more or less frequently over time.  The Stop Word List should also include words like "because", "where" and "think", which come up frequently in a sentence but give no indication of sentiment.
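A toy sketch of that filtering step (the word lists are illustrative, not uSwitch's):

    from collections import Counter

    # Words that identify the subject or carry no sentiment - not to be scored
    stop_words = {'matlab', 'because', 'where', 'think', 'i', 'is', 'an', 'a', 'the'}

    def sentiment_terms(tweet):
        """Return the words in a tweet that are worth counting for sentiment."""
        return [w for w in tweet.lower().split() if w not in stop_words]

    tweets = ["I think Matlab is an excellent language"]
    counts = Counter(w for t in tweets for w in sentiment_terms(t))
    print(counts)     # Counter({'excellent': 1, 'language': 1})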

Lebrecht then described building Heatmaps to visually show the change of language over time.  The 100 most frequently used terms are listed against a date range.  The matrix this forms is then coloured to show the frequency of each word's use: a darker colour indicates greater use of the corresponding word.  Analysts then use the matrix to get an idea of the phenomena at play when the change in sales data occurred.
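Something along these lines, for a hypothetical word-by-date count matrix (random data stands in for real Tweet counts):

    import numpy as np
    import matplotlib.pyplot as plt

    # counts[i, j]: how often the i-th of the top 100 terms appeared on day j
    counts = np.random.poisson(lam=3, size=(100, 30))

    plt.pcolormesh(counts, cmap='Greys')     # darker cells = heavier use of the word
    plt.xlabel('Day')
    plt.ylabel('Term (top 100 by frequency)')
    plt.colorbar(label='Mentions')
    plt.show()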

Finally, the speaker discussed attributing a sale to the advertising methods employed to bring it about: if a user came to abc.com four times before making a purchase, how do you weight the importance of each advert, and how do you weight the allocation of advertising budget in the future?  The environment that contributes to a sale can be complex – likened to the effectiveness of a player in a football team: a midfield player may score few goals but may be an integral part of a team that creates many opportunities for others.  The Shapley Value is a measure of this usefulness and can be assigned to football players or advertising providers.  Lebrecht also stated that the Beta Distribution is another good, simpler model for attribution: it weights significantly more importance to the first and last adverts, giving a small amount of credit to the adverts in the middle.  Sales attribution seems to be a hot topic of current research.  This paper was mentioned as influential in the field: Time-weighted attribution of revenue to multiple e-commerce marketing channels in the customer journey.
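As an illustration of the Beta Distribution idea (not necessarily the exact parameterisation used at uSwitch), a U-shaped Beta puts most of the credit on the first and last touches:

    import numpy as np
    from scipy.stats import beta

    def attribution_weights(n_adverts, a=0.5, b=0.5):
        """Share of credit for each of n_adverts touches on the path to a sale.

        A U-shaped Beta (a, b < 1) gives most weight to the first and last
        adverts and a little to those in the middle.
        """
        positions = (np.arange(n_adverts) + 0.5) / n_adverts   # map touches into (0, 1)
        weights = beta.pdf(positions, a, b)
        return weights / weights.sum()

    print(attribution_weights(4).round(2))    # first and last adverts get the largest shares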

Lies, Damned Lies & Dataviz - Andrew Clegg

Clegg's talk was about the connection between raw data and its summarisation – how simplification of the data may not always express the same conclusion. Many different points were made, including soft-skill aspects of data representation such as keeping visualisations simple.

I found Anscombe's Quartet very interesting to learn about – how these four sets of data differ greatly in raw terms and yet, when summarised by mean, SD and linear regression, the sets are apparently identical.  It should be clear that only in the top-left plot is the fitted line of best fit appropriate for the data.

Anscombe's Quartet (source: Wikipedia)
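It is easy to reproduce the punchline with the copy of the quartet that ships with seaborn:

    import numpy as np
    import seaborn as sns

    df = sns.load_dataset('anscombe')                     # columns: dataset, x, y
    print(df.groupby('dataset').agg(['mean', 'std']))     # near-identical means and SDs

    for name, group in df.groupby('dataset'):             # near-identical regression lines too
        slope, intercept = np.polyfit(group['x'], group['y'], 1)
        print(name, round(slope, 2), round(intercept, 2))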

I also learned about the Zipfian distribution which fits many uses, including these examples:

  • Number of posts a user makes on Stackoverflow
  • Number of Retweets a sample of Tweets will receive
  • The frequency that a random set of words occur in a random sample of text

For data that fits this distribution, averages and values derived from them are inappropriate – the distribution is asymptotic as x → 0 and has a gigantic tail, which also skews averages.

Using a short Python code snippet I made a sample distribution using the Zipfian function – something along these lines (the exponent and sample size are illustrative):
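    import numpy as np
    import matplotlib.pyplot as plt

    samples = np.random.zipf(a=2.0, size=10000)      # Zipfian draws, exponent a > 1
    values, counts = np.unique(samples, return_counts=True)

    plt.bar(values[:30], counts[:30])                # the head of the distribution
    plt.xlabel('Value (e.g. posts per user)')
    plt.ylabel('Frequency')
    plt.show()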

Clegg's talk then moved on to different types of charts. Many examples were given to show that Pie Charts are very poor at clearly describing data, and that they get worse still when drawn in 3D (worst of all when the slices are separated from the "Pie").

I had never seen Violin Plots before, which – for a technical user – can very clearly show how distributions change through a more complex data set.  There are some good examples and a comparison with Box Plots here.

Example violin plot (source: School of Mathematics and Statistics, University of Glasgow)
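A quick way to try one, using the example data bundled with seaborn:

    import seaborn as sns
    import matplotlib.pyplot as plt

    tips = sns.load_dataset('tips')                       # example dataset bundled with seaborn
    sns.violinplot(data=tips, x='day', y='total_bill')    # one distribution per category
    plt.show()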

The speaker mentioned using ColorBrewer when selecting a colour palette for visualisations – particularly for colourblind viewers.  He also talked about how colours can be used constructively to show changes (as in a Heatmap). However, palettes can naturally be associated with different themes, and charts should not make inferential connections inappropriately: e.g. if graphing data about cancers, do not use colours where the shading gets darker and darker if no one cancer is worse than the others.  Clegg's slides provide a number of tips outlining this further: Skills Matter presentation / very similar blog post from Clegg while at Pearson.

When responding to questions, Andrew was asked about preferred visualisation tools. He mentioned plot.ly, which I'm a big fan of, and D3, which also came up in Robert Witoff's NASA talk as a good, open source way to visualise data.

Non-linear solver for LibreOffice?

You've tried to install the "Solver for Nonlinear Programming" for LibreOffice on Ubuntu but still can't use the new Solver when you open the dialog from

Tools > Solver...

Check you have the error I've encountered by looking up the nonlinear solver under

Tools > Extension Manager...

"Error: The status of this extension is unknown".

The plugin was written for OpenOffice, not LibreOffice. You have to fix this by uninstalling LibreOffice and then installing OpenOffice.
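On Ubuntu that looks roughly like this (the OpenOffice archive name depends on the version you download from openoffice.org):

    # Remove LibreOffice
    sudo apt-get remove --purge libreoffice*
    sudo apt-get autoremove

    # Install Apache OpenOffice from the downloaded .deb bundle
    tar -xzf Apache_OpenOffice_*_Linux_x86-64_install-deb_en-US.tar.gz
    cd en-US/DEBS
    sudo dpkg -i *.deb
    sudo dpkg -i desktop-integration/*.deb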