I've recently been trying to build some machine learning skills by working on prediction challenges from Kaggle. Picking up some R along the way was also a goal. I was working on the San Francisco Crime project and planned to start by growing a Random Forest on the full training set, dropping some of the predictors (n=878050, p=5).

After debugging the code on a small sample of the data, I ran it on the full dataset and my laptop ran out of memory. I switched to the Google Cloud Compute instance I'm trialling: 2 CPUs, 13GB of memory plus 7GB of swap. Still not enough memory to build a single-tree Random Forest in R. Next I split the training data and tried to build a Random Forest for just one of the 10 districts. Again the forest couldn't be built for lack of memory, even on the big Google instance.

Thinking this was getting ridiculous, I ported the code to Python to use Scikit-Learn. I ran the same single-tree random forest, with one forest for each district held in memory at the same time, and the model built fine on my low-spec laptop while sharing memory with Chromium, OpenOffice, and the rest. I deliberately avoided optimising the Python code so it had no head start on the R code. These are the two versions of the random forest:

analyse.R
analysis.py
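
The files above aren't reproduced here, but a minimal sketch of the Python/Scikit-Learn approach described above might look something like this. The column names, the train.csv path, and the particular predictors chosen are assumptions for illustration; the real analysis.py may differ.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed path and column names; the Kaggle SF Crime training file
# includes DayOfWeek, PdDistrict, X, Y and the Category target.
train = pd.read_csv("train.csv")

# Hypothetical subset of predictors, standing in for the five kept in the post.
predictors = ["DayOfWeek", "X", "Y"]

forests = {}
for district, subset in train.groupby("PdDistrict"):
    # One-hot encode the categorical day of week; keep coordinates numeric.
    X = pd.get_dummies(subset[predictors], columns=["DayOfWeek"])
    y = subset["Category"]
    # n_estimators=1 gives the single-tree forest used in the R experiment;
    # all per-district forests are kept in memory at the same time.
    forests[district] = RandomForestClassifier(n_estimators=1).fit(X, y)
```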