R vs Python: memory use

I've recently been trying to build some machine learning skills by working on prediction challenges from Kaggle; picking up some R was also a goal. I was working on the San Francisco Crime project and planned to start by growing a Random Forest on the full training set after dropping some of the predictors (n=878050, p=5).

After debugging the code with a small sample of the data, I ran it on the full dataset and my laptop ran out of memory. I switched to the Google Cloud Compute instance I'm trialling: 2 CPUs, 13GB of memory plus 7GB of swap. That was still not enough memory to build a single-tree Random Forest in R. Next I took the training data for just one of the 10 districts and tried to build a Random Forest for that. Again the forest couldn't be built for lack of memory, even on the big Google instance.

Thinking this was getting ridiculous, I ported the code to Python to use Scikit-Learn. I ran the same single-tree random forest, with one forest for each district held in memory at the same time, and the models built fine on my low-spec laptop while sharing memory with Chromium, OpenOffice, etc. I intentionally avoided optimising the Python code so that it had no head start over the R code. These are the two versions of the random forest:

analyse.R
analysis.py

Seek a buffer to the start

I've been caught out a few times by forgetting this: when you create a StringIO buffer and write some lines to it, you need to reset the pointer back to the beginning of the buffer before handing it on. The all-important call is seek(0). Without it, the file will be uploaded with no contents.
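A minimal sketch of the problem, assuming Python 3's io.StringIO. The fake_upload function is a hypothetical stand-in for whatever upload call consumes the buffer; like most such calls, it reads from the current position onwards.

```python
import io


def fake_upload(fileobj):
    """Hypothetical stand-in for an upload call: reads from the current position."""
    return fileobj.read()


buf = io.StringIO()
buf.write("first line\n")
buf.write("second line\n")

# The pointer now sits at the end of the buffer, so nothing is left to read.
without_seek = fake_upload(buf)

# Rewind to the beginning before handing the buffer over.
buf.seek(0)
with_seek = fake_upload(buf)
```

Here without_seek comes back empty, while with_seek holds both lines.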

Call to super

Another one whose syntax I'm continually forgetting:

(where of course ChildClass inherits from the super class)
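As a reminder of the syntax, here is a small sketch. ParentClass, OtherChild and the attribute names are made up for illustration; only ChildClass comes from the note above. The explicit two-argument form works in both Python 2 (with new-style classes) and Python 3, while the bare super() form is Python 3 only.

```python
class ParentClass(object):
    def __init__(self, name):
        self.name = name


class ChildClass(ParentClass):
    def __init__(self, name, extra):
        # Explicit form: name the current class and pass self.
        super(ChildClass, self).__init__(name)
        self.extra = extra


class OtherChild(ParentClass):
    def __init__(self, name):
        # Python 3 shorthand: no arguments needed.
        super().__init__(name)


child = ChildClass("example", extra=1)
other = OtherChild("another")
```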

Operators or lambdas

While lambdas are more flexible, using them in map/filter calls is slower than using the pre-built functions in the operator module. Some useful examples:
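The snippet below is an illustrative selection rather than a complete list, with made-up sample data. Each line notes the lambda it replaces.

```python
import operator
from functools import reduce

nums = [1, 2, 3, 4]
pairs = [("b", 2), ("a", 1)]

# Instead of: map(lambda x, y: x + y, nums, nums)
sums = list(map(operator.add, nums, nums))

# Instead of: reduce(lambda x, y: x * y, nums)
product = reduce(operator.mul, nums)

# Instead of: map(lambda p: p[0], pairs)
firsts = list(map(operator.itemgetter(0), pairs))

# Instead of: sorted(pairs, key=lambda p: p[1])
by_second = sorted(pairs, key=operator.itemgetter(1))

# Instead of: map(lambda s: s.upper(), ["a", "b"])
uppers = list(map(operator.methodcaller("upper"), ["a", "b"]))

# Instead of: filter(lambda x: bool(x), [0, 1, "", "x"])
truthy = list(filter(operator.truth, [0, 1, "", "x"]))
```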

Mock objects... for testing

Quickly mock the attributes and functions of an object you need in a test.
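One way to do this is with unittest.mock from the Python 3 standard library. That is an assumption about tooling, not necessarily the approach these notes first had in mind. The user object, its username attribute and its get_balance method are all hypothetical.

```python
from unittest.mock import Mock

# A throwaway object standing in for a real dependency.
user = Mock()
user.username = "alice"               # mocked attribute
user.get_balance.return_value = 42    # mocked method with a canned result

balance = user.get_balance("GBP")

# Mocks record their calls, so the test can check how they were used.
user.get_balance.assert_called_once_with("GBP")
```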

List slicing

For shorter lists:
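Plain slice notation is enough here, since copying a short list is cheap. A quick sketch with a made-up list called letters:

```python
letters = ["a", "b", "c", "d", "e", "f"]

every_other = letters[1::2]    # start at index 1, run to the end, step by 2
first_three = letters[:3]      # the first three items
reversed_copy = letters[::-1]  # a reversed copy of the whole list
```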

For longer sequences, use iterators:

In itertools.islice(sequence, 1, None, 2), the arguments 1, None and 2 are the start, stop and step.
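A minimal sketch of the iterator version using itertools.islice. The numbers generator is a hypothetical stand-in for any long or lazy sequence. Nothing is copied, and items are only produced as they are consumed.

```python
from itertools import islice


def numbers():
    """Hypothetical generator standing in for a long or lazy sequence."""
    for n in range(10 ** 6):
        yield n


# Start at index 1, no stop (None), step by 2: lazily yields 1, 3, 5, ...
odd_numbers = islice(numbers(), 1, None, 2)

# Take only the first five results; the rest are never generated.
first_five = list(islice(odd_numbers, 5))
```

Unlike list slicing, islice does not support negative values for start, stop or step.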

Slicing 2d lists with numpy