Memoization for Machine Learning models

During experimentation and parameter scanning with machine learning models, we often run the same expensive calculation many times.

We can therefore save considerable time using memoization: storing a function's results, keyed by its arguments, so that they can be looked up the next time. Instead of embedding the same read/write functionality in each of our methods, it is much nicer to use a decorator for this, as sketched below.
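As a reminder, the classic in-memory version of this idea takes only a few lines (a minimal sketch; the disk-backed variant developed below extends it):

def memoize(fun):
    cache = {}
    def new_fun(*args):
        # Compute only on a cache miss; look the result up otherwise
        if args not in cache:
            cache[args] = fun(*args)
        return cache[args]
    return new_fun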

Existing standard decorators were a bit too simplistic (no saving to disk, no hashing of arrays), or had problems with TensorFlow objects (joblib.Memory crashes when trying to hash them, which is a pity).
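The standard library's functools.lru_cache illustrates both limitations: it keeps results in memory only, and it requires hashable arguments, so NumPy arrays are rejected outright:

from functools import lru_cache
import numpy as np

@lru_cache(maxsize=None)
def slow_square(x):
    return x ** 2

slow_square(3)             # cached, but in memory only: gone after a restart
# slow_square(np.ones(5))  # would raise TypeError: unhashable type: 'numpy.ndarray'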

So I decided to write a simple decorator myself, with the following properties:

  • it pickles the results to disk
  • the pickled files are located in CACHE_PATH (see the function below); the directory is created if it doesn't exist
  • the pickled file paths consist of the name of the inner (decorated) function, with the arguments appended
  • the encoding of the arguments is simply the argument name and value, or (in case a pandas DataFrame or NumPy array is passed) the first 10 characters of the SHA-1 hex-hash of the data

The latter ensures that results for shuffled datasets with otherwise identical arguments are cached separately, as the short sketch below illustrates. Consequently, we should only shuffle with a fixed random seed to benefit from the cache across runs (which is good practice anyway).
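To illustrate the encoding, a quick sketch (the array contents are arbitrary): the same data always yields the same short key, while reordered data yields a different one:

import hashlib
import numpy as np

a = np.arange(10, dtype=float)
b = a[::-1].copy()  # same values, different order (copied to keep it C-ordered)

print(hashlib.sha1(a).hexdigest()[:10])  # stable key: same data gives the same key every run
print(hashlib.sha1(b).hexdigest()[:10])  # different key, since the underlying bytes differ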

In [1]:
import os
import numpy as np
import pandas as pd
from functools import wraps
import time
import pickle
import hashlib
In [2]:
def memotodisk(fun):
    """ Memoization decorator, caching the results for function f to disk.
    The cache filename consists of the function name + (hex-hashed) arguments of the function f
    np.DataFrames and np.arrays are hex-hashed
    """
    CACHE_PATH = '.cache'
    if not os.path.exists(CACHE_PATH):
        os.makedirs(CACHE_PATH)

    def code_argument(arg):
        if hasattr(arg, '__name__'):
            return arg.__name__
        if isinstance(arg, pd.DataFrame):
            # NB: Python's hash function is randomized for security reasons
            # Therefore, use hashlib.
            return hashlib.sha1(arg.values).hexdigest()[:10] # Cut-off to reduce filename length
        if isinstance(arg, np.ndarray):
            try:
                return hashlib.sha1(arg).hexdigest()[:10]
            except ValueError:
                # In case the numpy array is not C-ordered, fix this
                return hashlib.sha1(arg.copy(order='C')).hexdigest()[:10]
        return str(arg)[:10]

    @wraps(fun)
    def new_fun(*args, **kwargs):
        string_args = ''
        if len(args) > 0:
            string_args += '_' + '_'.join([code_argument(arg) for arg in args])
        if len(kwargs) > 0:
            # NB: relies on kwargs preserving insertion order (Python 3.7+);
            # calls differing only in keyword order are cached separately
            string_args += '_' + '_'.join([(str(k)[:10] + code_argument(v)) for k, v in kwargs.items()])

        filename = os.path.join(CACHE_PATH, '.cache_{}{}.pickle'.format(fun.__name__, string_args))

        if os.path.exists(filename):
            with open(filename, 'rb') as file:
                result = pickle.load(file)
        else:
            result = fun(*args, **kwargs)
            with open(filename, 'wb') as file:
                pickle.dump(result, file)
        return result
    return new_fun

Let's test this with a simple function

In [3]:
# To test the memoization decorator
@memotodisk
def some_expensive_function(t, X):
    time.sleep(t)
    return (t, len(X))

We give the function some random data, and a waiting time of 2 seconds.

The results get cached to disk after the inner some_expensive_function has run:

In [4]:
np.random.seed(1)
X = np.random.rand(100, 1)
t0 = time.time()
print(some_expensive_function(t=2, X=X))
print('Function took {:.3f} seconds'.format(time.time() - t0))
(2, 100)
Function took 2.005 seconds
In [6]:
[f for f in os.listdir('.cache') if 'expensive_function' in f]
Out[6]:
['.cache_some_expensive_function_t2_X5aa8b09013.pickle']

So if we now run it again, it will be very fast (taking only the disk I/O time):

In [7]:
t0 = time.time()
print(some_expensive_function(t=2, X=X))
print('Function took {:.3f} seconds'.format(time.time() - t0))
(2, 100)
Function took 0.001 seconds

We will now shuffle the data and run the function again. Because the shuffled NumPy array has different byte content, it gets a different hash, and a new result is calculated and cached.

Note that this is essential: when we test algorithms on different datasets or on shuffled data (for instance, to measure the variance of a regressor), we want those results cached too.

In [8]:
np.random.seed(1)
np.random.shuffle(X)
t0 = time.time()
print(some_expensive_function(t=2, X=X))
print('Function took {:.3f} seconds'.format(time.time() - t0))
(2, 100)
Function took 2.004 seconds

This new result was also cached:

In [9]:
[f for f in os.listdir('.cache') if 'expensive_function' in f]
Out[9]:
['.cache_some_expensive_function_t2_X5aa8b09013.pickle',
 '.cache_some_expensive_function_t2_X8ec6555a9f.pickle']
In [10]:
t0 = time.time()
print(some_expensive_function(t=2, X=X))
print('Function took {:.3f} seconds'.format(time.time() - t0))
(2, 100)
Function took 0.001 seconds
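One caveat before the usage advice: the cache key encodes only the function name and its arguments, not the function's code, so after editing the decorated function the stale results have to be removed by hand. A small sketch of such a helper (clear_memo_cache is a hypothetical name, not part of the decorator above):

import glob
import os

def clear_memo_cache(fun_name, cache_path='.cache'):
    # Remove all cached pickles belonging to the given function name
    for f in glob.glob(os.path.join(cache_path, '.cache_{}*.pickle'.format(fun_name))):
        os.remove(f)

clear_memo_cache('some_expensive_function')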

Advice on usage: this decorator is best applied to the methods that encapsulate the machine learning model (and handle the passing of data and the setting of parameters). That way, higher-level functions (that, for instance, perform parameter scans) can be modified without having to run the expensive algorithm again.
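As a sketch of that pattern (the Ridge model, fit_and_score, and its parameters are illustrative choices, not prescribed by the decorator):

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

@memotodisk
def fit_and_score(alpha, X, y):
    # The expensive, cache-worthy part: fitting and scoring the model
    return cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()

y = np.random.rand(100)  # illustrative target matching the X defined above

# The higher-level parameter scan stays cheap to modify and re-run:
for alpha in [0.1, 1.0, 10.0]:
    print(alpha, fit_and_score(alpha, X, y))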
