This is the first post about Python performance. It covers how to make Python code run faster.
The first way to improve Python performance is by using a different compiler. The most famous ones are:

* Cython
* PyPy
* Numba

Numba and Cython are similar in terms of speed, and PyPy is a little bit slower. You can read more at this quora question.
> Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex).
When using Cython you will need to declare variable types, so the code will look slightly different. For example:
```cython
cdef int a = 0
for i in range(10):
    a += i
print(a)
```
You can also cythonize regular Python code. For example, you can create a file `fib.pyx` that computes the Fibonacci series:
```python
def fib(n):
    """Print the Fibonacci series up to n."""
    a, b = 0, 1
    while b < n:
        print(b, end=' ')
        a, b = b, a + b
    print()
```
And then compile it with Cython through a `setup.py` file:
```python
from distutils.core import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("fib.pyx"),
)
```
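Once the `setup.py` is in place, the extension can be built with `python setup.py build_ext --inplace`, after which `fib` can be imported and used from regular Python.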
Even though Cython is fast, I don't like having to change the code to adapt it to Cython.
> PyPy is a fast, compliant alternative implementation of the Python language (2.7.13 and 3.5.3, 3.6). It has several advantages and distinct features.
When using PyPy you can write regular Python code. The main disadvantage of PyPy is that you can't use many third-party libraries out of the box. If, for example, you want to use pandas you will need a PyPy implementation of it. This makes using PyPy along with common Python packages inconvenient.
> Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.
>
> You don't need to replace the Python interpreter, run a separate compilation step, or even have a C/C++ compiler installed. Just apply one of the Numba decorators to your Python function, and Numba does the rest.
So with Numba you can use Python code without modifications and you won't have compatibility problems with other packages. This is why I prefer Numba over Cython and PyPy, and it is also among the fastest of the three.
Numba provides decorators that compile functions to optimized machine code, which makes execution much faster. By adding the `jit` decorator to a function, Numba will compile it. For example:
```python
from numba import jit, njit

@jit
def m_sum(data):
    out = 0
    for x in data:
        out += x
    return out

m_sum([1, 2, 3])
```
## jit vs njit decorators

The default Numba decorator is `jit`. It is possible to use nopython mode with `@jit(nopython=True)` or with the `njit` decorator, which is equivalent.
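As a minimal sketch of the equivalence:

```python
from numba import jit, njit

@jit(nopython=True)
def add_one(x):
    return x + 1

# njit is shorthand for jit(nopython=True)
@njit
def add_one_bis(x):
    return x + 1

add_one(41), add_one_bis(41)  # (42, 42)
```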
The `njit` decorator is faster than `jit` but supports fewer features. For example, you can't cast to string (with, for example, `str(10)`) inside an `njit`-decorated function. So you should try to use the `njit` decorator whenever possible, especially for numerical calculations.
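For instance, a minimal sketch of this limitation (the exact set of supported operations depends on your Numba version):

```python
from numba import njit

@njit
def numeric_ok(n):
    # Pure numeric work compiles fine in nopython mode
    total = 0
    for i in range(n):
        total += i
    return total

numeric_ok(10)  # 45

@njit
def cast_to_string(x):
    # str() on a number is not supported in nopython mode,
    # so compiling this function raises a TypingError
    return str(x)

# cast_to_string(10)  # raises a TypingError when numba tries to compile it
```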
## vectorize decorator

> Using the vectorize decorator, you write your function as operating over input scalars, rather than arrays. Numba will generate the surrounding loop (or kernel) allowing efficient iteration over the actual inputs.
So, for example, given a `np.array` we can calculate the square with the following function:
```python
import numpy as np
from numba import vectorize

mlist = np.array(range(10))

@vectorize
def square(x):
    return x**2

square(mlist)
```
To increase the speed you can specify the input and output dtypes. In this example `int8` is enough, since the maximum value it needs to store is 81. You can check the numpy documentation to see the different dtypes.
```python
import numpy as np
from numba import vectorize, int8

mlist = np.array(range(10))

@vectorize([int8(int8)])
def square(x):
    return x**2

square(mlist)
```
The format for declaring the dtypes is `[output(input1, input2, ...)]`.
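For instance, a hypothetical function taking two `float64` inputs and returning a `float64` would be declared like this:

```python
import numpy as np
from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def weighted_add(x, y):
    # Signature: float64 output, two float64 inputs
    return x + 2.0 * y

weighted_add(np.ones(3), np.ones(3))  # array([3., 3., 3.])
```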
## sum function

First we will time the execution of different `sum` functions. The first option is a plain Python for loop:
```python
def iter_and_sum(data):
    """Sums each element in an iterable."""
    out = 0
    for x in data:
        out += x
    return out
```
And we will test both the `jit` and `njit` Numba decorators. As an example, this is how the `jit` version would be defined:
```python
@jit
def jit_iter_and_sum(data):
    """Sums each element in an iterable."""
    out = 0
    for x in data:
        out += x
    return out

# Or, equivalently: jit(iter_and_sum)
```
The other options are the built-in `sum()` function and `numpy.sum()`.
For the tests we will try different array lengths. The mean execution times (in seconds) are:
| numpy array size | iter_and_sum | jit | njit | np.sum | sum |
|---|---|---|---|---|---|
| 10^4 | 0.000926 | 0.000001 | 0.000001 | 0.000008 | 0.000750 |
| 10^5 | 0.009162 | 0.000009 | 0.000010 | 0.000042 | 0.007408 |
| 10^6 | 0.092395 | 0.000106 | 0.000120 | 0.000408 | 0.075894 |
| 10^7 | 0.936200 | 0.002313 | 0.002345 | 0.004393 | 0.766890 |
| 10^8 | 9.335412 | 0.021232 | 0.020506 | 0.042701 | 7.549756 |
| 10^9 | 96.134125 | 0.238718 | 0.225308 | 0.433185 | 76.537118 |
The first run has been excluded from each mean so that Numba's compilation time is not counted.
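As an illustration, the benchmark could be set up like this (`time_function` is a sketch of the idea, not the exact script used for the table):

```python
import time
import numpy as np

def time_function(func, data, runs=10):
    """Mean runtime of func(data), excluding the first call
    so that numba's compilation time is not counted."""
    func(data)  # warm-up call triggers the compilation
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        func(data)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

data = np.arange(10**6)
print(time_function(jit_iter_and_sum, data))
```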
It is important to mention that the results would be different using Python lists instead of `numpy` arrays.
For small arrays `numpy.sum` is the best option, but for large arrays both `jit` and `njit` perform really well.
In general it is a really good idea to use numpy functions, since they use compiled C code under the hood.
In all cases, using the raw loop `iter_and_sum` or the default Python `sum` gives poor results.

> numba can be 400x better than a python loop
## A real world example

In one of the projects at my company I found an interesting problem. I had a huge dataframe (650 million rows) and, among other columns, there was one for the `date` and one for the `time`.
The `date` column followed the format `YYYY-MM-DD`, like 2019-05-15. The problem was with the `time` column.
The original format was `hhmmsscc` (`c` for centiseconds), and instead of being a string it was stored as an integer. So, for example, you could find the following values:
| value | represents |
|---|---|
| 23411210 | 23:41:12.10 |
| 9051137 | 09:05:11.37 |
| 712 | 00:00:07.12 |
So to fix that I transformed the values to strings, padded them with zeros until they had 8 characters, and finally split them and inserted `:` and `.` so that they could be parsed.
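For example, for the value `712` the transformation works like this:

```python
x = 712
s = str(x).zfill(8)  # '00000712'
print(s[:2] + ":" + s[2:4] + ":" + s[4:6] + "." + s[6:])  # 00:00:07.12
```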
Let's use this problem to test Numba. First, let's see all the different options I came up with.
### zfill

This was my first approach, and it was good enough. The idea is to transform the `time` column to string and then apply `zfill(8)`. After that I create a string with the appropriate format and parse the whole datetime with `pd.to_datetime`.
```python
import pandas as pd

def zfill(df):
    """
    1. Transform time to str
    2. zfill
    3. Split time as string lists
    4. pd.to_datetime
    """
    aux = df["time"].apply(str).apply(lambda x: x.zfill(8)).str
    return pd.to_datetime(
        df["date"] + " " + aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
    )
```
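To try out the functions in this post, a tiny toy dataframe (nothing like the real 650-million-row one) is enough:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "date": ["2019-05-15", "2019-05-15"],
    "time": [23411210, 712],
})
zfill(toy_df)
# 0   2019-05-15 23:41:12.100
# 1   2019-05-15 00:00:07.120
# dtype: datetime64[ns]
```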
### fix_time_individual

The idea in this option is to create a function decorated with `jit` that transforms one element of the `time` column, and apply the function to the whole column. Instead of using `pd.to_datetime` I use `.astype(np.datetime64)`, since it is faster.
```python
import numpy as np
from numba import jit

def fix_time_individual(df):
    """
    1. pandas.apply a jit function to add 0s to time
    2. Concat date + time
    3. Change to np.datetime64
    """
    @jit
    def _fix_time(x):
        aux = "0" * (8 - len(str(x))) + str(x)
        return aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]

    return (df["date"] + " " + df["time"].apply(_fix_time)).astype(np.datetime64)
```
### fix_time_np_string

With this solution I create an empty `numpy` array and fill it with a loop inside the `jit`-decorated function.
```python
def fix_time_np_string(df):
    """
    1. Use a jit function to add 0s to each time
    2. Concat date + time
    3. Change to np.datetime64
    """
    @jit
    def _fix_time(mlist):
        out = np.empty(mlist.shape, dtype=np.object)
        for i in range(len(mlist)):
            elem = str(mlist[i])
            aux = "0" * (8 - len(elem)) + elem
            out[i] = aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
        return out

    return (df["date"].values + " " + _fix_time(df["time"].values)).astype(np.datetime64)
```
### fix_time_np_datetime

In this case I also create an empty `numpy` array, but with the `datetime64[s]` dtype. This way I can iterate over both `time` and `date` at the same time.
```python
def fix_time_np_datetime(df):
    """
    1. Iterate time and date with a jit function
    2. Transform each element to string and add 0s
    3. Split the string
    4. Cast each element to np.datetime64
    """
    @jit
    def _fix_date(mdate, mtime):
        out = np.empty(mtime.shape, dtype="datetime64[s]")
        for i in range(len(mtime)):
            elem = str(mtime[i])
            aux = "0" * (8 - len(elem)) + elem
            aux = mdate[i] + " " + aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
            out[i] = np.datetime64(aux)
        return out

    return _fix_date(df["date"].values, df["time"].values)
```
### np_divmod_jit

In this solution I process the `time` column as a number: with the `np.divmod` function I build a value that represents a timedelta. After transforming the `time` column I change its dtype to `timedelta64[ms]` and add it to the `date` column as a `datetime64`.
```python
def np_divmod_jit(df):
    """
    1. Iterate time and date with a jit function
    2. Use np.divmod to transform HHMMSSCC to a milliseconds integer
    3. Cast date as np.datetime and time to timedelta
    4. Sum date and time
    """
    @jit
    def _fix_date(mdate, mtime):
        time_out = np.empty(mtime.shape[0], dtype=np.int32)
        for i in range(mtime.shape[0]):
            aux, cent = np.divmod(mtime[i], 100)
            aux, seconds = np.divmod(aux, 100)
            hours, minutes = np.divmod(aux, 100)
            time_out[i] = 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))
        return mdate.astype(np.datetime64) + time_out.astype("timedelta64[ms]")

    return _fix_date(df["date"].values, df["time"].values)
```
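To make the `divmod` cascade concrete, this is the arithmetic for the value `23411210` (23:41:12.10):

```python
aux, cent = divmod(23411210, 100)  # aux = 234112, cent = 10
aux, seconds = divmod(aux, 100)    # aux = 2341, seconds = 12
hours, minutes = divmod(aux, 100)  # hours = 23, minutes = 41
ms = 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))
print(ms)  # 85272100 milliseconds after midnight
```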
### divmod_njit

This is the same as the previous example, but after changing `np.divmod` to the Python `divmod` function I can use the `njit` decorator for the first time.
```python
from numba import njit

def divmod_njit(df):
    """
    1. Iterate time with an njit function
    2. Use divmod to transform HHMMSSCC to a milliseconds integer
    3. Outside the njit function, cast date as np.datetime and time to timedelta
    4. Sum date and time
    """
    @njit
    def _fix_time(mtime):
        time_out = np.empty(mtime.shape)
        for i in range(mtime.shape[0]):
            aux, cent = divmod(mtime[i], 100)
            aux, seconds = divmod(aux, 100)
            hours, minutes = divmod(aux, 100)
            time_out[i] = 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))
        return time_out

    return df["date"].values.astype(np.datetime64) + _fix_time(
        df["time"].values.astype(np.int32)
    ).astype("timedelta64[ms]")
```
### divmod_vectorize

This case is exactly the same as the previous one, but instead of writing the `for` loop I use the Numba `vectorize` decorator.
```python
from numba import vectorize, int32

def divmod_vectorize(df):
    """
    1. Use divmod to transform HHMMSSCC to a milliseconds integer with vectorize
    2. Outside the vectorized function, cast date as np.datetime and time to timedelta
    3. Sum date and time
    """
    @vectorize([int32(int32)])
    def _fix_time(mtime):
        aux, cent = divmod(mtime, 100)
        aux, seconds = divmod(aux, 100)
        hours, minutes = divmod(aux, 100)
        return 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))

    return df["date"].values.astype(np.datetime64) + _fix_time(
        df["time"].values.astype(np.int32)
    ).astype("timedelta64[ms]")
```
| size | zfill | fix_time_individual | fix_time_np_string | fix_time_np_datetime | np_divmod_jit | divmod_njit | divmod_vectorize |
|---|---|---|---|---|---|---|---|
| 10^2 | 0.004388 | 0.213377 | 0.434647 | 0.535352 | 0.428438 | 0.132344 | 0.088566 |
| 10^3 | 0.007160 | 0.223842 | 0.443458 | 0.518424 | 0.437135 | 0.131134 | 0.090028 |
| 10^4 | 0.030412 | 0.265528 | 0.483543 | 0.558784 | 0.536995 | 0.142893 | 0.101280 |
| 10^5 | 0.295480 | 0.711711 | 0.735900 | 0.838042 | 1.435995 | 0.168305 | 0.127379 |
| 10^6 | 3.424938 | 5.112963 | 3.425526 | 3.527455 | 10.157285 | 0.419848 | 0.384422 |
| 10^7 | 37.434941 | 50.304211 | 30.280986 | 32.378085 | 97.415906 | 2.916234 | 2.804018 |
| 10^8 | 721.881598 | 508.880023 | 310.148904 | 326.442289 | 973.317809 | 31.491899 | 31.602522 |
Both `divmod_njit` and `divmod_vectorize` perform really well compared to the other options. It is interesting that my first approach (`zfill`) is the best for small dataframes, but it starts to underperform at 10^5 rows.
> Working with numbers in numba is really fast.
I ran this test with up to 10^7 elements on my computer. I was not able to increase the number of elements due to an out-of-memory error (not enough RAM).
Then I repeated everything with an `M64` Azure machine.
The specs of each machine are:
| feature | my computer | azure M64 |
|---|---|---|
| processor | Intel Core i5-6500 3.2 GHz | Intel Xeon E7-8890 v3 2.5 GHz (Haswell) |
| cores | 4 | 64 |
| RAM | 16 GB | 1 TB |
Let's compare the results of two of the functions, `zfill` and `divmod_vectorize`:
| size | zfill (i5) | zfill (azure) | divmod_vectorize (i5) | divmod_vectorize (azure) |
|---|---|---|---|---|
| 10^2 | 0.004002 | 0.004388 | 0.090219 | 0.088566 |
| 10^3 | 0.005410 | 0.007160 | 0.091660 | 0.090028 |
| 10^4 | 0.031415 | 0.030412 | 0.100109 | 0.101280 |
| 10^5 | 0.291654 | 0.295480 | 0.191021 | 0.127379 |
| 10^6 | 3.143935 | 3.424938 | 1.051343 | 0.384422 |
| 10^7 | 32.780381 | 37.434941 | 9.711320 | 2.804018 |
If we take a look at the results, we can see that both machines perform similarly.
It is important to remember that pandas only works with one core, so I am not using the full potential of either machine. Having more RAM allows working with more data, but it does not increase the speed. With the Numba `vectorize` version, the Azure machine performs better as the size increases.
You can read the numba documentation and browse some examples of Numba from their web page at numba examples.
I also suggest you read this post from Jake VanderPlas about code optimization with numba.