This is the first post about Python performance. It covers how to make Python code run faster.
The first way to improve Python performance is by using a different compiler. The most famous ones are:

* Cython
* PyPy
* Numba

Numba and Cython are similar in terms of speed, and PyPy is a little bit slower. You can read more at this quora question.
> Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex).
When using Cython you will need to declare variable types, so the code will look slightly different. For example:
```cython
cdef int a = 0
for i in range(10):
    a += i
print(a)
```
You can also cythonize regular Python code. For example, you can create a file `fib.pyx` that computes the Fibonacci series:
```python
def fib(n):
    """Print the Fibonacci series up to n."""
    a, b = 0, 1
    while b < n:
        print(b, end=' ')
        a, b = b, a + b
    print()
```
And then compile it with Cython through a `setup.py` file:
```python
from distutils.core import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("fib.pyx"),
)
```
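Once the `setup.py` is in place, the extension can be built with `python setup.py build_ext --inplace`, after which `fib` can be imported and used from regular Python.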
Even though Cython is fast, I don't like having to change the code to adapt it to Cython.
> PyPy is a fast, compliant alternative implementation of the Python language (2.7.13 and 3.5.3, 3.6). It has several advantages and distinct features.
When using PyPy you can write regular Python code. The main disadvantage of PyPy is that you can't use many third-party libraries out of the box. If, for example, you want to use pandas you will need a PyPy implementation of it. This makes using PyPy along with common Python packages inconvenient.
> Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.
>
> You don't need to replace the Python interpreter, run a separate compilation step, or even have a C/C++ compiler installed. Just apply one of the Numba decorators to your Python function, and Numba does the rest.
So with Numba you can use Python code without modifications and you won't have compatibility problems with other packages. This is why I prefer Numba over Cython and PyPy, and it is also among the fastest of the three.
Numba provides decorators that compile functions to optimized machine code, which makes execution much faster. By adding the `jit` decorator to a function, Numba will compile it. For example:
```python
from numba import jit, njit

@jit
def m_sum(data):
    out = 0
    for x in data:
        out += x
    return out

m_sum([1, 2, 3])
```
## jit vs njit decorators

The default Numba decorator is `jit`. It is possible to use nopython mode with `@jit(nopython=True)` or with the `njit` decorator, which is equivalent.
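As a minimal sketch of the equivalence:

```python
from numba import jit, njit

@jit(nopython=True)
def add_one(x):
    return x + 1

# njit is shorthand for jit(nopython=True)
@njit
def add_one_bis(x):
    return x + 1

add_one(41), add_one_bis(41)  # (42, 42)
```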
The `njit` decorator is faster than `jit` but supports fewer features. For example, you can't cast to string (with, for example, `str(10)`) inside an `njit`-decorated function. So you should try to use the `njit` decorator whenever possible, especially for numerical calculations.
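For instance, a minimal sketch of this limitation (the exact set of supported operations depends on your Numba version):

```python
from numba import njit

@njit
def numeric_ok(n):
    # Pure numeric work compiles fine in nopython mode
    total = 0
    for i in range(n):
        total += i
    return total

numeric_ok(10)  # 45

@njit
def cast_to_string(x):
    # str() on a number is not supported in nopython mode,
    # so compiling this function raises a TypingError
    return str(x)

# cast_to_string(10)  # raises a TypingError when numba tries to compile it
```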
## vectorize decorator

> Using the vectorize decorator, you write your function as operating over input scalars, rather than arrays. Numba will generate the surrounding loop (or kernel) allowing efficient iteration over the actual inputs.
So, for example, given a `np.array` we can calculate the square with the following function:
```python
import numpy as np
from numba import vectorize

mlist = np.array(range(10))

@vectorize
def square(x):
    return x**2

square(mlist)
```
To increase the speed you can specify the input and output dtypes. In this example `int8` is enough, since the maximum value it needs to store is 81. You can check the numpy documentation to see the different dtypes.
```python
import numpy as np
from numba import vectorize, int8

mlist = np.array(range(10))

@vectorize([int8(int8)])
def square(x):
    return x**2

square(mlist)
```
The format for declaring the dtypes is `[output(input1, input2, ...)]`.
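For instance, a hypothetical function taking two `float64` inputs and returning a `float64` would be declared like this:

```python
import numpy as np
from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def weighted_add(x, y):
    # Signature: float64 output, two float64 inputs
    return x + 2.0 * y

weighted_add(np.ones(3), np.ones(3))  # array([3., 3., 3.])
```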
## sum function

First we will time the execution of different `sum` functions. The first option is a plain Python for loop:
```python
def iter_and_sum(data):
    """Sums each element in an iterable."""
    out = 0
    for x in data:
        out += x
    return out
```
And we will test both the `jit` and `njit` Numba decorators. As an example, this is how the `jit` version would be defined:
```python
@jit
def jit_iter_and_sum(data):
    """Sums each element in an iterable."""
    out = 0
    for x in data:
        out += x
    return out

# Or, equivalently: jit(iter_and_sum)
```
The other options are the built-in `sum()` function and `numpy.sum()`.
For the tests we will try different array lengths. The mean execution times (in seconds) are:
| numpy array size | iter_and_sum | jit | njit | np.sum | sum |
|---|---|---|---|---|---|
| 10^4 | 0.000926 | 0.000001 | 0.000001 | 0.000008 | 0.000750 |
| 10^5 | 0.009162 | 0.000009 | 0.000010 | 0.000042 | 0.007408 |
| 10^6 | 0.092395 | 0.000106 | 0.000120 | 0.000408 | 0.075894 |
| 10^7 | 0.936200 | 0.002313 | 0.002345 | 0.004393 | 0.766890 |
| 10^8 | 9.335412 | 0.021232 | 0.020506 | 0.042701 | 7.549756 |
| 10^9 | 96.134125 | 0.238718 | 0.225308 | 0.433185 | 76.537118 |
The first run has been excluded from each mean so that Numba's compilation time is not counted.
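As an illustration, the benchmark could be set up like this (`time_function` is a sketch of the idea, not the exact script used for the table):

```python
import time
import numpy as np

def time_function(func, data, runs=10):
    """Mean runtime of func(data), excluding the first call
    so that numba's compilation time is not counted."""
    func(data)  # warm-up call triggers the compilation
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        func(data)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

data = np.arange(10**6)
print(time_function(jit_iter_and_sum, data))
```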
It is important to mention that the results would be different using Python lists instead of `numpy` arrays.
For small arrays `numpy.sum` is the best option, but for large arrays both `jit` and `njit` perform really well.
In general it is a really good idea to use numpy functions, since they use compiled C code under the hood.
In all cases, using the raw loop `iter_and_sum` or the default Python `sum` gives poor results.

> numba can be 400x better than a python loop
## A real world example

In one of the projects at my company I found an interesting problem. I had a huge dataframe (650 million rows) and, among other columns, there was one for the `date` and one for the `time`.
The `date` column followed the format `YYYY-MM-DD`, like 2019-05-15. The problem was with the `time` column.
The original format was `hhmmsscc` (`c` for centiseconds), and instead of being a string it was stored as an integer. So, for example, you could find the following values:
| value | represents |
|---|---|
| 23411210 | 23:41:12.10 |
| 9051137 | 09:05:11.37 |
| 712 | 00:00:07.12 |
So to fix that I transformed the values to strings, padded them with zeros until they had 8 characters, and finally split them and inserted `:` and `.` so that they could be parsed.
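For example, for the value `712` the transformation works like this:

```python
x = 712
s = str(x).zfill(8)  # '00000712'
print(s[:2] + ":" + s[2:4] + ":" + s[4:6] + "." + s[6:])  # 00:00:07.12
```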
Let's use this problem to test Numba. First, let's see all the different options I came up with.
### zfill

This was my first approach, and it was good enough. The idea is to transform the `time` column to string and then apply `zfill(8)`. After that I create a string with the appropriate format and parse the whole datetime with `pd.to_datetime`.
```python
import pandas as pd

def zfill(df):
    """
    1. Transform time to str
    2. zfill
    3. Split time as string lists
    4. pd.to_datetime
    """
    aux = df["time"].apply(str).apply(lambda x: x.zfill(8)).str
    return pd.to_datetime(
        df["date"] + " " + aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
    )
```
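To try out the functions in this post, a tiny toy dataframe (nothing like the real 650-million-row one) is enough:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "date": ["2019-05-15", "2019-05-15"],
    "time": [23411210, 712],
})
zfill(toy_df)
# 0   2019-05-15 23:41:12.100
# 1   2019-05-15 00:00:07.120
# dtype: datetime64[ns]
```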
### fix_time_individual

The idea in this option is to create a function decorated with `jit` that transforms one element of the `time` column, and apply the function to the whole column. Instead of using `pd.to_datetime` I use `.astype(np.datetime64)`, since it is faster.
```python
import numpy as np
from numba import jit

def fix_time_individual(df):
    """
    1. pandas.apply a jit function to add 0s to time
    2. Concat date + time
    3. Change to np.datetime64
    """
    @jit
    def _fix_time(x):
        aux = "0" * (8 - len(str(x))) + str(x)
        return aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]

    return (df["date"] + " " + df["time"].apply(_fix_time)).astype(np.datetime64)
```
### fix_time_np_string

With this solution I create an empty `numpy` array and fill it with a loop inside the `jit`-decorated function.
```python
def fix_time_np_string(df):
    """
    1. Use a jit function to add 0s to each time
    2. Concat date + time
    3. Change to np.datetime64
    """
    @jit
    def _fix_time(mlist):
        out = np.empty(mlist.shape, dtype=np.object)
        for i in range(len(mlist)):
            elem = str(mlist[i])
            aux = "0" * (8 - len(elem)) + elem
            out[i] = aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
        return out

    return (df["date"].values + " " + _fix_time(df["time"].values)).astype(np.datetime64)
```
### fix_time_np_datetime

In this case I also create an empty `numpy` array, but with the `datetime64[s]` dtype. This way I can iterate over both `time` and `date` at the same time.
```python
def fix_time_np_datetime(df):
    """
    1. Iterate time and date with a jit function
    2. Transform each element to string and add 0s
    3. Split the string
    4. Cast each element to np.datetime64
    """
    @jit
    def _fix_date(mdate, mtime):
        out = np.empty(mtime.shape, dtype="datetime64[s]")
        for i in range(len(mtime)):
            elem = str(mtime[i])
            aux = "0" * (8 - len(elem)) + elem
            aux = mdate[i] + " " + aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
            out[i] = np.datetime64(aux)
        return out

    return _fix_date(df["date"].values, df["time"].values)
```
### np_divmod_jit

In this solution I process the `time` column as a number: with the `np.divmod` function I build a value that represents a timedelta. After transforming the `time` column I change its dtype to `timedelta64[ms]` and add it to the `date` column as a `datetime64`.
```python
def np_divmod_jit(df):
    """
    1. Iterate time and date with a jit function
    2. Use np.divmod to transform HHMMSSCC to a milliseconds integer
    3. Cast date as np.datetime and time to timedelta
    4. Sum date and time
    """
    @jit
    def _fix_date(mdate, mtime):
        time_out = np.empty(mtime.shape[0], dtype=np.int32)
        for i in range(mtime.shape[0]):
            aux, cent = np.divmod(mtime[i], 100)
            aux, seconds = np.divmod(aux, 100)
            hours, minutes = np.divmod(aux, 100)
            time_out[i] = 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))
        return mdate.astype(np.datetime64) + time_out.astype("timedelta64[ms]")

    return _fix_date(df["date"].values, df["time"].values)
```
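To make the `divmod` cascade concrete, this is the arithmetic for the value `23411210` (23:41:12.10):

```python
aux, cent = divmod(23411210, 100)  # aux = 234112, cent = 10
aux, seconds = divmod(aux, 100)    # aux = 2341, seconds = 12
hours, minutes = divmod(aux, 100)  # hours = 23, minutes = 41
ms = 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))
print(ms)  # 85272100 milliseconds after midnight
```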
### divmod_njit

This is the same as the previous example, but after changing `np.divmod` to the Python `divmod` function I can use the `njit` decorator for the first time.
```python
from numba import njit

def divmod_njit(df):
    """
    1. Iterate time with an njit function
    2. Use divmod to transform HHMMSSCC to a milliseconds integer
    3. Outside the njit function, cast date as np.datetime and time to timedelta
    4. Sum date and time
    """
    @njit
    def _fix_time(mtime):
        time_out = np.empty(mtime.shape)
        for i in range(mtime.shape[0]):
            aux, cent = divmod(mtime[i], 100)
            aux, seconds = divmod(aux, 100)
            hours, minutes = divmod(aux, 100)
            time_out[i] = 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))
        return time_out

    return df["date"].values.astype(np.datetime64) + _fix_time(
        df["time"].values.astype(np.int32)
    ).astype("timedelta64[ms]")
```
### divmod_vectorize

This case is exactly the same as the previous one, but instead of writing the `for` loop I use the Numba `vectorize` decorator.
```python
from numba import vectorize, int32

def divmod_vectorize(df):
    """
    1. Use divmod to transform HHMMSSCC to a milliseconds integer with vectorize
    2. Outside the vectorized function, cast date as np.datetime and time to timedelta
    3. Sum date and time
    """
    @vectorize([int32(int32)])
    def _fix_time(mtime):
        aux, cent = divmod(mtime, 100)
        aux, seconds = divmod(aux, 100)
        hours, minutes = divmod(aux, 100)
        return 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))

    return df["date"].values.astype(np.datetime64) + _fix_time(
        df["time"].values.astype(np.int32)
    ).astype("timedelta64[ms]")
```
| size | zfill | fix_time_individual | fix_time_np_string | fix_time_np_datetime | np_divmod_jit | divmod_njit | divmod_vectorize |
|---|---|---|---|---|---|---|---|
| 10^2 | 0.004388 | 0.213377 | 0.434647 | 0.535352 | 0.428438 | 0.132344 | 0.088566 |
| 10^3 | 0.007160 | 0.223842 | 0.443458 | 0.518424 | 0.437135 | 0.131134 | 0.090028 |
| 10^4 | 0.030412 | 0.265528 | 0.483543 | 0.558784 | 0.536995 | 0.142893 | 0.101280 |
| 10^5 | 0.295480 | 0.711711 | 0.735900 | 0.838042 | 1.435995 | 0.168305 | 0.127379 |
| 10^6 | 3.424938 | 5.112963 | 3.425526 | 3.527455 | 10.157285 | 0.419848 | 0.384422 |
| 10^7 | 37.434941 | 50.304211 | 30.280986 | 32.378085 | 97.415906 | 2.916234 | 2.804018 |
| 10^8 | 721.881598 | 508.880023 | 310.148904 | 326.442289 | 973.317809 | 31.491899 | 31.602522 |
Both `divmod_njit` and `divmod_vectorize` perform really well compared to the other options. It is interesting that my first approach (`zfill`) is the best for small dataframes, but it starts to underperform at 10^5 rows.
> Working with numbers in numba is really fast.
I ran this test with up to 10^7 elements on my computer. I was not able to increase the number of elements due to an out-of-memory error (not enough RAM).
Then I repeated everything with an `M64` Azure machine.
The specs of each machine are:
| feature | my computer | azure M64 |
|---|---|---|
| processor | Intel Core i5-6500 3.2 GHz | Intel Xeon E7-8890 v3 2.5 GHz (Haswell) |
| cores | 4 | 64 |
| RAM | 16 GB | 1 TB |
Let's compare the results of two of the functions, `zfill` and `divmod_vectorize`:
| size | zfill (i5) | zfill (azure) | divmod_vectorize (i5) | divmod_vectorize (azure) |
|---|---|---|---|---|
| 10^2 | 0.004002 | 0.004388 | 0.090219 | 0.088566 |
| 10^3 | 0.005410 | 0.007160 | 0.091660 | 0.090028 |
| 10^4 | 0.031415 | 0.030412 | 0.100109 | 0.101280 |
| 10^5 | 0.291654 | 0.295480 | 0.191021 | 0.127379 |
| 10^6 | 3.143935 | 3.424938 | 1.051343 | 0.384422 |
| 10^7 | 32.780381 | 37.434941 | 9.711320 | 2.804018 |
If we take a look at the results, we can see that both machines perform similarly.
It is important to remember that pandas only works with one core, so I am not using the full potential of either machine. Having more RAM allows working with more data, but it does not increase the speed. With the Numba `vectorize` version, the Azure machine performs better as the size increases.
You can read the numba documentation and browse some examples of Numba from their web page at numba examples.
I also suggest you read this post from Jake VanderPlas about code optimization with numba.