Python is great. It shortens the distance between “programmer thought” and “programmer results”. And with libraries like numpy, Python’s performance can be good enough. But if you want to easily optimize your code performance, you should consider Cython.
Python can already call external C/C++ code. Cython greatly simplifies that effort and gives your code a performance boost.
We already use software built with Cython: SciPy, pandas, scikit-learn, and spaCy all have large chunks written in Cython.
Since Cython is a superset of Python, you don’t have to use its compiled features or extensions if you don’t want to. You can simply use the Python subset. It only adds, never subtracts.
It may also make our Python code somewhat harder to inspect, because what ships is compiled rather than readable source, although obfuscation by itself is not a security guarantee.
You may have heard that Python is an interpreted language. Really though, the line between compilers and interpreters is blurry. The Python interpreter actually performs two actions when reading your source code:
So running Python involves both compiling (into bytecode) and interpretation (executing that bytecode one instruction at a time). Bytecode is an intermediate representation of your Python code.
The most common interpreter for Python is written in C, and is called CPython. CPython is not the same thing as Cython, although Cython does depend on CPython.
How does the CPython interpreter work? It’s a virtual machine (VM), which is software that emulates a real machine or computer. Python’s VM is actually a stack machine. Stacks are LIFO data structures (last-in, first-out).
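Both steps can be observed from inside Python. The sketch below uses the built-in `compile` and `exec` functions and the standard `dis` module; the source string and the variable name `answer` are made up for illustration, and the exact instruction names that `dis` prints vary between CPython versions.

```python
import dis

source = "answer = 6 * 7"

# Step 1: compile the source text into a code object that holds bytecode.
code = compile(source, "<example>", "exec")
print(type(code.co_code))  # the raw bytecode is stored as a bytes object

# Print a human-readable listing of the bytecode instructions.
dis.dis(code)

# Step 2: the interpreter executes the bytecode.
namespace = {}
exec(code, namespace)
print(namespace["answer"])
```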
It’s probably easier to just show you how it works.
LOAD_GLOBAL looks up the global name stored at index 0 and pushes the result onto the evaluation stack. In this case, that is the built-in max function.
LOAD_FAST takes the local variable at index 0 (x) and pushes its value onto the eval stack.
The eval stack now looks like [“max”, “x”] # ← top of stack
CALL_FUNCTION 1 pops the top item (x, the single argument) off the stack, then pops the next item, the function to call (max), and calls it with that argument. It pushes the result back onto the stack.
The same sequence then runs for y, so the results of both calls are now on the stack. Then BINARY_SUBTRACT executes, which implicitly assumes the top two elements on the stack are the values to be subtracted. The difference is pushed onto the stack and immediately returned by the RETURN_VALUE instruction.
I didn’t even go into the fact that there’s also a function stack, which records the current function and where it should return to when it completes.
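The walkthrough above can be mimicked with a toy stack machine. This is only a sketch: it assumes the function being disassembled was something like `return max(x) - max(y)`, the `run` helper and the sample inputs are hypothetical, and it uses the older opcode names from the walkthrough (newer CPython releases use different names for the call and subtraction instructions).

```python
def run(instructions, global_names, local_vars):
    """Execute a tiny subset of CPython-style instructions on a list-based stack."""
    stack = []  # evaluation stack; the end of the list is the top
    for opname, arg in instructions:
        if opname == "LOAD_GLOBAL":
            stack.append(global_names[arg])   # push the global at index arg
        elif opname == "LOAD_FAST":
            stack.append(local_vars[arg])     # push the local at index arg
        elif opname == "CALL_FUNCTION":
            # Pop `arg` arguments (last pushed comes off first), then the function.
            args = [stack.pop() for _ in range(arg)][::-1]
            func = stack.pop()
            stack.append(func(*args))         # push the call's result
        elif opname == "BINARY_SUBTRACT":
            right = stack.pop()
            left = stack.pop()
            stack.append(left - right)
        elif opname == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unknown instruction: {opname}")


program = [
    ("LOAD_GLOBAL", 0),         # push max
    ("LOAD_FAST", 0),           # push x
    ("CALL_FUNCTION", 1),       # replace both with max(x)
    ("LOAD_GLOBAL", 0),         # push max again
    ("LOAD_FAST", 1),           # push y
    ("CALL_FUNCTION", 1),       # replace both with max(y)
    ("BINARY_SUBTRACT", None),  # max(x) - max(y)
    ("RETURN_VALUE", None),
]

# Made-up inputs: x = [3, 9, 4], y = [1, 5]
result = run(program, global_names=[max], local_vars=[[3, 9, 4], [1, 5]])
print(result)
```

Because the stack is last-in, first-out, the subtraction has to pop the right-hand operand first and the left-hand operand second.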
What is Cython? Two things:
It’s right between high-level Python and low-level C. (I’m old. People used to call C “high level”.)
Cython is a superset of Python--it does everything Python already does, plus extension support.
What’s so great about C? About 50 years of expertise in optimizing its speed! So Cython gives you the extreme flexibility and ease of Python with the performance of C. Best of both worlds.
SpaCy uses Cython! (In fact, it’s the “Cy” in “SpaCy”.) Other projects you may have heard about use it too:
CPython (the C implementation of Python) has a C API, which lets you interface C into Python. Cython is like a very polished wrapper around this.
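To get a feel for the C API underneath, the standard `ctypes` module can call one of its functions directly on a CPython interpreter. This is a sketch that uses ctypes rather than Cython, shown only to illustrate that the interpreter's C functions are reachable; the variable names are arbitrary.

```python
import ctypes
import sys

# ctypes.pythonapi exposes the C API of the running CPython interpreter.
get_version = ctypes.pythonapi.Py_GetVersion
get_version.argtypes = []               # the C function takes no arguments
get_version.restype = ctypes.c_char_p   # and returns a const char *

c_version = get_version().decode("utf-8")
print(c_version)
print(sys.version)  # the same information, reached through normal Python
```

Declaring argument and return types by hand, as done here, is the kind of bookkeeping that Cython is designed to take off your plate.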
You can take a fast C or C++ library and use it from Python, or you can take Python that needs to perform better and get a speed boost.
Let’s pretend parsing a CSV is easy and that we just want a space-separated concatenation of all the words in the 4th column. Here’s the first few lines of such a CSV:
A very simple Python version may look like:
Let’s give this a 100,000-word transcript to reconstruct. How does it perform?
9 ms, most of the time spent splitting lines.
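For reference, a naive pure-Python version of this task might look like the sketch below. It is not the code that was timed: the function name `concat_fourth_column` and the sample rows are hypothetical, and it assumes plain comma-separated lines with no quoted fields.

```python
def concat_fourth_column(lines):
    """Join the 4th comma-separated field of every line with single spaces."""
    words = []
    for line in lines:
        fields = line.rstrip("\n").split(",")  # splitting is where the time goes
        if len(fields) > 3:                    # skip rows that are too short
            words.append(fields[3])
    return " ".join(words)


# Made-up sample rows; the layout of the first three columns is an assumption.
sample = [
    "0,0.00,0.25,hello\n",
    "1,0.25,0.60,world\n",
]
print(concat_fourth_column(sample))
```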
Let’s try Cythonizing this, not changing a single thing about the Python. Cython will generate C/C++ code, which a C/C++ compiler then turns into native machine code. To do this, after installing Cython:
First, copy your file but now give it the extension .pyx (meaning “Python extension”). Then, create a setup file setup.py to build the module:
Now build your module:
Now run it by importing the module like any other:
When I profile this, here’s what I get:
50 ms. For doing nothing extra besides a compilation.
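Timings like these can be gathered with the standard `timeit` module. The sketch below is one possible harness, not the profiler used for the numbers above: the function body, the synthetic 100,000-row transcript, and the choice of five repeats are all assumptions. To time a compiled module, you would pass its function in place of the pure-Python one.

```python
import timeit


def concat_fourth_column(lines):
    """Pure-Python baseline: join the 4th comma-separated field of each line."""
    return " ".join(line.rstrip("\n").split(",")[3] for line in lines)


# Synthetic transcript with one word per row (made-up data).
lines = [f"{i},0.0,0.0,word{i}\n" for i in range(100_000)]

# Run the function several times and keep the fastest run,
# which is the measurement least disturbed by other activity on the machine.
runs = 5
seconds = min(timeit.repeat(lambda: concat_fourth_column(lines), number=1, repeat=runs))
print(f"best of {runs}: {seconds * 1000:.1f} ms")
```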
Why is Cython so much faster for the same file? In general, Python is slower than C / C++ / C# because:
Here’s what is generated by Cython:
The right column is assembly code which gets translated into the raw machine language of the computer it’s on (the hex digits on the left). Assembly code describes what a CPU does (MOV, JMP, etc). An assembler takes this code and emits machine language.
There’s a strong analogy between C generating assembly code and Python interpreters generating bytecode.
Let’s try one more optimization: turns out someone built a really fast C++ CSV parser, and I want to call it from Python. No problem!
First, I created a C++ wrapper class to isolate the functionality I needed from the fast CSV parser. Its interface is stored in a header file, concat.h, which contains all the function declarations your code needs (a declaration tells the C++ compiler the name and types of a function or variable without defining it):
My extension code this time looks more complicated:
These functions are defined in the source code file for my wrapper class:
Once you get it all working, it outperforms both tests so far:
These differences may not seem like much, but let’s scale up the transcript to contain 9M words. How do they perform?
The savings stack up!
Cython in its own words:
Cython is a compiler which compiles Python-like code files to C code. Still, "Cython is not a Python to C translator". That is, it doesn’t take your full program and “turns it into C” – rather, the result makes full use of the Python runtime environment. A way of looking at it may be that your code is still Python in that it runs within the Python runtime environment, but rather than compiling to interpreted Python bytecode one compiles to native machine code (but with the addition of extra syntax for easy embedding of faster C-like code).
This has two important consequences: