Technical basics series: A breakdown of Cython basics

Chris Jones

April 22, 2021

Python is great. It shortens the distance between “programmer thought” and “programmer results”. And with libraries like numpy, Python’s performance can be good enough. But if you want to easily optimize your code performance, you should consider Cython. 

TL;DR 

Python can already call external C/C++ code. Cython greatly simplifies that effort and can give your code a performance boost. 

We use software that uses Cython already - SciPy, pandas, scikit-learn, and spaCy have large chunks that are already written in Cython. 

Since Cython is a superset of Python, you don’t have to use its compiled features or extensions if you don’t want to. You can simply use the Python subset. It only adds, never subtracts. 

It may also make our Python code harder to reverse-engineer, because it ships as compiled binaries rather than readable source. (That is obfuscation, not a security guarantee.)

Background 

You may have heard that Python is an interpreted language. Really though, the line between compilers and interpreters is blurry. The Python interpreter actually performs two actions when reading your source code: 

  1. Lexes, parses, and compiles your source code into code objects, which contain bytecode. This is what’s in those .pyc files, usually found in the __pycache__ directory. It’s not human-readable, but it can be disassembled. 
  2. The Python interpreter actually interprets this bytecode. 

So running Python involves both compiling (into bytecode) and interpretation (executing that bytecode one instruction at a time). Bytecode is an intermediate representation of your Python code. 
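To make the two steps concrete, here is a minimal sketch using only the standard library. The one-line source string and the "<example>" file name are illustrative assumptions, not from the original post, and the exact opcodes printed vary between CPython versions.

```python
import dis
import types

# Step 1: compile source text into a code object (nothing runs yet).
source = "total = 1 + 2"
code = compile(source, "<example>", "exec")

print(isinstance(code, types.CodeType))  # the code object wraps the bytecode
print(type(code.co_code))                # the raw bytecode is a bytes string

# Human-readable disassembly of that bytecode.
dis.dis(code)

# Step 2: the interpreter executes the bytecode.
namespace = {}
exec(code, namespace)
print(namespace["total"])
```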

The most common interpreter for Python is written in C, and is called CPython. CPython (the standard C implementation of Python) is different from Cython. However, Cython does depend on CPython. 

How does the CPython interpreter work? It’s a virtual machine (VM) - which is software that emulates a real machine or computer. Python’s VM is actually a stack machine. Stacks are LIFO data structures (last-in, first-out). 

It’s probably easier to just show you how it works. 

LOAD_GLOBAL 0 looks up the global name at index 0 and pushes it onto the evaluation stack. In this case, that is the built-in max function. 

LOAD_FAST 0 takes the local variable at index 0 (x) and pushes it onto the eval stack. 

The eval stack now looks like [“max”, “x”] # ← top of stack 

CALL_FUNCTION 1 pops the top item (x) off the stack and then pops the next item, the name of the function (max) off the stack, and then calls the function with the argument. It pushes the result back onto the stack. 

The whole thing happens again with y, so now both results are on the stack. Then BINARY_SUBTRACT is called, which implicitly assumes the top 2 elements on the stack are the values to be subtracted. The difference is pushed onto the stack, and then it’s returned immediately with the RETURN_VALUE instruction. 

I didn’t even go into the fact that there’s also a call stack, which records the current function and where it should return to when it completes.
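The walkthrough above can be reproduced with the standard dis module. The function below is a guess that is consistent with the instructions described (call max on each argument, then subtract); its name and the sample lists are hypothetical. Opcode names depend on the CPython version: from 3.11 onward, CALL_FUNCTION and BINARY_SUBTRACT appear as CALL and BINARY_OP. The second half replays the same steps by hand on a plain list used as a stack.

```python
import dis

def spread(x, y):
    # Hypothetical function chosen to match the walkthrough above.
    return max(x) - max(y)

# Real disassembly; opcode names differ between CPython versions.
dis.dis(spread)

# Hand-run of the same steps, using a list as the stack (top = end of list).
x, y = [3, 9, 4], [2, 5]
stack = []
stack.append(max)          # LOAD_GLOBAL: push the max function
stack.append(x)            # LOAD_FAST: push local x
arg = stack.pop()          # CALL_FUNCTION 1: pop the argument...
func = stack.pop()         # ...then pop the function
stack.append(func(arg))    # ...and push the result
stack.append(max)          # same three steps again for y
stack.append(y)
arg = stack.pop()
func = stack.pop()
stack.append(func(arg))
right = stack.pop()        # BINARY_SUBTRACT: pop the top two values
left = stack.pop()
stack.append(left - right)
result = stack.pop()       # RETURN_VALUE: pop and return the top of stack
print(result)
```

The last value pushed is always the first one popped, which is why the right-hand operand comes off the stack before the left-hand one.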

Cython 

What is Cython? Two things: 

  1. A language that blends Python with the static type system of C and C++ 
  2. A compiler that converts Cython source code into C or C++. This code can be compiled into a Python extension (and imported like a module) 

It’s right between high-level Python and low-level C. (I’m old. People used to call C “high level”.) 

Cython is a superset of Python--it does everything Python already does, plus extension support. 

What’s so great about C? About 50 years of expertise in optimizing its speed! So Cython gives you the extreme flexibility and ease of Python with the performance of C. Best of both worlds. 

spaCy uses Cython! (In fact, it’s the “Cy” in “spaCy”.) Other projects you may have heard about use it too, including SciPy, pandas, and scikit-learn. 

CPython (the C implementation of Python) has a C API, which lets you interface C into Python. Cython is like a very polished wrapper around this. 

You can take a fast C or C++ library and use it from Python, or you can take Python that needs to perform better and get a speed boost. 

Example 

Let’s pretend parsing a CSV is easy and that we just want a space-separated concatenation of all the words in the 4th column. Here’s the first few lines of such a CSV: 

A very simple Python version may look like: 

Let’s give this a 100,000-word transcript to reconstruct. How does it perform? 

9 ms, most of the time spent splitting lines. 
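A pure-Python version along those lines can be sketched as follows. This is an assumption about the shape of the code, not the listing that was profiled: the function name, the column layout, and the sample rows are all hypothetical, and the naive split on commas ignores quoted fields (which is the "pretend parsing a CSV is easy" part).

```python
def concat_column(lines, column=3):
    """Join the values in one CSV column (0-based index) with spaces."""
    words = []
    for line in lines:
        fields = line.rstrip("\n").split(",")  # naive split: no quoted commas
        if len(fields) > column:
            words.append(fields[column])
    return " ".join(words)

# Hypothetical rows: speaker, start time, end time, word
sample = [
    "agent,0.00,0.35,hello\n",
    "agent,0.35,0.80,thanks\n",
    "agent,0.80,1.10,for\n",
    "agent,1.10,1.60,calling\n",
]
transcript = concat_column(sample)
print(transcript)
```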

Let’s try Cythonizing this, not changing a single thing about the Python. Cython will generate C/C++ code, which a C/C++ compiler then turns into machine code. To do this, after installing Cython: 

First, copy your file but now give it the extension .pyx (the extension for Cython source files, inherited from Cython’s predecessor, Pyrex). Then, create a setup file setup.py to build the module: 

Now build your module: 

Now run it by importing the module like any other: 
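The three steps above (setup file, build, import) might look like the following sketch. The file name concat.pyx and the module name concat are assumptions; the post does not give them. The build command is the one commonly shown in the Cython documentation.

```python
# setup.py: minimal build script (sketch; "concat.pyx" is a hypothetical file name)
from setuptools import setup
from Cython.Build import cythonize

# cythonize() translates the .pyx file to C and describes the extension to build.
setup(ext_modules=cythonize("concat.pyx"))

# Build the extension in place from the shell:
#     python setup.py build_ext --inplace
#
# Then import the compiled module like any other:
#     import concat
```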

 

When I profile this, here’s what I get: 

50 ms. For doing nothing extra besides a compilation. 

Why is Cython so much faster for the same file? In general, Python is slower than C / C++ / C# because:

  1. Function overhead. Invoking functions is expensive in Python relative to C 
  2. Python loops are way slower than C’s, although numpy helps 
  3. Python is dynamically typed, so it can’t know operand types ahead of time (is “a+b” a float plus int? int plus int? float+float? who knows). At run time Python has to figure out what the types are, unbox them into C values, do the math, and rebox the result into a new object - this takes time 
  4. Objects in Python are all dynamically allocated on the heap, and many common ones (numbers, strings, tuples) are immutable, so there’s overhead in creation and destruction. Heap allocation is usually way slower than the stack allocation you can get in compiled languages 
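Points 3 and 4 can be observed from plain Python. The sketch below is illustrative only; the function name and values are made up, and the size reported for a float is a CPython implementation detail.

```python
import sys

def add(a, b):
    # One bytecode sequence serves every type; the operand types are
    # inspected at run time on each call.
    return a + b

results = [add(1, 2), add(1.5, 2), add("a", "b")]
print(results)

# A Python float is a full heap object, larger than an 8-byte C double.
float_size = sys.getsizeof(1.0)
print(float_size)

# Floats are immutable, so arithmetic allocates a new object for the result.
x = 1000.5
y = x + 1.0
print(y is x)
```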

Here’s what is generated by Cython: 

The right column is assembly code which gets translated into the raw machine language of the computer it’s on (the hex digits on the left). Assembly code describes what a CPU does (MOV, JMP, etc). An assembler takes this code and emits machine language. 

There’s a strong analogy between C generating assembly code and Python interpreters generating bytecode. 

Let’s try one more optimization: turns out someone built a really fast C++ CSV parser, and I want to call it from Python. No problem! 

First, I created a C++ wrapper class to isolate the functionality I needed from the fast CSV parser. This is stored in a header file, concat.h, which contains all the declarations the code needs (a declaration tells the C++ compiler the name and types of a function or variable without defining it): 

My extension code this time looks more complicated: 

These functions are defined in the source code file for my wrapper class: 

 

Once you get it all working, it outperforms both tests so far: 

These differences may not seem like much, but let’s scale up the transcript to contain 9M words. How do they perform? 

The savings stack up! 

Cython in its own words: 

Cython is a compiler which compiles Python-like code files to C code. Still, "Cython is not a Python to C translator". That is, it doesn’t take your full program and “turns it into C” – rather, the result makes full use of the Python runtime environment. A way of looking at it may be that your code is still Python in that it runs within the Python runtime environment, but rather than compiling to interpreted Python bytecode one compiles to native machine code (but with the addition of extra syntax for easy embedding of faster C-like code). 

This has two important consequences: 

  • Speed. How much speed gain you achieve depends very much on what your code is doing. Typical Python numerical programs would tend to gain very little as most time is spent in lower-level C that is used in a high-level fashion. However for-loop-style programs can gain many orders of magnitude, when typing information is added (and is so made possible as a realistic alternative). 
  • Easy calling into C code. One of Cython’s purposes is to allow easy wrapping of C libraries. When writing code in Cython you can call into C code as easily as into Python code. 

Similar projects: 

  • ctypes. This is a standard-library Python module for calling C functions directly from Python. It’s very easy to use, but the more you use it, the more you’re calling C from Python…so why not just compile it all in C at that point and get a performance boost by bypassing the calling overhead? Plus then you can build your own custom module with the functionality you need. This article by a core Cython developer discusses the advantages of static compilation vs dynamic binding--facts that are clear to the many C# developers in our company. 
  • Numba. This is a just-in-time compiler that compiles Python and NumPy code into machine code--excellent for numerical computations, and very good for parallelizing code for CPUs and GPUs. It is something we should consider as well! But Cython is also very good for numerical work, in addition to handling some of the non-numerical bottlenecks we may have. 
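For a sense of what the ctypes approach looks like, here is a minimal sketch that calls the C library’s strlen. It assumes a POSIX system (Linux or macOS) where the C library can be located; the choice of strlen and the test string are purely illustrative.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If find_library returns None, CDLL(None)
# on POSIX falls back to the symbols already loaded into the process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature by hand: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"cython")
print(length)
```

Every call goes through this dynamic binding layer at run time, which is the calling overhead the ctypes bullet refers to.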
