Optimising Cython code

Writing fast Cython code is a large topic, with many options and tricks available to users, so we will not cover everything here. See the links in Useful resources for more details. Only the key aspects of Cython optimising with be detailed here.

Typing variables and returns

One of the major reasons why Python is slow compared to statically typed languages is that all variables, function arguments and function return types are PyObject* instances in CPython (unless otherwise stated). Python then requires that various book-keeping tasks must be performed on these objects (updating reference counters and adding tracebacks) when they are instantiated and returned from functions.

To avoid requiring these tasks to be performed if they don’t need to be, then one can (and should) explicitly type any variable and/or return type where possible. Types of variables can be declared inside function bodies

cimport numpy as np # required for ndarray
import numpy as np # for zeros function

def foo():
    cdef:
        # declare x as an int
        int x

        # declare y and z as double-precision floating point
        double y, z

        # declare eps as a double and initialise
        double eps = 1e-10

        # declare mat as a NumPy array with 2 dimensions and initialise as a 10x10 matrix of zeros
        np.ndarray[int, ndim=2] mat = np.zeros((10, 10), dtype=np.intc)

    # ...

And as arguments to any function (note: passing arrays in this way is generally not the preferred method for |Finesse| Cython code - typed memory views are preferred, see Typed memory-views)

cimport numpy as np

def bar(int x, double y, np.ndarray[double, ndim=2] mat)

And, finally, also as return types of cdef and cpdef functions (def functions cannot have typed returns as they must always return PyObject*)

cpdef double mean(np.ndarray[double, ndim=1] a):
    cdef:
        Py_ssize_t i
        Py_ssize_t N = a.shape[0]
        double sum = 0.0

    for i in range(N):
        sum += a[i]

    return sum / N

Note the use of the type Py_ssize_t here - this type should always be given to variables which are used for indexing arrays.

Writing fast for-loops

In order to allow Cython to convert a for-loop into pure C, all loop incrementing and limit variables should be declared with cdef (if any of these are left untyped then they are assumed to be Python objects and so the loop cannot be translated to a simple, pure C loop). An example of this is shown in the mean function above - both i and N (which are used as the incrementing/indexing and limit variables, respectively) are declared as the type Py_ssize_t.

Typed memory-views

Typed memoryviews are the preferred objects for interacting with NumPy arrays in the Finesse Cython extensions. They provide efficient access to the memory buffers underlying NumPy arrays without incurring any Python overhead, and have more features and a cleaner syntax than using ndarray[Ty, ndim=n] the style syntax we saw above. For more information, and a full feature list, see the Typed memoryview documentation.

The most important aspects of memoryviews are summarised here:

Declare as
cdef Ty[:] view_1D = exporting_object
for a one-dimensional memoryview on some array of the same type Ty.

For a higher-dimensional view use, e.g

cdef Ty[:, :] view_2D = exporting_object

Pass a view as an argument with

def do_array_operation(double[:] array_view):
    ...

All the arrays that we work with will have a contiguous C layout (the default memory structure of NumPy arrays). This means we should declare memoryviews with the
cdef Ty[::1] contig_view = exporting_contig_object
syntax - where this creates a 1D contiguous view on the memory buffer assigned to.
This, of course, works for higher dimensional arrays too (here the elements in the second-dimension are one element, sizeof(double), apart in memory - which is true for multi-dimensional C arrays)
def operation_on_matrix(double[:, ::1] contig_matrix_view): ...

Multi-threading

Cython has support for OpenMP via the sub-module cython.parallel, which contains convenient wrappers such as the function prange. Since we perform a lot of elementwise operations in Finesse, we can make use of this feature of Cython to distribute the work among multiple threads.

A simple example of using Cython parallelism is shown here, this silly example was chosen just to highlight which dependencies need to be declared nogil

cimport numpy as np
import numpy as np
from cython.parallel import prange

cdef extern from "math.h":
    # functions from C headers must also be declared with nogil
    # for use in multi-threaded Cython code
    double cos(double arg) nogil
    double sin(double arg) nogil

# declare this plain C function nogil as it is
# used within the body of a prange for loop
cdef double sin_cos_square(double x) nogil:
    return sin(x) * sin(x) + cos(x) * cos(x)

cpdef np.ndarray[double, ndim=2] transform(double[:, ::1] mat):
    cdef:
        Py_ssize_t i, j
        Py_ssize_t N = mat.shape[0]
        np.ndarray[double, ndim=2] out = np.zeros(mat.shape)

    # we use a prange here, with nogil flag set to True to indicate
    # that the loop body must be executed with the GIL released
    for i in prange(N, nogil=True):
        for j in range(N):
            out[i][j] = sin_cos_square(mat[i][j])

    return out

Compiler directives

Compiler directives are instructions which can be passed to Cython which modify the behaviour of certain parts of the code. Some of the available options (listed here) can be used to improve performance.

One of the key directives which we apply by default to all Cython extensions in Finesse is the cdivision directive. This has the effect of translating all division operations in the Cython code to C style divisions - most importantly, this means that no zero-checking is performed on the denominator.

Other performance improving directives, which are used in certain extensions or as decorators to specific functions in Finesse, are listed here for convenience:

boundscheck = False: no checks on indices being out of bounds are carried out on arrays or memory-views.
wraparound = False: disables the use of negative indices being passed to arrays and memory-views. If a negative index is passed then no checks are performed resulting in Undefined Behaviour (likely manifesting as segfaults).
initializedcheck = False: disables checks on whether memory-views have been initialised on every access operation.

To apply a directive to a body of code within a function

cimport cython

def foo(double[:, ::1] mat):
    cdef:
        Py_ssize_t i, j
        Py_ssize_t N = mat.shape[0]

    with cython.boundscheck(False):
        for i in range(N):
            for j in range(N):
                double element = mat[i][j]
                # ...

Or to a function as a whole

cimport cython

@cython.boundscheck(False)
def foo(double[:, ::1] mat):
    cdef:
        Py_ssize_t i, j
        Py_ssize_t N = mat.shape[0]

    for i in range(N):
        for j in range(N):
            double element = mat[i][j]
            # ...

Or to an extension (.pyx file) in entirety

#cython: boundscheck=False

def foo(double[:, ::1] mat):
    cdef:
        Py_ssize_t i, j
        Py_ssize_t N = mat.shape[0]

    for i in range(N):
        for j in range(N):
            double element = mat[i][j]
            # ...

And, finally, to all extensions (globally) via setup.py

# exts is a list of Cython Extension objects
cythonize(exts,
          compiler_directives={
              "boundscheck" : False
          })

Note that this is not recommended for arbitrary directives unless it is agreed upon by the Finesse development team as a whole. Only directives which improve performance without having unexpected side-effects should be applied globally (e.g. cdivision).