Accessing Data from Python's DataFrame Interchange Protocol

May 14, 2023
cython, dataframe, python

Python's DataFrame interchange protocol specifies a zero-copy data interchange between Python DataFrame libraries, such as Pandas, Vaex, and Polars. This blog post explores how to read data from the DataFrame Interchange Protocol and perform a simple computation using Python's ctypes module. We use Cython to access the data without the GIL and perform the same calculation. This blog post is runnable as a notebook on Google Colab.

Polars DataFrame and the Interchange Protocol

First, we create a small DataFrame with a single column and missing values using Polars:

import polars as pl

df = pl.DataFrame(
    {
        "first": [None, 1, 2, 3, 8, None, 1, None, 10, -2, -1],
    },
    schema={"first": pl.Int64}
)

The __dataframe__ method of a DataFrame returns an object that implements the DataFrame Interchange Protocol:

df_protocol = df.__dataframe__()

We get the column through the protocol API and access the buffers that contain its data:

column = df_protocol.get_column_by_name("first")
buffer = column.get_buffers()

The returned object is a dictionary of buffers representing the data and the validity mask (the 'offsets' entry is only used for variable-length types, so it is None here):

from pprint import pprint

pprint(buffer)
{'data': (PyArrowBuffer({'bufsize': 88, 'ptr': 4491764957280, 'device': 'CPU'}),
          (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'offsets': None,
 'validity': (PyArrowBuffer({'bufsize': 2, 'ptr': 4491764629696, 'device': 'CPU'}),
              (<DtypeKind.BOOL: 20>, 1, 'b', '='))}

The 'data' entry consists of the underlying data's buffer and type. We compute the number of items in the buffer by dividing the buffer's size by the data type's size:

buffer_size_in_bits = buffer["data"][0].bufsize * 8
buffer_dtype_size = buffer["data"][1][1]
n_items = buffer_size_in_bits // buffer_dtype_size
print(n_items)
11
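
The second element of the 'data' entry, which we indexed into above, describes the dtype as a (kind, bit width, format string, byte order) tuple. As a small sketch, unpacking it makes the indexing explicit:

data_pyarrow_buffer, data_dtype = buffer["data"]
kind, bit_width, format_str, byteorder = data_dtype
print(data_dtype)
(<DtypeKind.INT: 0>, 64, 'l', '=')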

With the ctypes module, we access the buffer using the pointer address:

import ctypes
data = (ctypes.c_int64 * n_items).from_address(buffer["data"][0].ptr)

print(list(data))
[0, 1, 2, 3, 8, 0, 1, 0, 10, -2, -1]

Validity buffer

This array does not give the whole picture of the column. The original column contains null values, represented by a mask stored in the validity buffer. The interchange API tells us that the validity buffer is a bit mask:

column.describe_null
(<ColumnNullType.USE_BITMASK: 3>, 0)
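
A fuller consumer would branch on this value before touching the mask. As a small defensive sketch (in the interchange specification ColumnNullType is an IntEnum, and USE_BITMASK has the value 3):

null_kind, null_value = column.describe_null
# Only proceed when nulls are stored as an Arrow-style bit mask.
assert int(null_kind) == 3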

Looking at the validity buffer, we see that the buffer size is 2 bytes and the data type has a size of 1 bit, which is consistent with being a bit mask:

pprint(buffer["validity"])
(PyArrowBuffer({'bufsize': 2, 'ptr': 4491764629696, 'device': 'CPU'}),
 (<DtypeKind.BOOL: 20>, 1, 'b', '='))

There is no ctypes type that represents a single bit of data. However, we can use unsigned 8-bit integers to view the validity buffer:

n_items_validity = buffer["validity"][0].bufsize
validity = (ctypes.c_uint8 * n_items_validity).from_address(buffer["validity"][0].ptr)
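
To see how the rows are packed, we can print each mask byte as its individual bits, least-significant bit first; bit i of byte i // 8 holds the validity of row i. This is only an illustrative sketch, and the bits past row 10 are padding whose values are not guaranteed:

for byte_idx, byte in enumerate(validity):
    bits = [(byte >> bit) & 1 for bit in range(8)]  # least-significant bit first
    print(f"byte {byte_idx}: {bits}")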

The 2 bytes (16 bits) are enough space to store the bit mask for the original 11 items. We use bit-wise operations to read the mask:

for i in range(n_items):
    val_idx = i // 8
    val_remainder = i % 8
    val = (validity[val_idx] >> val_remainder) & 1

    end = ", " if i < n_items - 1 else ""
    print(f"{val}", end=end)
0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1

Computing the nan mean

With the data and validity buffers, we can use pure Python to compute the mean while ignoring the null values:

def nan_mean(data, validity):
    total = 0.0
    count = 0
    for i in range(len(data)):
        val_idx = i // 8
        val_remainder = i % 8
        val = (validity[val_idx] >> val_remainder) & 1
        if val:
            total += data[i]
            count += 1
    return total / count

print(nan_mean(data, validity))
2.75
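
The same computation vectorizes nicely with NumPy (a sketch, assuming NumPy is installed): the ctypes arrays are viewed zero-copy through the buffer protocol and the bit mask is expanded with np.unpackbits:

import numpy as np

values = np.frombuffer(data, dtype=np.int64)  # zero-copy view of the data buffer
bits = np.frombuffer(validity, dtype=np.uint8)
mask = np.unpackbits(bits, count=len(values), bitorder="little").astype(bool)

print(values[mask].mean())
2.75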

This value is consistent with the value computed using Polars from the original DataFrame, which also ignores the null values:

print(df.mean())
shape: (1, 1)
┌───────┐
│ first │
│ ---   │
│ f64   │
╞═══════╡
│ 2.75  │
└───────┘

One awesome fact about ctypes objects is that they also implement Python's Buffer Protocol. With the Buffer Protocol, we can write a Cython function that uses typed memoryviews to perform the nan-mean while releasing the GIL:

%load_ext Cython
%%cython
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def nan_mean_cython(long[::1] array, unsigned char[::1] validity):
    cdef:
        Py_ssize_t idx, val_idx, val_remainder
        double output = 0.0
        Py_ssize_t count = 0

    with nogil:
        for idx in range(array.shape[0]):
            val_idx = idx // 8
            val_remainder = idx % 8
            if (validity[val_idx] >> val_remainder) & 1:
                output += array[idx]
                count += 1

    return output / count
nan_mean_cython(data, validity)
2.75
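
Because the hot loop runs inside the nogil block, several calls can execute truly in parallel from Python threads. A minimal sketch of the mechanics (the toy array here is far too small to show a real speed-up):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda _: nan_mean_cython(data, validity), range(4)))
print(results)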

Why?

Python's DataFrame Interchange Protocol provides a uniform API for libraries to write against! In other words, the Cython function above works not only with Polars DataFrames but also with Pandas or Vaex DataFrames. Data access does not require the GIL, so we can release it and use native programming languages for acceleration!
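
To make that concrete, here is a sketch of a small wrapper that accepts any DataFrame implementing __dataframe__ and feeds its buffers to the Cython kernel. It assumes an int64 column whose nulls are reported as USE_BITMASK, as in the Polars example above; other null representations (for example, byte masks) would need extra handling:

def interchange_nan_mean(any_df, column_name):
    col = any_df.__dataframe__().get_column_by_name(column_name)
    bufs = col.get_buffers()

    data_buf, data_dtype = bufs["data"]
    n = (data_buf.bufsize * 8) // data_dtype[1]  # buffer size in bits / dtype bit width
    values = (ctypes.c_int64 * n).from_address(data_buf.ptr)

    validity_buf = bufs["validity"][0]
    mask = (ctypes.c_uint8 * validity_buf.bufsize).from_address(validity_buf.ptr)
    return nan_mean_cython(values, mask)

print(interchange_nan_mean(df, "first"))
2.75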
