fletcher

Use Apache Arrow backed columns in Pandas 0.23+ using the ExtensionArray interface.

Fletcher provides a generic implementation of the ExtensionDtype and ExtensionArray interfaces of Pandas for columns backed by Apache Arrow. By using it you can use any data type available in Apache Arrow natively in Pandas. Most prominently, fletcher provides native String und List types.

fletcher provides two, slightly different implementations. There is FletcherChunkedArray which is based on pyarrow.ChunkedArray, i.e. it consists of a collection of one or more continuous pyarrow.Array instances. Thus the backing memory can be a single memory region but it isn’t required. This makes operations like concat copy-free as the result will be a ChunkedArray that consists of the union of the chunks of the inputs. In contrast it makes algorithm implementation a bit more complex as we need to implement all algorithms to iterate over all rows of all the arrays, not simply 0..n-1 of a single array.

The other implementation is FletcherContinuousArray which is based on a single pyarrow.Array instance. While this makes operations like concat more costly, it greatly improves usability and extensibility by being a much simpler structure. One can always assume that the backing memory region is a continuous block of memory and iterate with simple 0..n-1 indexing over the rows.

At the moment, we don’t provide a default FletcherArray-named implementation as we are uncertain which of the two above implementations will be the most accepted one. Once we know to which implementation users converge, we will name that one FletcherArray.

In addition to bringing an alternative memory backend to NumPy, fletcher also provides high-performance operations on the new column types. It will either use the native implementation of an algorithm if provided in pyarrow or otherwise provide an implementation by itself using Numba.

Usage of fletcher columns is straightforward using Pandas’ default constructor:

import fletcher as fr
import pandas as pd

df = pd.DataFrame({
    'str_chunked': fr.FletcherChunkedArray(['a', 'b', 'c']),
    'str_continuous': fr.FletcherContinuousArray(['a', 'b', 'c']),
})

df.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 2 columns):
#  #   Column          Non-Null Count  Dtype
# ---  ------          --------------  -----
#  0   str_chunked     3 non-null      fletcher_chunked[string]
#  1   str_continuous  3 non-null      fletcher_continuous[string]
# dtypes: fletcher_chunked[string](1), fletcher_continuous[string](1)
# memory usage: 166.0 bytes

Indices and tables