fletcher package

Module contents

class fletcher.FletcherBaseArray

Bases: fletcher.string_mixin.StringSupportingExtensionArray

Pandas ExtensionArray implementation base backed by an Apache Arrow structure.

Attributes
T
base

Return base object of the underlying data.

dtype

Return the ExtensionDtype of this array.

nbytes

The number of bytes needed to store this object in memory.

ndim

Return the number of dimensions of the underlying data.

shape

Return the shape of the data.

size

Return the number of elements in this array.

Methods

all([skipna])

Compute whether all boolean values are True.

any([skipna])

Compute whether any boolean value is True.

argmax()

Return the index of maximum value.

argmin()

Return the index of minimum value.

argsort([ascending, kind, na_position])

Return the indices that would sort this array.

astype(dtype[, copy])

Cast to a NumPy array with ‘dtype’.

copy()

Return a copy of the array.

dropna()

Return ExtensionArray without NA values.

equals(other)

Return whether another array is equivalent to this array.

factorize([na_sentinel])

Encode the extension array as an enumerated type.

fillna([value, method, limit])

Fill NA/NaN values using the specified method.

isna()

Boolean NumPy array indicating if each value is missing.

ravel([order])

Return a flattened view on this array.

repeat(repeats[, axis])

Repeat elements of an ExtensionArray.

searchsorted(value[, side, sorter])

Find indices where elements should be inserted to maintain order.

shift([periods, fill_value])

Shift values by desired number.

sum([skipna])

Return the sum of the values.

take(indices, *[, allow_fill, fill_value])

Take elements from an array.

to_numpy([dtype, copy, na_value])

Convert to a NumPy ndarray.

transpose(*axes)

Return a transposed view on this array.

unique()

Compute the ExtensionArray of unique values.

value_counts([dropna])

Return a Series containing counts of each unique value.

view([dtype])

Return a view on the array.

all(skipna: bool = False) → Optional[bool]

Compute whether all boolean values are True.

any(skipna: bool = False, **kwargs) → Optional[bool]

Compute whether any boolean value is True.

astype(dtype, copy=True)

Cast to a NumPy array with ‘dtype’.

Parameters
dtype : str or dtype

Typecode or data-type to which the array is cast.

copy : bool, default True

Whether to copy the data, even if not necessary. If False, a copy is made only if the old dtype does not match the new dtype.

Returns
array : ndarray

NumPy ndarray with ‘dtype’ for its dtype.

property base

Return base object of the underlying data.

property dtype

Return the ExtensionDtype of this array.

isna() → numpy.ndarray

Boolean NumPy array indicating if each value is missing.

This should return a 1-D array the same length as ‘self’.

property ndim

Return the number of dimensions of the underlying data.

property shape

Return the shape of the data.

property size

Return the number of elements in this array.

Returns
size : int

sum(skipna: bool = True)

Return the sum of the values.

unique()

Compute the ExtensionArray of unique values.

It relies on pyarrow.ChunkedArray.unique and, if that fails, falls back to the naive implementation.

Returns
uniques : ExtensionArray

value_counts(dropna: bool = True) → pandas.core.series.Series

Return a Series containing counts of each unique value.

Parameters
dropna : bool, default True

Don’t include counts of missing values.

Returns
counts : Series

See also

Series.value_counts
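The effect of dropna described above can be sketched with the standard library. This is only an illustration of the documented behaviour, not fletcher's implementation: value_counts_sketch is a hypothetical helper, None stands in for a missing value, and a plain dict stands in for the returned Series.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def value_counts_sketch(values: Iterable[Any], dropna: bool = True) -> Dict[Any, int]:
    """Count each distinct value, optionally leaving out missing ones (None)."""
    counts = Counter(v for v in values if not (dropna and v is None))
    # most_common() orders the entries from most to least frequent.
    return dict(counts.most_common())


data = ["a", None, "b", "a"]
without_na = value_counts_sketch(data)              # missing values are skipped
with_na = value_counts_sketch(data, dropna=False)   # None gets its own count
```

With dropna=False the missing value is counted like any other entry.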
class fletcher.FletcherBaseDtype(arrow_dtype: pyarrow.lib.DataType)

Bases: pandas.core.dtypes.base.ExtensionDtype

Dtype base for a pandas ExtensionArray backed by an Apache Arrow structure.

Attributes
itemsize
kind

Return a character code (one of ‘biufcmMOSUV’), default ‘O’.

name

Return a string identifying the data type.

names

Ordered list of field names, or None if there are no fields.

type

Return the scalar type for the array, e.g. int.

Methods

construct_array_type()

Return the array type associated with this dtype.

construct_from_string(string)

Construct this type from a string.

example()

Get a simple array with example content.

is_dtype(dtype)

Check if we match ‘dtype’.

example()

Get a simple array with example content.

property itemsize
property kind

Return a character code (one of ‘biufcmMOSUV’), default ‘O’.

This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.

See also

numpy.dtype.kind

na_value = <NA>

property name

Return a string identifying the data type.

Will be used for display in, e.g., Series.dtype.

property type

Return the scalar type for the array, e.g. int.

It is expected that ExtensionArray[item] returns an instance of ExtensionDtype.type for a scalar item.

class fletcher.FletcherChunkedArray(array, dtype=None, copy=None)

Bases: fletcher.base.FletcherBaseArray

Pandas ExtensionArray implementation backed by Apache Arrow.

Attributes
T
base

Return base object of the underlying data.

dtype

Return the ExtensionDtype of this array.

nbytes

Return the number of bytes needed to store this object in memory.

ndim

Return the number of dimensions of the underlying data.

shape

Return the shape of the data.

size

Return the number of elements in this array.

Methods

all([skipna])

Compute whether all boolean values are True.

any([skipna])

Compute whether any boolean value is True.

argmax()

Return the index of maximum value.

argmin()

Return the index of minimum value.

argsort([ascending, kind, na_position])

Return the indices that would sort this array.

astype(dtype[, copy])

Cast to a NumPy array with ‘dtype’.

copy()

Return a copy of the array.

dropna()

Return ExtensionArray without NA values.

equals(other)

Return whether another array is equivalent to this array.

factorize([na_sentinel])

Encode the extension array as an enumerated type.

fillna([value, method, limit])

Fill NA/NaN values using the specified method.

flatten()

Flatten the array.

isna()

Boolean NumPy array indicating if each value is missing.

ravel([order])

Return a flattened view on this array.

repeat(repeats[, axis])

Repeat elements of an ExtensionArray.

searchsorted(value[, side, sorter])

Find indices where elements should be inserted to maintain order.

shift([periods, fill_value])

Shift values by desired number.

sum([skipna])

Return the sum of the values.

take(indices[, allow_fill, fill_value])

Take elements from an array.

to_numpy([dtype, copy, na_value])

Convert to a NumPy ndarray.

transpose(*axes)

Return a transposed view on this array.

unique()

Compute the ExtensionArray of unique values.

value_counts([dropna])

Return a Series containing counts of each unique value.

view([dtype])

Return a view on the array.

copy() → pandas.core.arrays.base.ExtensionArray

Return a copy of the array.

Parameters
deep : bool, default False

Also copy the underlying data backing this array.

Returns
ExtensionArray

factorize(na_sentinel=-1)

Encode the extension array as an enumerated type.

Parameters
na_sentinel : int, default -1

Value to use in the codes array to indicate missing values.

Returns
codes : ndarray

An integer NumPy array that’s an indexer into the original ExtensionArray.

uniques : ExtensionArray

An ExtensionArray containing the unique values of self.

Note

uniques will not contain an entry for the NA value of the ExtensionArray if there are any missing values present in self.

See also

factorize

Top-level factorize method that dispatches here.

Notes

pandas.factorize() offers a sort keyword as well.
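The relationship between codes, uniques and na_sentinel can be sketched on a plain Python list. This is a minimal illustration of the documented contract, not fletcher's implementation: factorize_sketch is a hypothetical helper, None stands in for a missing value, and uniques are kept in order of first appearance (no sorting).

```python
from typing import Any, Dict, List, Sequence, Tuple


def factorize_sketch(
    values: Sequence[Any], na_sentinel: int = -1
) -> Tuple[List[int], List[Any]]:
    """Encode values as integer codes plus the list of unique values."""
    uniques: List[Any] = []
    positions: Dict[Any, int] = {}
    codes: List[int] = []
    for v in values:
        if v is None:
            # Missing values get the sentinel and never appear in uniques.
            codes.append(na_sentinel)
            continue
        if v not in positions:
            positions[v] = len(uniques)
            uniques.append(v)
        codes.append(positions[v])
    return codes, uniques


codes, uniques = factorize_sketch(["b", None, "a", "b"])
# Every non-sentinel code is a position in uniques.
decoded = [uniques[c] if c != -1 else None for c in codes]
```

Decoding the codes through uniques recovers the original values, with the sentinel mapping back to a missing value.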

fillna(value=None, method=None, limit=None)

Fill NA/NaN values using the specified method.

Parameters
value : scalar, array-like

If a scalar value is passed it is used to fill all missing values. Alternatively, an array-like ‘value’ can be given. It’s expected that the array-like have the same length as ‘self’.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series:
  • pad / ffill: propagate last valid observation forward to next valid.

  • backfill / bfill: use next valid observation to fill gap.

limit : int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.

Returns
filled : ExtensionArray with NA/NaN filled
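How method and limit interact can be sketched on a plain Python list. This is an illustration of the documented semantics only, not fletcher's implementation: ffill_sketch and bfill_sketch are hypothetical helpers and None stands in for a missing value.

```python
from typing import Any, List, Optional, Sequence


def ffill_sketch(values: Sequence[Any], limit: Optional[int] = None) -> List[Any]:
    """Forward fill: propagate the last valid value into following gaps."""
    filled: List[Any] = []
    last_valid: Any = None
    have_valid = False
    run = 0  # number of consecutive missing values seen so far
    for v in values:
        if v is None:
            run += 1
            if have_valid and (limit is None or run <= limit):
                filled.append(last_valid)
            else:
                # No earlier value to use, or the gap exceeds the limit.
                filled.append(None)
        else:
            last_valid, have_valid, run = v, True, 0
            filled.append(v)
    return filled


def bfill_sketch(values: Sequence[Any], limit: Optional[int] = None) -> List[Any]:
    """Backward fill, expressed as a forward fill over the reversed input."""
    return ffill_sketch(list(values)[::-1], limit)[::-1]


gap = [1, None, None, None, 5]
forward = ffill_sketch(gap, limit=2)   # only the first two gap entries are filled
backward = bfill_sketch(gap, limit=1)  # only the entry next to 5 is filled
```

A gap longer than limit is filled only partially, starting from the side of the valid observation being propagated.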
flatten()

Flatten the array.

property nbytes

Return the number of bytes needed to store this object in memory.

take(indices: Union[Sequence[int], numpy.ndarray], allow_fill: bool = False, fill_value: Optional[Any] = None) → pandas.core.arrays.base.ExtensionArray

Take elements from an array.

Parameters
indices : sequence of integers

Indices to be taken.

allow_fill : bool, default False

How to handle negative values in indices.
  • False: negative values in indices indicate positional indices from the right (the default). This is similar to numpy.take().

  • True: negative values in indices indicate missing values. These values are set to fill_value. Any other negative values raise a ValueError.

fill_value : any, optional

Fill value to use for NA-indices when allow_fill is True. This may be None, in which case the default NA value for the type, self.dtype.na_value, is used. For many ExtensionArrays, there will be two representations of fill_value: a user-facing “boxed” scalar, and a low-level physical NA value. fill_value should be the user-facing version, and the implementation should handle translating that to the physical version for processing the take if necessary.

Returns
ExtensionArray

Raises
IndexError

When the indices are out of bounds for the array.

ValueError

When indices contains negative values other than -1 and allow_fill is True.

See also

numpy.take
pandas.api.extensions.take

Notes

ExtensionArray.take is called by Series.__getitem__, .loc, .iloc, when indices is a sequence of values. Additionally, it’s called by Series.reindex(), or any other method that causes realignment, with a fill_value.
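The index rules for allow_fill can be sketched on a plain Python list. This is only an illustration of the documented semantics, not fletcher's implementation: take_sketch is a hypothetical helper, and None stands in for the dtype's NA value.

```python
from typing import Any, List, Optional, Sequence


def take_sketch(
    values: Sequence[Any],
    indices: Sequence[int],
    allow_fill: bool = False,
    fill_value: Optional[Any] = None,
) -> List[Any]:
    """Apply the documented take() index rules to a plain sequence."""
    n = len(values)
    result: List[Any] = []
    for i in indices:
        if allow_fill:
            if i < -1:
                # With allow_fill=True, -1 is the only negative index allowed.
                raise ValueError("negative indices other than -1 are not allowed")
            if i == -1:
                result.append(fill_value)
                continue
        if not -n <= i < n:
            raise IndexError("index out of bounds")
        # With allow_fill=False, negative indices count from the right.
        result.append(values[i])
    return result


letters = ["a", "b", "c"]
from_right = take_sketch(letters, [0, -1])                  # -1 is the last element
as_missing = take_sketch(letters, [0, -1], allow_fill=True) # -1 marks a missing slot
```

The same index, -1, therefore means "last element" or "missing value" depending on allow_fill.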

class fletcher.FletcherChunkedDtype(arrow_dtype: pyarrow.lib.DataType)

Bases: fletcher.base.FletcherBaseDtype

Dtype for a pandas ExtensionArray backed by Apache Arrow’s pyarrow.ChunkedArray.

Attributes
itemsize
kind

Return a character code (one of ‘biufcmMOSUV’), default ‘O’.

name

Return a string identifying the data type.

names

Ordered list of field names, or None if there are no fields.

type

Return the scalar type for the array, e.g. int.

Methods

construct_array_type(*args)

Return the array type associated with this dtype.

construct_from_string(string)

Attempt to construct this type from a string.

example()

Get a simple array with example content.

is_dtype(dtype)

Check if we match ‘dtype’.

classmethod construct_array_type(*args) → Type[fletcher.base.FletcherChunkedArray]

Return the array type associated with this dtype.

Returns
type

classmethod construct_from_string(string: str) → fletcher.base.FletcherChunkedDtype

Attempt to construct this type from a string.

Parameters
string : str

Returns
self : instance of ‘cls’

Raises
TypeError

If a class cannot be constructed from this ‘string’.

Examples

If the extension dtype can be constructed without any arguments, the following may be an adequate implementation.

>>> @classmethod
... def construct_from_string(cls, string):
...     if string == cls.name:
...         return cls()
...     else:
...         raise TypeError("Cannot construct a '{}' from "
...                         "'{}'".format(cls, string))

class fletcher.FletcherContinuousArray(array, dtype=None, copy: Optional[bool] = None)

Bases: fletcher.base.FletcherBaseArray

Pandas ExtensionArray implementation backed by Apache Arrow’s pyarrow.Array.

Attributes
T
base

Return base object of the underlying data.

dtype

Return the ExtensionDtype of this array.

nbytes

Return the number of bytes needed to store this object in memory.

ndim

Return the number of dimensions of the underlying data.

shape

Return the shape of the data.

size

Return the number of elements in this array.

Methods

all([skipna])

Compute whether all boolean values are True.

any([skipna])

Compute whether any boolean value is True.

argmax()

Return the index of maximum value.

argmin()

Return the index of minimum value.

argsort([ascending, kind, na_position])

Return the indices that would sort this array.

astype(dtype[, copy])

Cast to a NumPy array with ‘dtype’.

copy()

Return a copy of the array.

dropna()

Return ExtensionArray without NA values.

equals(other)

Return whether another array is equivalent to this array.

factorize([na_sentinel])

Encode the extension array as an enumerated type.

fillna([value, method, limit])

Fill NA/NaN values using the specified method.

flatten()

Flatten the array.

isna()

Boolean NumPy array indicating if each value is missing.

ravel([order])

Return a flattened view on this array.

repeat(repeats[, axis])

Repeat elements of an ExtensionArray.

searchsorted(value[, side, sorter])

Find indices where elements should be inserted to maintain order.

shift([periods, fill_value])

Shift values by desired number.

sum([skipna])

Return the sum of the values.

take(indices[, allow_fill, fill_value])

Take elements from an array.

to_numpy([dtype, copy, na_value])

Convert to a NumPy ndarray.

transpose(*axes)

Return a transposed view on this array.

unique()

Compute the ExtensionArray of unique values.

value_counts([dropna])

Return a Series containing counts of each unique value.

view([dtype])

Return a view on the array.

copy() → pandas.core.arrays.base.ExtensionArray

Return a copy of the array.

Currently this is a shallow copy, since pyarrow arrays are supposed to be immutable.

Returns
ExtensionArray

factorize(na_sentinel=-1)

Encode the extension array as an enumerated type.

Parameters
na_sentinel : int, default -1

Value to use in the codes array to indicate missing values.

Returns
codes : ndarray

An integer NumPy array that’s an indexer into the original ExtensionArray.

uniques : ExtensionArray

An ExtensionArray containing the unique values of self.

Note

uniques will not contain an entry for the NA value of the ExtensionArray if there are any missing values present in self.

See also

factorize

Top-level factorize method that dispatches here.

Notes

pandas.factorize() offers a sort keyword as well.

fillna(value=None, method=None, limit=None)

Fill NA/NaN values using the specified method.

Parameters
value : scalar, array-like

If a scalar value is passed it is used to fill all missing values. Alternatively, an array-like ‘value’ can be given. It’s expected that the array-like have the same length as ‘self’.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series:
  • pad / ffill: propagate last valid observation forward to next valid.

  • backfill / bfill: use next valid observation to fill gap.

limit : int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.

Returns
filled : ExtensionArray with NA/NaN filled

flatten()

Flatten the array.

property nbytes

Return the number of bytes needed to store this object in memory.

take(indices: Union[Sequence[int], numpy.ndarray], allow_fill: bool = False, fill_value: Optional[Any] = None) → pandas.core.arrays.base.ExtensionArray

Take elements from an array.

Parameters
indices : sequence of integers

Indices to be taken.

allow_fill : bool, default False

How to handle negative values in indices.
  • False: negative values in indices indicate positional indices from the right (the default). This is similar to numpy.take().

  • True: negative values in indices indicate missing values. These values are set to fill_value. Any other negative values raise a ValueError.

fill_value : any, optional

Fill value to use for NA-indices when allow_fill is True. This may be None, in which case the default NA value for the type, self.dtype.na_value, is used. For many ExtensionArrays, there will be two representations of fill_value: a user-facing “boxed” scalar, and a low-level physical NA value. fill_value should be the user-facing version, and the implementation should handle translating that to the physical version for processing the take if necessary.

Returns
ExtensionArray

Raises
IndexError

When the indices are out of bounds for the array.

ValueError

When indices contains negative values other than -1 and allow_fill is True.

See also

numpy.take
pandas.api.extensions.take

Notes

ExtensionArray.take is called by Series.__getitem__, .loc, .iloc, when indices is a sequence of values. Additionally, it’s called by Series.reindex(), or any other method that causes realignment, with a fill_value.

class fletcher.FletcherContinuousDtype(arrow_dtype: pyarrow.lib.DataType)

Bases: fletcher.base.FletcherBaseDtype

Dtype for a pandas ExtensionArray backed by Apache Arrow’s pyarrow.Array.

Attributes
itemsize
kind

Return a character code (one of ‘biufcmMOSUV’), default ‘O’.

name

Return a string identifying the data type.

names

Ordered list of field names, or None if there are no fields.

type

Return the scalar type for the array, e.g. int.

Methods

construct_array_type(*args)

Return the array type associated with this dtype.

construct_from_string(string)

Attempt to construct this type from a string.

example()

Get a simple array with example content.

is_dtype(dtype)

Check if we match ‘dtype’.

classmethod construct_array_type(*args)

Return the array type associated with this dtype.

Returns
type

classmethod construct_from_string(string: str)

Attempt to construct this type from a string.

Parameters
string : str

Returns
self : instance of ‘cls’

Raises
TypeError

If a class cannot be constructed from this ‘string’.

Examples

If the extension dtype can be constructed without any arguments, the following may be an adequate implementation.

>>> @classmethod
... def construct_from_string(cls, string):
...     if string == cls.name:
...         return cls()
...     else:
...         raise TypeError("Cannot construct a '{}' from "
...                         "'{}'".format(cls, string))

class fletcher.TextAccessor(obj)

Bases: fletcher.string_array.TextAccessorBase

Accessor for pandas exposed as .fr_strx.

Methods

cat(others)

Concatenate strings in the Series/Index with given separator.

contains(pat[, case, regex])

Test if pattern or regex is contained within a string of a Series or Index.

endswith(pat)

Check whether a row ends with a certain pattern.

replace(pat, repl[, n, case, regex])

Replace occurrences of pattern/regex in the Series/Index with some other string.

slice([start, end, step])

Extract every step character from strings from start to end.

startswith(pat)

Check whether a row starts with a certain pattern.

strip([to_strip])

Strip whitespace from both ends of strings.

zfill(width)

Pad strings in the Series/Index by prepending ‘0’ characters.

count

isalnum

isalpha

isdecimal

isdigit

islower

isnumeric

isspace

istitle

isupper

cat(others: Optional[fletcher.base.FletcherBaseArray]) → pandas.core.series.Series

Concatenate strings in the Series/Index with given separator.

If others is specified, this function concatenates the Series/Index and elements of others element-wise. If others is not passed, then all values in the Series/Index are concatenated into a single string with a given sep.

contains(pat: str, case: bool = True, regex: bool = True) → pandas.core.series.Series

Test if pattern or regex is contained within a string of a Series or Index.

Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.

This implementation differs from the one in pandas:
  • We always return a missing value for missing data.

  • You cannot pass flags for the regular expression module.

Parameters
pat : str

Character sequence or regular expression.

case : bool, default True

If True, case sensitive.

regex : bool, default True

If True, assumes the pat is a regular expression.

If False, treats the pat as a literal string.

Returns
Series or Index of boolean values

A Series or Index of boolean values indicating whether the given pattern is contained within the string of each element of the Series or Index.
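The documented handling of missing data and of the case and regex flags can be sketched on a plain list of strings. This is an illustration only, not fletcher's implementation: contains_sketch is a hypothetical helper, None stands in for a missing value, and Python's re module is assumed as the regular expression engine, which may differ in detail from the one fletcher uses.

```python
import re
from typing import List, Optional, Sequence


def contains_sketch(
    values: Sequence[Optional[str]],
    pat: str,
    case: bool = True,
    regex: bool = True,
) -> List[Optional[bool]]:
    """Return True/False per string, and None wherever the input is missing."""
    flags = 0 if case else re.IGNORECASE
    # With regex=False the pattern is escaped so it matches literally.
    pattern = re.compile(pat if regex else re.escape(pat), flags)
    return [None if v is None else pattern.search(v) is not None for v in values]


fruits = ["apple", None, "Banana"]
sensitive = contains_sketch(fruits, "an")
insensitive = contains_sketch(fruits, "APP", case=False)
literal = contains_sketch(["a.b", "axb"], "a.", regex=False)
```

Missing inputs stay missing in the result instead of being coerced to False.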

count(pat: str, regex: bool = True) → pandas.core.series.Series

endswith(pat)

Check whether a row ends with a certain pattern.

isalnum()
isalpha()
isdecimal()
isdigit()
islower()
isnumeric()
isspace()
istitle()
isupper()
replace(pat: str, repl: str, n: int = -1, case: bool = True, regex: bool = True)

Replace occurrences of pattern/regex in the Series/Index with some other string. Equivalent to str.replace() or re.sub().

Return a string Series where in each row the occurrences of the given pattern or regex pat are replaced by repl.

This implementation differs from the one in pandas:
  • We always return a missing value for missing data.

  • You cannot pass flags for the regular expression module.

Parameters
pat : str

Character sequence or regular expression.

repl : str

Replacement string.

n : int

Number of replacements to make from start.

case : bool, default True

If True, case sensitive.

regex : bool, default True

If True, assumes the pat is a regular expression. If False, treats the pat as a literal string.

Returns
Series of string values.

slice(start=0, end=None, step=1)

Extract every step character from strings from start to end.

startswith(pat)

Check whether a row starts with a certain pattern.

strip(to_strip=None)

Strip whitespace from both ends of strings.

zfill(width: int) → pandas.core.series.Series

Pad strings in the Series/Index by prepending ‘0’ characters.

fletcher.pandas_from_arrow(arrow_object: Union[pyarrow.lib.RecordBatch, pyarrow.lib.Table, pyarrow.lib.Array, pyarrow.lib.ChunkedArray], continuous: bool = False)

Convert an Arrow object to its pandas equivalent using fletcher.

The conversion rules are:
  • {RecordBatch, Table} -> DataFrame

  • {Array, ChunkedArray} -> Series

Parameters
arrow_object : RecordBatch, Table, Array or ChunkedArray

Object to be converted.

continuous : bool

Use FletcherContinuousArray instead of FletcherChunkedArray.

fletcher.read_parquet(path, columns: Optional[List[str]] = None, continuous: bool = False) → pandas.core.frame.DataFrame

Load a parquet object from the file path, returning a DataFrame with fletcher columns.

Parameters
path : str or file-like

continuous : bool

Use FletcherContinuousArray instead of FletcherChunkedArray.

Returns
pd.DataFrame