fletcher package¶

Subpackages¶

fletcher.algorithms package

Submodules¶

Module contents¶

class fletcher.FletcherBaseArray¶

Bases: fletcher.string_mixin.StringSupportingExtensionArray

Pandas ExtensionArray implementation base backed by an Apache Arrow structure.

Attributes

T
base: Return base object of the underlying data.
dtype: Return the ExtensionDtype of this array.
nbytes: The number of bytes needed to store this object in memory.
ndim: Return the number of dimensions of the underlying data.
shape: Return the shape of the data.
size: Return the number of elements in this array.

Methods

`all`([skipna])	Compute whether all boolean values are True.
`any`([skipna])	Compute whether any boolean value is True.
`argmax`()	Return the index of maximum value.
`argmin`()	Return the index of minimum value.
`argsort`([ascending, kind, na_position])	Return the indices that would sort this array.
`astype`(dtype[, copy])	Cast to a NumPy array with ‘dtype’.
`copy`()	Return a copy of the array.
`dropna`()	Return ExtensionArray without NA values.
`equals`(other)	Return if another array is equivalent to this array.
`factorize`([na_sentinel])	Encode the extension array as an enumerated type.
`fillna`([value, method, limit])	Fill NA/NaN values using the specified method.
`isna`()	Boolean NumPy array indicating if each value is missing.
`ravel`([order])	Return a flattened view on this array.
`repeat`(repeats[, axis])	Repeat elements of an ExtensionArray.
`searchsorted`(value[, side, sorter])	Find indices where elements should be inserted to maintain order.
`shift`([periods, fill_value])	Shift values by desired number.
`sum`([skipna])	Return the sum of the values.
`take`(indices, *[, allow_fill, fill_value])	Take elements from an array.
`to_numpy`([dtype, copy, na_value])	Convert to a NumPy ndarray.
`transpose`(*axes)	Return a transposed view on this array.
`unique`()	Compute the ExtensionArray of unique values.
`value_counts`([dropna])	Return a Series containing counts of each unique value.
`view`([dtype])	Return a view on the array.

all(skipna: bool = False) → Optional[bool]¶: Compute whether all boolean values are True.

any(skipna: bool = False, **kwargs) → Optional[bool]¶: Compute whether any boolean value is True.

astype(dtype, copy=True)¶

Cast to a NumPy array with ‘dtype’.

Parameters

dtype (str or dtype): Typecode or data-type to which the array is cast.
copy (bool, default True): Whether to copy the data, even if not necessary. If False, a copy is made only if the old dtype does not match the new dtype.

Returns

array (ndarray): NumPy ndarray with ‘dtype’ for its dtype.

property base¶: Return base object of the underlying data.

property dtype¶: Return the ExtensionDtype of this array.

isna() → numpy.ndarray¶

Boolean NumPy array indicating if each value is missing.

This should return a 1-D array the same length as ‘self’.

property ndim¶: Return the number of dimensions of the underlying data.

property shape¶: Return the shape of the data.

property size¶

Return the number of elements in this array.

Returns

size (int)

sum(skipna: bool = True)¶: Return the sum of the values.

unique()¶

Compute the ExtensionArray of unique values.

It relies on pyarrow.ChunkedArray.unique and, if that fails, falls back to a naive implementation.

Returns

uniques (ExtensionArray)

value_counts(dropna: bool = True) → pandas.core.series.Series¶

Return a Series containing counts of each unique value.

Parameters

dropna (bool, default True): Don’t include counts of missing values.

Returns

counts (Series)

See also

Series.value_counts
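
The effect of dropna can be sketched in plain Python. This is only an illustration of the documented behaviour, not fletcher's implementation: value_counts_sketch is a hypothetical helper, None stands in for a missing value, and a dict stands in for the returned Series.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence


def value_counts_sketch(
    values: Sequence[Optional[Hashable]], dropna: bool = True
) -> Dict[Optional[Hashable], int]:
    """Count each distinct value; None plays the role of a missing value."""
    counts = Counter(values)
    if dropna:
        # Remove the entry for missing values, if there is one.
        counts.pop(None, None)
    return dict(counts)


print(value_counts_sketch(["a", "b", None, "a"]))                # {'a': 2, 'b': 1}
print(value_counts_sketch(["a", "b", None, "a"], dropna=False))  # {'a': 2, 'b': 1, None: 1}
```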

class fletcher.FletcherBaseDtype(arrow_dtype: pyarrow.lib.DataType)¶

Bases: pandas.core.dtypes.base.ExtensionDtype

Dtype base for a pandas ExtensionArray backed by an Apache Arrow structure.

Attributes

itemsize
kind: Return a character code (one of ‘biufcmMOSUV’), default ‘O’.
name: Return a string identifying the data type.
names: Ordered list of field names, or None if there are no fields.
type: Return the scalar type for the array, e.g. int.

Methods

`construct_array_type`()	Return the array type associated with this dtype.
`construct_from_string`(string)	Construct this type from a string.
`example`()	Get a simple array with example content.
`is_dtype`(dtype)	Check if we match ‘dtype’.

example()¶: Get a simple array with example content.

property itemsize¶

property kind¶

Return a character code (one of ‘biufcmMOSUV’), default ‘O’.

This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.

See also

numpy.dtype.kind

na_value = <NA>¶

property name¶

Return a string identifying the data type.

Will be used for display in, e.g., Series.dtype.

property type¶

Return the scalar type for the array, e.g. int.

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item.

class fletcher.FletcherChunkedArray(array, dtype=None, copy=None)¶

Bases: fletcher.base.FletcherBaseArray

Pandas ExtensionArray implementation backed by Apache Arrow.

Attributes

T
base: Return base object of the underlying data.
dtype: Return the ExtensionDtype of this array.
nbytes: Return the number of bytes needed to store this object in memory.
ndim: Return the number of dimensions of the underlying data.
shape: Return the shape of the data.
size: Return the number of elements in this array.

Methods

`all`([skipna])	Compute whether all boolean values are True.
`any`([skipna])	Compute whether any boolean value is True.
`argmax`()	Return the index of maximum value.
`argmin`()	Return the index of minimum value.
`argsort`([ascending, kind, na_position])	Return the indices that would sort this array.
`astype`(dtype[, copy])	Cast to a NumPy array with ‘dtype’.
`copy`()	Return a copy of the array.
`dropna`()	Return ExtensionArray without NA values.
`equals`(other)	Return if another array is equivalent to this array.
`factorize`([na_sentinel])	Encode the extension array as an enumerated type.
`fillna`([value, method, limit])	Fill NA/NaN values using the specified method.
`flatten`()	Flatten the array.
`isna`()	Boolean NumPy array indicating if each value is missing.
`ravel`([order])	Return a flattened view on this array.
`repeat`(repeats[, axis])	Repeat elements of an ExtensionArray.
`searchsorted`(value[, side, sorter])	Find indices where elements should be inserted to maintain order.
`shift`([periods, fill_value])	Shift values by desired number.
`sum`([skipna])	Return the sum of the values.
`take`(indices[, allow_fill, fill_value])	Take elements from an array.
`to_numpy`([dtype, copy, na_value])	Convert to a NumPy ndarray.
`transpose`(*axes)	Return a transposed view on this array.
`unique`()	Compute the ExtensionArray of unique values.
`value_counts`([dropna])	Return a Series containing counts of each unique value.
`view`([dtype])	Return a view on the array.

copy() → pandas.core.arrays.base.ExtensionArray¶

Return a copy of the array.

Parameters

deep (bool, default False): Also copy the underlying data backing this array.

Returns

ExtensionArray

factorize(na_sentinel=-1)¶

Encode the extension array as an enumerated type.

Parameters

na_sentinel (int, default -1): Value to use in the codes array to indicate missing values.

Returns

codes (ndarray): An integer NumPy array that’s an indexer into the original ExtensionArray.
uniques (ExtensionArray): An ExtensionArray containing the unique values of self.

Note

uniques will not contain an entry for the NA value of the ExtensionArray if there are any missing values present in self.

See also

factorize: Top-level factorize method that dispatches here.

Notes

pandas.factorize() offers a sort keyword as well.
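
The relationship between codes, uniques and na_sentinel can be sketched in plain Python. This is a minimal illustration of the documented behaviour, not fletcher's implementation: factorize_sketch is a hypothetical helper, None stands in for a missing value, and lists stand in for the returned arrays.

```python
from typing import Hashable, List, Optional, Sequence, Tuple


def factorize_sketch(
    values: Sequence[Optional[Hashable]], na_sentinel: int = -1
) -> Tuple[List[int], List[Hashable]]:
    """Encode values as integer codes plus the list of distinct values."""
    uniques: List[Hashable] = []
    position = {}  # value -> index into uniques
    codes: List[int] = []
    for value in values:
        if value is None:
            # Missing values get the sentinel and no entry in uniques.
            codes.append(na_sentinel)
        else:
            if value not in position:
                position[value] = len(uniques)
                uniques.append(value)
            codes.append(position[value])
    return codes, uniques


codes, uniques = factorize_sketch(["b", None, "a", "b"])
print(codes)    # [0, -1, 1, 0]
print(uniques)  # ['b', 'a']
```

In this sketch uniques keeps the order of first appearance, and uniques[code] recovers every non-missing value.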

fillna(value=None, method=None, limit=None)¶

Fill NA/NaN values using the specified method.

Parameters

value (scalar, array-like): If a scalar value is passed it is used to fill all missing values. Alternatively, an array-like ‘value’ can be given. It’s expected that the array-like have the same length as ‘self’.
method ({‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None): Method to use for filling holes in reindexed Series. pad / ffill: propagate the last valid observation forward to the next valid one. backfill / bfill: use the next valid observation to fill the gap.
limit (int, default None): If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.

Returns

filled (ExtensionArray with NA/NaN filled)
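
How limit interacts with a fill method can be sketched in plain Python. The example covers only the pad / ffill case and is an illustration of the documented behaviour, not fletcher's implementation: ffill_sketch is a hypothetical helper and None stands in for a missing value.

```python
from typing import Any, List, Optional, Sequence


def ffill_sketch(values: Sequence[Any], limit: Optional[int] = None) -> List[Any]:
    """Forward-fill missing values, filling at most `limit` per gap."""
    filled: List[Any] = []
    last_valid = None
    run = 0  # missing values filled so far in the current gap
    for value in values:
        if value is not None:
            last_valid = value
            run = 0
            filled.append(value)
        elif last_valid is not None and (limit is None or run < limit):
            run += 1
            filled.append(last_valid)
        else:
            # Leading gap, or the limit for this gap is exhausted.
            filled.append(None)
    return filled


print(ffill_sketch([1, None, None, None, 4]))           # [1, 1, 1, 1, 4]
print(ffill_sketch([1, None, None, None, 4], limit=2))  # [1, 1, 1, None, 4]
```

With limit=2 the gap of three missing values is only partially filled, as described for limit above.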

flatten()¶: Flatten the array.

property nbytes¶: Return the number of bytes needed to store this object in memory.

take(indices: Union[Sequence[int], numpy.ndarray], allow_fill: bool = False, fill_value: Optional[Any] = None) → pandas.core.arrays.base.ExtensionArray¶

Take elements from an array.

Parameters

indices (sequence of integers): Indices to be taken.

allow_fill (bool, default False): How to handle negative values in indices.

* False: negative values in indices indicate positional indices from the right (the default). This is similar to numpy.take().
* True: negative values in indices indicate missing values. These values are set to fill_value. Any other negative values raise a ValueError.

fill_value (any, optional): Fill value to use for NA-indices when allow_fill is True. This may be None, in which case the default NA value for the type, self.dtype.na_value, is used. For many ExtensionArrays, there will be two representations of fill_value: a user-facing “boxed” scalar, and a low-level physical NA value. fill_value should be the user-facing version, and the implementation should handle translating that to the physical version for processing the take if necessary.

Returns

ExtensionArray

Raises

IndexError: When the indices are out of bounds for the array.
ValueError: When indices contains negative values other than -1 and allow_fill is True.

See also

numpy.take
pandas.api.extensions.take

Notes

ExtensionArray.take is called by Series.__getitem__, .loc, and .iloc when indices is a sequence of values. Additionally, it’s called by Series.reindex(), or any other method that causes realignment, with a fill_value.
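
The two meanings of negative indices can be sketched in plain Python. This is an illustration of the documented behaviour under simplifying assumptions, not fletcher's implementation: take_sketch is a hypothetical helper working on lists, and None stands in for the default NA value.

```python
from typing import Any, List, Optional, Sequence


def take_sketch(
    values: Sequence[Any],
    indices: Sequence[int],
    allow_fill: bool = False,
    fill_value: Optional[Any] = None,
) -> List[Any]:
    """Take elements by position, optionally treating -1 as 'missing'."""
    n = len(values)
    result: List[Any] = []
    for index in indices:
        if allow_fill and index < 0:
            if index != -1:
                raise ValueError("only -1 is allowed as a negative index when allow_fill=True")
            result.append(fill_value)
            continue
        if not -n <= index < n:
            raise IndexError("index {} is out of bounds for length {}".format(index, n))
        # With allow_fill=False, negative indices count from the right.
        result.append(values[index])
    return result


print(take_sketch(["a", "b", "c"], [0, -1]))                   # ['a', 'c']
print(take_sketch(["a", "b", "c"], [0, -1], allow_fill=True))  # ['a', None]
```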

class fletcher.FletcherChunkedDtype(arrow_dtype: pyarrow.lib.DataType)¶

Bases: fletcher.base.FletcherBaseDtype

Dtype for a pandas ExtensionArray backed by Apache Arrow’s pyarrow.ChunkedArray.

Attributes

itemsize
kind: Return a character code (one of ‘biufcmMOSUV’), default ‘O’.
name: Return a string identifying the data type.
names: Ordered list of field names, or None if there are no fields.
type: Return the scalar type for the array, e.g. int.

Methods

`construct_array_type`(*args)	Return the array type associated with this dtype.
`construct_from_string`(string)	Attempt to construct this type from a string.
`example`()	Get a simple array with example content.
`is_dtype`(dtype)	Check if we match ‘dtype’.

classmethod construct_array_type(*args) → Type[fletcher.base.FletcherChunkedArray]¶

Return the array type associated with this dtype.

Returns

type

classmethod construct_from_string(string: str) → fletcher.base.FletcherChunkedDtype ¶

Attempt to construct this type from a string.

Parameters

string (str)

Returns

self (instance of ‘cls’)

Raises

TypeError: If a class cannot be constructed from this ‘string’.

Examples

If the extension dtype can be constructed without any arguments, the following may be an adequate implementation.

>>> @classmethod
... def construct_from_string(cls, string):
...     if string == cls.name:
...         return cls()
...     else:
...         raise TypeError("Cannot construct a '{}' from "
...                         "'{}'".format(cls, string))

class fletcher.FletcherContinuousArray(array, dtype=None, copy: Optional[bool] = None)¶

Bases: fletcher.base.FletcherBaseArray

Pandas ExtensionArray implementation backed by Apache Arrow’s pyarrow.Array.

Attributes

T
base: Return base object of the underlying data.
dtype: Return the ExtensionDtype of this array.
nbytes: Return the number of bytes needed to store this object in memory.
ndim: Return the number of dimensions of the underlying data.
shape: Return the shape of the data.
size: Return the number of elements in this array.

Methods

`all`([skipna])	Compute whether all boolean values are True.
`any`([skipna])	Compute whether any boolean value is True.
`argmax`()	Return the index of maximum value.
`argmin`()	Return the index of minimum value.
`argsort`([ascending, kind, na_position])	Return the indices that would sort this array.
`astype`(dtype[, copy])	Cast to a NumPy array with ‘dtype’.
`copy`()	Return a copy of the array.
`dropna`()	Return ExtensionArray without NA values.
`equals`(other)	Return if another array is equivalent to this array.
`factorize`([na_sentinel])	Encode the extension array as an enumerated type.
`fillna`([value, method, limit])	Fill NA/NaN values using the specified method.
`flatten`()	Flatten the array.
`isna`()	Boolean NumPy array indicating if each value is missing.
`ravel`([order])	Return a flattened view on this array.
`repeat`(repeats[, axis])	Repeat elements of an ExtensionArray.
`searchsorted`(value[, side, sorter])	Find indices where elements should be inserted to maintain order.
`shift`([periods, fill_value])	Shift values by desired number.
`sum`([skipna])	Return the sum of the values.
`take`(indices[, allow_fill, fill_value])	Take elements from an array.
`to_numpy`([dtype, copy, na_value])	Convert to a NumPy ndarray.
`transpose`(*axes)	Return a transposed view on this array.
`unique`()	Compute the ExtensionArray of unique values.
`value_counts`([dropna])	Return a Series containing counts of each unique value.
`view`([dtype])	Return a view on the array.

copy() → pandas.core.arrays.base.ExtensionArray¶

Return a copy of the array.

Currently this is a shallow copy, since pyarrow arrays are supposed to be immutable.

Returns

ExtensionArray

factorize(na_sentinel=-1)¶

Encode the extension array as an enumerated type.

Parameters

na_sentinel (int, default -1): Value to use in the codes array to indicate missing values.

Returns

codes (ndarray): An integer NumPy array that’s an indexer into the original ExtensionArray.
uniques (ExtensionArray): An ExtensionArray containing the unique values of self.

Note

uniques will not contain an entry for the NA value of the ExtensionArray if there are any missing values present in self.

See also

factorize: Top-level factorize method that dispatches here.

Notes

pandas.factorize() offers a sort keyword as well.

fillna(value=None, method=None, limit=None)¶

Fill NA/NaN values using the specified method.

Parameters

value (scalar, array-like): If a scalar value is passed it is used to fill all missing values. Alternatively, an array-like ‘value’ can be given. It’s expected that the array-like have the same length as ‘self’.
method ({‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None): Method to use for filling holes in reindexed Series. pad / ffill: propagate the last valid observation forward to the next valid one. backfill / bfill: use the next valid observation to fill the gap.
limit (int, default None): If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.

Returns

filled (ExtensionArray with NA/NaN filled)

flatten()¶: Flatten the array.

property nbytes¶: Return the number of bytes needed to store this object in memory.

take(indices: Union[Sequence[int], numpy.ndarray], allow_fill: bool = False, fill_value: Optional[Any] = None) → pandas.core.arrays.base.ExtensionArray¶

Take elements from an array.

Parameters

indices (sequence of integers): Indices to be taken.

allow_fill (bool, default False): How to handle negative values in indices.

* False: negative values in indices indicate positional indices from the right (the default). This is similar to numpy.take().
* True: negative values in indices indicate missing values. These values are set to fill_value. Any other negative values raise a ValueError.

fill_value (any, optional): Fill value to use for NA-indices when allow_fill is True. This may be None, in which case the default NA value for the type, self.dtype.na_value, is used. For many ExtensionArrays, there will be two representations of fill_value: a user-facing “boxed” scalar, and a low-level physical NA value. fill_value should be the user-facing version, and the implementation should handle translating that to the physical version for processing the take if necessary.

Returns

ExtensionArray

Raises

IndexError: When the indices are out of bounds for the array.
ValueError: When indices contains negative values other than -1 and allow_fill is True.

See also

numpy.take
pandas.api.extensions.take

Notes

ExtensionArray.take is called by Series.__getitem__, .loc, and .iloc when indices is a sequence of values. Additionally, it’s called by Series.reindex(), or any other method that causes realignment, with a fill_value.

class fletcher.FletcherContinuousDtype(arrow_dtype: pyarrow.lib.DataType)¶

Bases: fletcher.base.FletcherBaseDtype

Dtype for a pandas ExtensionArray backed by Apache Arrow’s pyarrow.Array.

Attributes

itemsize
kind: Return a character code (one of ‘biufcmMOSUV’), default ‘O’.
name: Return a string identifying the data type.
names: Ordered list of field names, or None if there are no fields.
type: Return the scalar type for the array, e.g. int.

Methods

`construct_array_type`(*args)	Return the array type associated with this dtype.
`construct_from_string`(string)	Attempt to construct this type from a string.
`example`()	Get a simple array with example content.
`is_dtype`(dtype)	Check if we match ‘dtype’.

classmethod construct_array_type(*args)¶

Return the array type associated with this dtype.

Returns

type

classmethod construct_from_string(string: str)¶

Attempt to construct this type from a string.

Parameters

string

Returns

self (instance of ‘cls’)

Raises

TypeError: If a class cannot be constructed from this ‘string’.

Examples

If the extension dtype can be constructed without any arguments, the following may be an adequate implementation.

>>> @classmethod
... def construct_from_string(cls, string):
...     if string == cls.name:
...         return cls()
...     else:
...         raise TypeError("Cannot construct a '{}' from "
...                         "'{}'".format(cls, string))

class fletcher.TextAccessor(obj)¶

Bases: fletcher.string_array.TextAccessorBase

Accessor for pandas exposed as .fr_strx.

Methods

`cat`(others)	Concatenate strings in the Series/Index with given separator.
`contains`(pat[, case, regex])	Test if pattern or regex is contained within a string of a Series or Index.
`endswith`(pat)	Check whether a row ends with a certain pattern.
`replace`(pat, repl[, n, case, regex])	Replace occurrences of pattern/regex in the Series/Index with some other string.
`slice`([start, end, step])	Extract every step character from strings from start to end.
`startswith`(pat)	Check whether a row starts with a certain pattern.
`strip`([to_strip])	Strip whitespace from both ends of strings.
`zfill`(width)	Pad strings in the Series/Index by prepending ‘0’ characters.

count
isalnum
isalpha
isdecimal
isdigit
islower
isnumeric
isspace
istitle
isupper

cat(others: Optional[fletcher.base.FletcherBaseArray]) → pandas.core.series.Series¶

Concatenate strings in the Series/Index with given separator.

If others is specified, this function concatenates the Series/Index and elements of others element-wise. If others is not passed, then all values in the Series/Index are concatenated into a single string with a given sep.

contains(pat: str, case: bool = True, regex: bool = True) → pandas.core.series.Series¶

Test if pattern or regex is contained within a string of a Series or Index.

Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.

This implementation differs from the one in pandas:

We always return a missing value for missing data.
You cannot pass flags for the regular expression module.

Parameters

pat (str): Character sequence or regular expression.
case (bool, default True): If True, case sensitive.
regex (bool, default True): If True, assumes the pat is a regular expression. If False, treats the pat as a literal string.

Returns

Series or Index of boolean values: A Series or Index of boolean values indicating whether the given pattern is contained within the string of each element of the Series or Index.
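
The interplay of case, regex and missing data can be sketched in plain Python with the standard re module. This is an illustration of the documented behaviour, not fletcher's implementation: contains_sketch is a hypothetical helper, None stands in for a missing value, and a list stands in for the returned Series.

```python
import re
from typing import List, Optional, Sequence


def contains_sketch(
    values: Sequence[Optional[str]], pat: str, case: bool = True, regex: bool = True
) -> List[Optional[bool]]:
    """Test each string for the pattern; missing input stays missing."""
    # A literal pattern is escaped so it can go through the same regex path.
    pattern = pat if regex else re.escape(pat)
    flags = 0 if case else re.IGNORECASE
    compiled = re.compile(pattern, flags)
    return [None if v is None else compiled.search(v) is not None for v in values]


print(contains_sketch(["apple", None, "Banana"], "an"))         # [False, None, True]
print(contains_sketch(["apple", None, "Banana"], "^a.*e$"))     # [True, None, False]
print(contains_sketch(["a.c", "abc"], "a.c", regex=False))      # [True, False]
print(contains_sketch(["apple"], "APP", case=False))            # [True]
```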

count(pat: str, regex: bool = True) → pandas.core.series.Series¶

endswith(pat)¶: Check whether a row ends with a certain pattern.

isalnum()¶

isalpha()¶

isdecimal()¶

isdigit()¶

islower()¶

isnumeric()¶

isspace()¶

istitle()¶

isupper()¶

replace(pat: str, repl: str, n: int = -1, case: bool = True, regex: bool = True)¶

Replace occurrences of pattern/regex in the Series/Index with some other string. Equivalent to str.replace() or re.sub().

Return a string Series in which, in each row, the occurrences of the given pattern or regex pat are replaced by repl.

This implementation differs from the one in pandas:

We always return a missing value for missing data.
You cannot pass flags for the regular expression module.

Parameters

pat (str): Character sequence or regular expression.
repl (str): Replacement string.
n (int): Number of replacements to make from start.
case (bool, default True): If True, case sensitive.
regex (bool, default True): If True, assumes the pat is a regular expression. If False, treats the pat as a literal string.

Returns

Series of string values.
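
The parameters can be sketched in plain Python with the standard re module. This is an illustration under stated assumptions, not fletcher's implementation: replace_sketch is a hypothetical helper, None stands in for a missing value, a negative n is taken to mean "replace all", and n == 0 is assumed to leave the strings unchanged.

```python
import re
from typing import List, Optional, Sequence


def replace_sketch(
    values: Sequence[Optional[str]],
    pat: str,
    repl: str,
    n: int = -1,
    case: bool = True,
    regex: bool = True,
) -> List[Optional[str]]:
    """Replace up to n occurrences of pat per string; missing stays missing."""
    if n == 0:
        # Assumption of this sketch: zero replacements requested.
        return list(values)
    flags = 0 if case else re.IGNORECASE
    compiled = re.compile(pat if regex else re.escape(pat), flags)
    count = 0 if n < 0 else n  # for re, count=0 means "replace all"
    # In literal mode, use a function so backslashes in repl are not
    # interpreted as group references.
    replacement = repl if regex else (lambda match: repl)
    return [None if v is None else compiled.sub(replacement, v, count=count) for v in values]


print(replace_sketch(["aaa", None], "a", "b"))         # ['bbb', None]
print(replace_sketch(["aaa", None], "a", "b", n=2))    # ['bba', None]
print(replace_sketch(["a.b"], ".", "-", regex=False))  # ['a-b']
```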

slice(start=0, end=None, step=1)¶: Extract every step character from strings from start to end.
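
The start, end and step arguments read like Python's own slice notation. A minimal plain-Python sketch of that reading follows; slice_sketch is a hypothetical helper, and passing missing values (None) through unchanged is an assumption of the sketch.

```python
from typing import List, Optional, Sequence


def slice_sketch(
    values: Sequence[Optional[str]], start: int = 0, end: Optional[int] = None, step: int = 1
) -> List[Optional[str]]:
    """Apply the same [start:end:step] slice to every string."""
    return [None if v is None else v[start:end:step] for v in values]


print(slice_sketch(["abcdef", None], start=1, end=5, step=2))  # ['bd', None]
```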

startswith(pat)¶: Check whether a row starts with a certain pattern.

strip(to_strip=None)¶: Strip whitespace from both ends of strings.

zfill(width: int) → pandas.core.series.Series¶: Pad strings in the Series/Index by prepending ‘0’ characters.

fletcher.pandas_from_arrow(arrow_object: Union[pyarrow.lib.RecordBatch, pyarrow.lib.Table, pyarrow.lib.Array, pyarrow.lib.ChunkedArray], continuous: bool = False)¶

Convert an Arrow object to its pandas equivalent using fletcher.

The conversion rules are:

{RecordBatch, Table} -> DataFrame
{Array, ChunkedArray} -> Series

Parameters

arrow_object (RecordBatch, Table, Array or ChunkedArray): Object to be converted.
continuous (bool): Use FletcherContinuousArray instead of FletcherChunkedArray.

fletcher.read_parquet(path, columns: Optional[List[str]] = None, continuous: bool = False) → pandas.core.frame.DataFrame¶

Load a parquet object from the file path, returning a DataFrame with fletcher columns.

Parameters

path (str or file-like)
continuous (bool): Use FletcherContinuousArray instead of FletcherChunkedArray.

Returns

pd.DataFrame