qufe.pdhandler module

pandas DataFrame utility functions for data analysis and manipulation.

This module provides utilities for: - Converting data types within DataFrames - Analyzing column structures across multiple DataFrames - Finding and extracting rows/columns with missing or empty data - Data quality validation and exploration - Integer allocation using mathematical methods - Value rebalancing to eliminate negative values - Proportional integer distribution across groups

Required dependencies:: pip install qufe[data]

This installs: pandas>=1.1.0, numpy>=1.17.0

class qufe.pdhandler.PandasHandler(default_exclude_cols: List[str] | None = None)[source]

Bases: object

pandas DataFrame utility handler for data analysis and manipulation.

Provides methods for converting data types, analyzing column structures, finding missing data, data quality validation, and mathematical allocation.

Parameters:: default_exclude_cols – Default columns to exclude from NA/empty checks. Can be overridden in individual method calls.
Raises:: ImportError – If pandas is not installed

Example

>>> handler = PandasHandler(default_exclude_cols=['id', 'created_at'])
>>> result = handler.convert_list_to_tuple_in_df(df)

__init__(default_exclude_cols: List[str] | None = None)[source]: Initialize PandasHandler with dependency validation.

allocate_integer_proportional(df, group_cols: str | List[str], total_col: str, weight_col: str, result_col: str = 'allocated', keep_intermediate: bool = False, int_dtype: str = 'int64') → object[source]

Distribute group totals as integers proportional to weights using Hamilton’s Method.

Each group distributes its total value (assumed identical across rows in the group) as integers proportional to weight values, ensuring the sum of allocated integers equals the group total exactly.

This method implements Hamilton’s Method (Largest Remainder Method), which: 1. Calculates proportional float values based on weights 2. Takes floor of each value as base allocation 3. Distributes remaining units to items with largest fractional remainders

Parameters:

df – Input DataFrame containing groups, totals, and weights
group_cols – Column(s) defining groups for distribution (str or list of str)
total_col – Column containing group total (must be identical within each group)
weight_col – Column containing positive weights for proportional distribution
result_col – Name for the result column containing integer allocations. Defaults to ‘allocated’.
keep_intermediate – If True, keep intermediate calculation columns for debugging (_raw, _base, _remainder, _rank_in_group, _remaining, _increment). Defaults to False.
int_dtype – Integer dtype for result column (‘int64’ or ‘int32’). Use ‘int32’ for memory savings when total values are below 2^31. Defaults to ‘int64’.

Returns:

DataFrame with proportional integer allocations in the result column

Raises:

TypeError – If input is not a pandas DataFrame
KeyError – If specified columns don’t exist in DataFrame

Notes

Assumes total_col values are identical within each group
Assumes weight_col contains positive values (or at least group sum > 0)
Uses Hamilton’s Method for fair integer distribution
Result sum per group equals total_col value exactly

Example

>>> handler = PandasHandler()
>>> # Single group column
>>> data = pd.DataFrame({
...     'group': ['A', 'A', 'A', 'B', 'B'],
...     'total': [100, 100, 100, 50, 50],
...     'weight': [30, 50, 20, 40, 60]
... })
>>> result = handler.allocate_integer_proportional(
...     data, 'group', 'total', 'weight'
... )
>>> print(result[['group', 'total', 'weight', 'allocated']])
>>> # Verify sum per group equals total
>>> print(result.groupby('group')['allocated'].sum())

>>> # Multiple group columns
>>> data = pd.DataFrame({
...     'region': ['East', 'East', 'West', 'West'],
...     'category': ['A', 'B', 'A', 'B'],
...     'total': [100, 100, 50, 50],
...     'weight': [60, 40, 30, 70]
... })
>>> result = handler.allocate_integer_proportional(
...     data, ['region', 'category'], 'total', 'weight'
... )

allocate_integer_remainder(df, group_cols: str | List[str], value_col: str, result_col: str = 'allocated', keep_intermediate: bool = False) → object[source]

Allocate fractional values to integers using the Largest Remainder Method.

Also known as Hamilton’s Method or Hare-Niemeyer method, this is a standard mathematical approach for proportional integer allocation while preserving group totals.

The method works by: 1. Taking the floor of each fractional value as the base allocation 2. Distributing remaining units to items with largest fractional parts 3. Ensuring the sum of allocated integers equals the floor of group sum

Parameters:

df – Input DataFrame containing the data to allocate
group_cols – Column(s) to group by for allocation (str or list of str)
value_col – Column containing fractional values to be allocated
result_col – Name for the result column containing integer allocations. Defaults to ‘allocated’.
keep_intermediate – If True, keep intermediate calculation columns. Defaults to False.

Returns:

DataFrame with integer allocations in the result column

Raises:

TypeError – If input is not a pandas DataFrame
KeyError – If specified columns don’t exist in DataFrame

References

Example

>>> handler = PandasHandler()
>>> data = pd.DataFrame({
...     'group': ['A', 'A', 'B', 'B'],
...     'value': [2.7, 1.8, 3.3, 2.2]
... })
>>> result = handler.allocate_integer_remainder(data, 'group', 'value')
>>> print(result[['group', 'value', 'allocated']])

calculate_cumulative_balance(df, initial_col: str, inflow_col: str, outflow_col: str, segment_marker_col: str, result_col: str = 'balance', group_cols: str | List[str] | None = None, keep_intermediate: bool = False, replace_initial_col: bool = True) → object[source]

Calculate cumulative balance with segment-based reinitialization.

Computes running balance where each segment (marked by a flag column) starts with an initial value and accumulates inflows minus outflows. This is commonly used for tracking any cumulative flow with periodic resets.

The calculation follows these rules: - When segment marker is 1: balance = initial + inflow - outflow - When segment marker is 0: balance = previous_balance + inflow - outflow - Each group (if specified) is calculated independently

This method uses vectorized operations for efficient computation on large datasets, avoiding explicit loops through clever use of cumulative sums and groupby operations.

Parameters:

df – Input DataFrame with flow data (must be pre-sorted)
initial_col – Column containing initial values for each segment start
inflow_col – Column containing inflow amounts (additions)
outflow_col – Column containing outflow amounts (subtractions)
segment_marker_col – Column with 0/1 values marking segment starts (1=start)
result_col – Name for the result column. Defaults to ‘balance’.
group_cols – Optional column(s) to group by for independent calculations. Can be string or list of strings. Defaults to None.
keep_intermediate – If True, keep intermediate calculation columns. Defaults to False.
replace_initial_col – If True, replace initial column with previous balance column.

Returns:

DataFrame with cumulative balance in the result column

Raises:

TypeError – If input is not a pandas DataFrame
KeyError – If specified columns don’t exist in DataFrame
ValueError – If segment_marker_col contains values other than 0 or 1

Notes

DataFrame must be pre-sorted in the desired order
Segment markers must be 0 or 1 (1 indicates segment start)
Each group’s first row should have segment_marker = 1
Uses pandas cumsum and groupby for vectorized computation

Example

>>> handler = PandasHandler()
>>> # Simple cumulative flow tracking
>>> data = pd.DataFrame({
...     'initial': [100, 0, 0, 150, 0],
...     'inflow': [10, 20, 15, 10, 25],
...     'outflow': [5, 30, 10, 20, 15],
...     'new_segment': [1, 0, 0, 1, 0]
... })
>>> result = handler.calculate_cumulative_balance(
...     data,
...     initial_col='initial',
...     inflow_col='inflow',
...     outflow_col='outflow',
...     segment_marker_col='new_segment'
... )
>>> print(result[['balance']])

>>> # Multi-entity tracking with groups
>>> data = pd.DataFrame({
...     'entity': ['A', 'A', 'A', 'B', 'B', 'B'],
...     'period': [1, 2, 3, 1, 2, 3],
...     'initial': [100, 0, 0, 200, 0, 0],
...     'additions': [50, 30, 40, 60, 70, 80],
...     'deductions': [20, 45, 35, 50, 65, 75],
...     'period_start': [1, 0, 0, 1, 0, 0]
... })
>>> result = handler.calculate_cumulative_balance(
...     data,
...     initial_col='initial',
...     inflow_col='additions',
...     outflow_col='deductions',
...     segment_marker_col='period_start',
...     group_cols='entity'
... )

Algorithm Details:

The method uses a segment-based approach where: 1. Net flow (delta) = inflow - outflow 2. Segments are identified by cumulative sum of markers 3. Initial value propagates through each segment 4. Cumulative delta within segments gives final balance

This achieves O(n) complexity without explicit loops.

convert_list_to_tuple_in_df(df) → object[source]

Convert list values to tuples in DataFrame object columns.

Preserves None values and other data types unchanged. Only processes columns with object dtype that contain list values.

Parameters:: df – Input DataFrame to process (pandas.DataFrame)
Returns:: DataFrame with list values converted to tuples
Raises:: TypeError – If input is not a pandas DataFrame

Example

>>> handler = PandasHandler()
>>> df = pd.DataFrame({'col1': [[1, 2], [3, 4]], 'col2': ['a', 'b']})
>>> result = handler.convert_list_to_tuple_in_df(df)
>>> print(result['col1'].iloc[0])
(1, 2)

help() → None[source]

Display help information for pandas DataFrame utilities.

Shows installation instructions, available methods, and usage examples.

rebalance_negative_values(df, group_cols: str | List[str], value_col: str, result_col: str = 'rebalanced', use_offset: bool = False, keep_intermediate: bool = False) → object[source]

Redistribute values within groups to eliminate negatives while preserving totals.

This function redistributes group totals to eliminate negative values while maintaining proportional relationships among positive values. It implements a weighted redistribution algorithm with optional offset handling.

Parameters:

df – Input DataFrame containing values to rebalance
group_cols – Column(s) defining groups for rebalancing (str or list of str)
value_col – Column containing values to rebalance (may include negatives)
result_col – Name for the result column containing rebalanced values. Defaults to ‘rebalanced’.
use_offset – If True, use offset method to include negative values in weight calculation. If False, only positive values contribute to redistribution weights. Defaults to False.
keep_intermediate – If True, keep intermediate calculation columns. Defaults to False.

Returns:

DataFrame with rebalanced values in the result column

Raises:

TypeError – If input is not a pandas DataFrame
KeyError – If specified columns don’t exist in DataFrame

Notes

Groups with negative total sum cannot be rebalanced (values unchanged)
Groups where all values are equal receive equal distribution
The offset method shifts all values by the minimum to calculate weights

Example

>>> handler = PandasHandler()
>>> data = pd.DataFrame({
...     'group': ['A', 'A', 'A', 'B', 'B'],
...     'value': [10, -5, 15, 20, -10]
... })
>>> result = handler.rebalance_negative_values(data, 'group', 'value')
>>> print(result[['group', 'value', 'rebalanced']])

show_all_na(df) → object[source]

Extract rows and columns that contain NA values.

Returns a subset of the original DataFrame containing only: - Rows that have at least one NA value - Columns that have at least one NA value

Parameters:: df – Input DataFrame to analyze (pandas.DataFrame)
Returns:: Subset containing only rows and columns with NA values
Raises:: TypeError – If input is not a DataFrame

Example

>>> handler = PandasHandler()
>>> import numpy as np
>>> df = pd.DataFrame({'A': [1, np.nan], 'B': [3, 4], 'C': [np.nan, 6]})
>>> na_subset = handler.show_all_na(df)

show_all_na_or_empty_columns(df, exclude_cols: List[str] | None = None) → object[source]

Find columns containing NA values or empty strings.

Identifies columns that have NA values or empty strings (‘’) in any row, with option to exclude specific columns from the check.

Parameters:

df – Input DataFrame to analyze (pandas.DataFrame)
exclude_cols – Columns to exclude from NA/empty check. If None, uses default_exclude_cols from initialization.

Returns:

All rows, but only columns that contain NA values or empty strings

Raises:

TypeError – If input is not a DataFrame

Example

>>> handler = PandasHandler(default_exclude_cols=['id'])
>>> df = pd.DataFrame({'A': [1, 2], 'B': ['', 'x'], 'id': ['y', 'z']})
>>> problem_cols = handler.show_all_na_or_empty_columns(df)

show_all_na_or_empty_rows(df, exclude_cols: List[str] | None = None) → object[source]

Find rows containing NA values or empty strings.

Identifies rows that have NA values or empty strings (‘’) in any column, with option to exclude specific columns from the check.

Parameters:

df – Input DataFrame to analyze (pandas.DataFrame)
exclude_cols – Columns to exclude from NA/empty check. If None, uses default_exclude_cols from initialization.

Returns:

Rows containing NA values or empty strings, with all original columns

Raises:

TypeError – If input is not a DataFrame

Example

>>> handler = PandasHandler(default_exclude_cols=['id'])
>>> df = pd.DataFrame({'A': [1, ''], 'B': [3, 4], 'id': ['x', 'y']})
>>> problem_rows = handler.show_all_na_or_empty_rows(df)

show_col_names(dfs: List, print_result: bool = False) → Tuple[Dict[str, List[str]], object][source]

Compare column names across multiple DataFrames.

Creates a comprehensive view of all columns present in the input DataFrames, showing which columns exist in each DataFrame.

Parameters:

dfs – List of DataFrames to compare (List[pandas.DataFrame])
print_result – Whether to print the comparison table. Defaults to False.

Returns:

Dictionary mapping DataFrame names to column lists
Comparison DataFrame showing column presence across DataFrames

Return type:

Tuple containing

Raises:

TypeError – If input is not a list of DataFrames

Example

>>> handler = PandasHandler()
>>> df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
>>> df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})
>>> col_dict, comparison_df = handler.show_col_names([df1, df2])

verify_proportional_allocation(df, group_cols: str | List[str], total_col: str, result_col: str) → object[source]

Verify proportional allocation results.

Checks that the sum of allocated values per group equals the expected total for each group. Useful for validating results from allocate_integer_proportional.

Parameters:

df – DataFrame containing allocation results
group_cols – Column(s) defining groups (str or list of str)
total_col – Column containing expected group totals
result_col – Column containing allocated values to verify

Returns:

Group identifier(s)
Expected total (first value of total_col in group)
Actual total (sum of result_col in group)
Validity flag (True if expected == actual)

Return type:

DataFrame with verification results showing

Raises:

TypeError – If input is not a pandas DataFrame
KeyError – If specified columns don’t exist in DataFrame

Example

>>> handler = PandasHandler()
>>> # Single group column
>>> result = handler.allocate_integer_proportional(
...     df, 'group', 'total', 'weight'
... )
>>> verification = handler.verify_proportional_allocation(
...     result, 'group', 'total', 'allocated'
... )
>>> # Check for any mismatches
>>> if not verification['is_valid'].all():
...     print("Allocation errors found!")
...     print(verification[~verification['is_valid']])

>>> # Multiple group columns
>>> result = handler.allocate_integer_proportional(
...     df, ['region', 'category'], 'total', 'weight'
... )
>>> verification = handler.verify_proportional_allocation(
...     result, ['region', 'category'], 'total', 'allocated'
... )