woodwork.table_accessor.WoodworkTableAccessor.describe_dict

WoodworkTableAccessor.describe_dict(include: Sequence[Union[str, LogicalType]] = None, callback: Callable[[int, int, int, str, float], Any] = None, results_callback: Callable[[DataFrame, Series], Any] = None, extra_stats: bool = False, bins: int = 10, top_x: int = 10, recent_x: int = 10) → Dict[str, dict]

Calculates statistics for data contained in the DataFrame.

Parameters:

include (list[str or LogicalType], optional) – Filter for which columns to include in the returned statistics. Can be a list of column names, semantic tags, logical types, or any combination of the three. It follows the broadest specification, favoring logical types, then semantic tags, then column names. If no matching columns are found, an empty dictionary will be returned.
callback (callable, optional) –
Function to be called with incremental updates. Has the following parameters:
- update (int): change in progress since last call
- progress (int): the progress so far in the calculations
- total (int): the total number of calculations to do
- unit (str): unit of measurement for progress/total
- time_elapsed (float): total time in seconds elapsed since start of call
results_callback (callable, optional) –
Function to be called with intermediate results. Has the following parameters:
- results_so_far (pd.DataFrame): the full DataFrame calculated so far
- most_recent_calculation (pd.Series): the calculations for the most recent column
extra_stats (bool) – If True, will calculate a histogram for numeric columns, top values for categorical columns, and value counts for the most recent values in datetime columns. Will also calculate value counts within the range of values present for integer columns if that range is less than or equal to the number of bins used to compute the histogram. Output can be controlled by the bins, top_x and recent_x parameters. Defaults to False.
bins (int) – Number of bins to use when calculating histogram for numeric columns. Defaults to 10. Will be ignored unless extra_stats=True.
top_x (int) – Number of items to return when getting the most frequently occurring values for categorical columns. Defaults to 10. Will be ignored unless extra_stats=True.
recent_x (int) – Number of values to return when calculating value counts for the most recent dates in datetime columns. Defaults to 10. Will be ignored unless extra_stats=True.
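
The two callback parameters are easiest to understand from their call signatures. The following is a minimal sketch, not part of woodwork itself: the names report_progress, collect_result, progress_log and collected are hypothetical, and the sample values passed in the last line (including the unit string) are made up for illustration. Only the argument order is taken from the parameter descriptions above.

```python
from typing import Any, List, Tuple

# Hypothetical containers used only for this illustration.
progress_log: List[Tuple[int, int, str]] = []
collected: List[Any] = []


def report_progress(update: int, progress: int, total: int,
                    unit: str, time_elapsed: float) -> None:
    """Matches the documented `callback` signature: print one progress line."""
    # Guard against a zero total so the percentage is always defined.
    percent = 100.0 * progress / total if total else 100.0
    progress_log.append((progress, total, unit))
    print(f"{progress}/{total} {unit} ({percent:.0f}%) after {time_elapsed:.2f}s")


def collect_result(results_so_far: Any, most_recent_calculation: Any) -> None:
    """Matches the documented `results_callback` signature.

    Documented argument types are pd.DataFrame and pd.Series; this sketch
    only stores the most recent per-column result and does not inspect it.
    """
    collected.append(most_recent_calculation)


# Calling the progress callback by hand with made-up values, to show its shape.
report_progress(1, 1, 4, "calculations", 0.01)
```

Assuming a DataFrame df with Woodwork initialized, these functions would be passed as df.ww.describe_dict(callback=report_progress, results_callback=collect_result).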

Returns:

A dictionary with one key per column in the data, or per column matching the logical types, semantic tags or column names specified in include. Each key maps to a dictionary of the relevant statistics for that column.

Return type:

Dict[str, dict]