woodwork.table_accessor.WoodworkTableAccessor.dependence

WoodworkTableAccessor.dependence(measures='all', num_bins=10, nrows=None, include_index=False, include_time_index=False, callback=None, extra_stats=False, min_shared=25, random_seed=0, max_nunique=6000, target_col=None)

Calculates dependence measures between all pairs of columns in the DataFrame that support measuring dependence. Supports boolean, categorical, datetime, and numeric data. Call woodwork.utils.get_valid_mi_types and woodwork.utils.get_valid_pearson_types for complete lists of supported Logical Types.

Parameters:

dataframe (pd.DataFrame) – Data containing Woodwork typing information from which to calculate dependence.
measures (list or str) –
Which dependence measures to calculate. A list of measures can be provided to calculate multiple measures at once. Valid measure strings:
- “pearson”: calculates the Pearson correlation coefficient
- “mutual_info”: calculates the mutual information between columns
- “spearman”: calculates the Spearman correlation coefficient
- “max”: max(abs(pearson), abs(spearman), mutual) for each pair of columns
- “all”: includes columns for “pearson”, “mutual_info”, “spearman”, and “max”
num_bins (int) – Determines number of bins to use for converting numeric features into categorical. Defaults to 10. Pearson calculation does not use binning.
nrows (int) – The number of rows to sample when determining dependence. If specified, samples the desired number of rows from the data. Defaults to using all rows.
include_index (bool) – If True, the column specified as the index will be included as long as its LogicalType is valid for measuring dependence. If False, the index column will not be considered. Defaults to False.
include_time_index (bool) – If True, the column specified as the time index will be included for measuring dependence. If False, the time index column will not be considered. Defaults to False.
callback (callable, optional) –
Function to be called with incremental updates. Has the following parameters:
- update (int): change in progress since last call
- progress (int): the progress so far in the calculations
- total (int): the total number of calculations to do
- unit (str): unit of measurement for progress/total
- time_elapsed (float): total time in seconds elapsed since start of call
extra_stats (bool) – If True, an additional “shared_rows” column recording the number of shared non-null rows for each column pair will be included in the returned DataFrame. If the “max” measure is being used, a “measure_used” column will also be added that records whether Pearson or mutual information was the maximum dependence for a particular row. Defaults to False.
min_shared (int) – The number of shared non-null rows needed to calculate dependence for a column pair. Pairs with fewer shared rows than this are considered too sparse to measure accurately and return a NaN value. Must be non-negative. Defaults to 25.
random_seed (int) – Seed for the random number generator. Defaults to 0.
max_nunique (int) – The maximum number of unique values for large categorical columns (> 800 unique values). Categorical columns will be dropped until this number is met or until there is only one large categorical column. Defaults to 6000.
target_col (str) – The column name of the target. If provided, dependence will only be calculated between the other columns and this target column. The target column will be column_2 in the returned result. Defaults to None.
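As an illustration of the callback parameter described above, here is a minimal sketch of a callable that accepts the five documented arguments and records one progress line per update. The helper name make_progress_logger, the message format, and the unit string "calculations" used in the simulated calls are assumptions made for this example, not values confirmed by Woodwork.

```python
def make_progress_logger(messages):
    """Return a callback that appends one formatted progress line per update.

    `make_progress_logger` is a hypothetical helper for this example only.
    """
    def report_progress(update, progress, total, unit, time_elapsed):
        # Guard against a zero total so the percentage is always defined.
        percent = 100 * progress / total if total else 0.0
        messages.append(
            f"+{update} -> {progress}/{total} {unit} "
            f"({percent:.0f}%) after {time_elapsed:.1f}s"
        )
    return report_progress


log = []
callback = make_progress_logger(log)

# Simulated updates standing in for the calls dependence() would make;
# the unit string "calculations" is an assumed placeholder.
callback(update=2, progress=2, total=4, unit="calculations", time_elapsed=0.5)
callback(update=2, progress=4, total=4, unit="calculations", time_elapsed=1.5)
print("\n".join(log))

# With a Woodwork-initialized DataFrame `df`, the same callable would be passed as:
#     df.ww.dependence(callback=callback)
```

Collecting messages in a list rather than printing inside the callback keeps the callable cheap and lets the caller decide how to display progress.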

Returns:

A DataFrame with the columns column_1, column_2, and one column for each of the specified dependence measures. The rows are sorted in descending order by the first specified measure. Mutual information values range from 0 (no dependence) to 1 (perfect dependence). Pearson and Spearman values range from -1 to 1, with 0 still indicating no dependence. Additional columns will be included if extra_stats is True.
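Because Pearson and Spearman values can be negative while mutual information cannot, the “max” measure compares absolute correlations against mutual information. The following is a rough sketch of that formula for a single column pair, not Woodwork's implementation. The function name max_dependence, the returned label, and the choice to skip NaN inputs are assumptions made for illustration.

```python
import math


def max_dependence(pearson, spearman, mutual_info):
    """Return (value, measure_name) for max(abs(pearson), abs(spearman), mutual_info).

    Hypothetical helper: NaN inputs are skipped, which is an assumption of
    this sketch rather than documented library behavior.
    """
    candidates = {
        "pearson": abs(pearson),
        "spearman": abs(spearman),
        "mutual_info": mutual_info,
    }
    # Keep only the measures that produced a usable number.
    valid = {name: value for name, value in candidates.items() if not math.isnan(value)}
    if not valid:
        return math.nan, None
    best = max(valid, key=valid.get)
    return valid[best], best


# A strong negative correlation still counts as strong dependence.
print(max_dependence(-0.9, -0.7, 0.4))

# When only mutual information is available, it is the one that is used.
print(max_dependence(math.nan, math.nan, 0.3))
```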

Return type:

pd.DataFrame