Woodwork provides DataTable methods that use the typing information stored in a DataTable to help users better understand their data.
Let’s walk through how to use describe and get_mutual_information on a retail DataTable to see the full capabilities of both functions.
[1]:
import pandas as pd
from woodwork import DataTable
from woodwork.demo import load_retail

dt = load_retail()
dt.types
We use dt.describe() to calculate statistics for each Data Column in a DataTable. The results are returned as a Pandas DataFrame.
[2]:
dt.describe()
There are a few things to note in the above dataframe:
The DataTable’s index, order_product_id, is not included.
Each Data Column’s typing information is given according to Woodwork’s typing system.
Any statistic that cannot be calculated for a Data Column, say num_false on a Datetime, is filled with NaN.
Null values are not counted in any of the calculations other than nunique.
dt.get_mutual_information calculates the mutual information between all pairs of relevant Data Columns. Mutual information cannot be calculated for certain types, such as datetimes or strings.
The mutual information between columns A and B can be understood as the amount of knowledge we gain about column A from knowing the values of column B. The more mutual information there is between A and B, the less uncertainty remains about A once B is known, and vice versa.
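To make the definition above concrete, here is a minimal sketch that computes mutual information between two columns of discrete values directly from their counts. It is an illustration only: the function name and the toy columns are hypothetical, and Woodwork's own calculation may differ (for example, it may normalize the result).

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length sequences of discrete values.

    Hypothetical helper for illustration; not part of Woodwork.
    """
    if len(a) != len(b):
        raise ValueError("columns must have the same length")
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    count_ab = Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in count_ab.items():
        # Each observed pair contributes p(x, y) * log(p(x, y) / (p(x) * p(y))).
        mi += (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
    return mi


# Toy columns, invented for this example.
country = ["UK", "UK", "FR", "FR"]
currency = ["GBP", "GBP", "EUR", "EUR"]  # fully determined by country
coin_flip = ["H", "T", "H", "T"]         # carries no information about country

mi_dependent = mutual_information(country, currency)
mi_independent = mutual_information(country, coin_flip)
```

In this sketch, knowing currency removes all uncertainty about country, so mi_dependent equals the entropy of country (log 2 in nats), while mi_independent is zero because coin_flip says nothing about country.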
If we call dt.get_mutual_information(), we’ll see that order_date is excluded from the resulting dataframe.
[3]:
dt.get_mutual_information()
dt.get_mutual_information provides two parameters for tuning the mutual information calculation.
num_bins - To calculate mutual information on continuous data, numeric data is binned into categories. This parameter sets the number of bins used to categorize the data.
Defaults to 10 bins.
The more bins there are, the more variety a column will have. The number of bins should be chosen so that it accurately reflects the spread of the data.
nrows - If nrows is set to a value below the number of rows in the DataTable, that many rows are randomly sampled from the underlying data.
Defaults to using all available rows.
Decreasing the number of rows can speed up the mutual information calculation on a DataTable with many rows, though care should be taken that the sample is large enough to accurately represent the data.
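The two ideas above, binning numeric values and sampling rows, can be sketched in plain Python. This is only an illustration under stated assumptions: it uses equal-width bins and a seeded random sample, the helper names bin_values and sample_rows are hypothetical, and Woodwork's actual binning and sampling strategy is not described here and may differ.

```python
import random


def bin_values(values, num_bins=10):
    """Assign each numeric value to one of num_bins equal-width bins (0 .. num_bins - 1).

    Hypothetical helper; equal-width binning is an assumption of this sketch.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum value would fall just past the last bin, so clamp it into it.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


def sample_rows(rows, nrows=None, seed=0):
    """Return all rows, or a random sample of nrows rows when nrows is smaller than the data."""
    if nrows is None or nrows >= len(rows):
        return list(rows)
    return random.Random(seed).sample(rows, nrows)


# Toy values, invented for this example.
prices = [0.5, 1.25, 2.0, 3.75, 4.0, 9.5, 10.0]
coarse = bin_values(prices, num_bins=2)   # [0, 0, 0, 0, 0, 1, 1]
fine = bin_values(prices, num_bins=10)    # [0, 0, 1, 3, 3, 9, 9]
subset = sample_rows(list(range(1000)), nrows=100)
```

With 2 bins the seven prices collapse into two categories, while 10 bins separate them into four, which is the extra "variety" that a larger number of bins introduces.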
Now we’ll explore changing the number of bins. Note that this only affects the numeric Data Columns, quantity and unit_price. We’re going to increase the number of bins from 10 to 50 and show only the impacted columns.
[4]:
mi = dt.get_mutual_information()
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
[5]:
mi = dt.get_mutual_information(num_bins=50)
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]