Woodwork provides methods on DataTable that use the typing information inherent in a DataTable to help you better understand your data.
Follow along to learn how to use describe and mutual_information on a retail DataTable so that you can see the full capabilities of the functions.
[1]:
import pandas as pd
from woodwork import DataTable
from woodwork.demo import load_retail

dt = load_retail()
dt
Use dt.describe() to calculate statistics for the DataColumns in a DataTable. The result is a pandas DataFrame that contains the relevant calculations for each DataColumn.
[2]:
dt.describe()
There are a few things to note in the above dataframe:
The DataTable’s index, order_product_id, is not included.
Each DataColumn’s typing information is provided according to Woodwork’s typing system.
Any statistic that can’t be calculated for a DataColumn, such as num_false on a Datetime, is filled with NaN.
Null values are not counted in any of the calculations other than nunique.
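The NaN-filling behavior in the notes above resembles what plain pandas does in DataFrame.describe(include="all"). The sketch below uses only pandas on a small made-up frame (the column values are illustrative, not taken from the retail data). It is an analogy for reading the output, not Woodwork's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the retail dataset.
df = pd.DataFrame({
    "quantity": [6.0, 8.0, np.nan, 2.0],
    "country": ["United Kingdom", "France", "France", None],
})

# include="all" reports every statistic for every column and fills in
# NaN where a statistic does not apply to that column's type.
summary = df.describe(include="all")

print(summary.loc["mean"])    # defined for quantity, NaN for country
print(summary.loc["unique"])  # defined for country, NaN for quantity

# Missing values are skipped: only the three non-null quantities are averaged.
print(df["quantity"].mean())
```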
Use dt.value_counts() to calculate the most frequent values for each DataColumn that has category as a standard tag. This returns a dictionary where each DataColumn is associated with a list of dictionaries sorted by descending count. Each dictionary contains value and count.
[3]:
dt.value_counts()
{'order_product_id': [{'value': 401603, 'count': 1}, {'value': 133859, 'count': 1}, {'value': 133861, 'count': 1}, {'value': 133862, 'count': 1}, {'value': 133863, 'count': 1}, {'value': 133864, 'count': 1}, {'value': 133865, 'count': 1}, {'value': 133866, 'count': 1}, {'value': 133867, 'count': 1}, {'value': 133868, 'count': 1}],
 'order_id': [{'value': '576339', 'count': 542}, {'value': '579196', 'count': 533}, {'value': '580727', 'count': 529}, {'value': '578270', 'count': 442}, {'value': '573576', 'count': 435}, {'value': '567656', 'count': 421}, {'value': '567183', 'count': 392}, {'value': '575607', 'count': 377}, {'value': '571441', 'count': 364}, {'value': '570488', 'count': 353}],
 'product_id': [{'value': '85123A', 'count': 2065}, {'value': '22423', 'count': 1894}, {'value': '85099B', 'count': 1659}, {'value': '47566', 'count': 1409}, {'value': '84879', 'count': 1405}, {'value': '20725', 'count': 1346}, {'value': '22720', 'count': 1224}, {'value': 'POST', 'count': 1196}, {'value': '22197', 'count': 1110}, {'value': '23203', 'count': 1108}],
 'customer_name': [{'value': 'Mary Dalton', 'count': 7812}, {'value': 'Dalton Grant', 'count': 5898}, {'value': 'Jeremy Woods', 'count': 5128}, {'value': 'Jasmine Salazar', 'count': 4459}, {'value': 'James Robinson', 'count': 2759}, {'value': 'Bryce Stewart', 'count': 2478}, {'value': 'Vanessa Sanchez', 'count': 2085}, {'value': 'Laura Church', 'count': 1853}, {'value': 'Kelly Alvarado', 'count': 1667}, {'value': 'Ashley Meyer', 'count': 1640}],
 'country': [{'value': 'United Kingdom', 'count': 356728}, {'value': 'Germany', 'count': 9480}, {'value': 'France', 'count': 8475}, {'value': 'EIRE', 'count': 7475}, {'value': 'Spain', 'count': 2528}, {'value': 'Netherlands', 'count': 2371}, {'value': 'Belgium', 'count': 2069}, {'value': 'Switzerland', 'count': 1877}, {'value': 'Portugal', 'count': 1471}, {'value': 'Australia', 'count': 1258}]}
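To make the shape of this result concrete, here is a minimal sketch that builds the same list-of-dictionaries structure for a single column with plain pandas. The helper name top_value_counts and its top_n parameter are hypothetical and not part of Woodwork. The default of 10 is an assumption based on the ten entries per column in the output above.

```python
import pandas as pd

def top_value_counts(series, top_n=10):
    """Return [{'value': ..., 'count': ...}, ...] sorted by descending count.

    Hypothetical helper for illustration; not part of Woodwork.
    """
    counts = series.value_counts().head(top_n)
    return [{"value": value, "count": int(count)} for value, count in counts.items()]

# Made-up sample, not the retail data.
countries = pd.Series(
    ["United Kingdom", "France", "United Kingdom", "Germany", "United Kingdom", "France"]
)
print(top_value_counts(countries, top_n=2))
# [{'value': 'United Kingdom', 'count': 3}, {'value': 'France', 'count': 2}]
```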
dt.mutual_information calculates the mutual information between all pairs of relevant DataColumns. Mutual information can’t be calculated for certain types, such as strings.
The mutual information between columns A and B can be understood as the amount of knowledge you gain about column A when you have the values of column B. The more mutual information there is between A and B, the less uncertainty there is about A once B is known, and vice versa.
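As a rough illustration of the idea, the sketch below computes the textbook plug-in estimate of mutual information, the sum over value pairs of p(a, b) * log(p(a, b) / (p(a) * p(b))), for two discrete columns using pandas and NumPy. The helper discrete_mutual_information and the toy columns are hypothetical. Woodwork's own calculation may estimate or normalize the quantity differently, so the numbers are not expected to match dt.mutual_information().

```python
import numpy as np
import pandas as pd

def discrete_mutual_information(a, b):
    """Plug-in estimate of mutual information (in nats) between two discrete columns.

    Hypothetical helper for illustration; not Woodwork's implementation.
    """
    # Joint distribution p(a, b) as a table of relative frequencies.
    joint = pd.crosstab(np.asarray(a), np.asarray(b), normalize=True).to_numpy()
    p_a = joint.sum(axis=1, keepdims=True)  # marginal p(a)
    p_b = joint.sum(axis=0, keepdims=True)  # marginal p(b)
    nonzero = joint > 0                     # skip pairs that never occur
    ratio = joint[nonzero] / (p_a * p_b)[nonzero]
    return float(np.sum(joint[nonzero] * np.log(ratio)))

# Made-up columns, not the retail data.
weather = pd.Series(["sun", "rain", "sun", "rain"])
umbrella = pd.Series(["no", "yes", "no", "yes"])  # fully determined by weather
coin = pd.Series(["h", "h", "t", "t"])            # unrelated to weather

print(discrete_mutual_information(weather, umbrella))  # about 0.693, i.e. log(2)
print(discrete_mutual_information(weather, coin))      # 0.0
```

In the first pair, knowing one column removes all uncertainty about the other, so the mutual information equals the full entropy of the column. In the second pair, the columns are independent and the mutual information is zero.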
[4]:
dt.mutual_information()
dt.mutual_information provides two parameters for tuning the mutual information calculation.
num_bins - To calculate mutual information on continuous data, Woodwork bins numeric data into categories. This parameter sets the number of bins used to categorize the data.
Defaults to 10 bins.
The more bins there are, the more distinct values a column will have. Choose a number of bins that accurately reflects the spread of the data.
nrows - If nrows is set to a value below the number of rows in the DataTable, that number of rows is randomly sampled from the underlying data.
Defaults to using all the available rows.
Decreasing the number of rows can speed up the mutual information calculation on a DataTable with many rows, but make sure the sample is large enough to represent the data accurately.
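The two parameters can be pictured as a preprocessing step: optionally sample rows, then turn each numeric column into a small number of categories. The sketch below is a minimal illustration with plain pandas, assuming equal-width bins via pandas.cut. The helper bin_and_sample, its random_state argument, and the synthetic unit_price values are hypothetical, and Woodwork may bin or sample differently.

```python
import numpy as np
import pandas as pd

def bin_and_sample(frame, numeric_columns, num_bins=10, nrows=None, random_state=0):
    """Hypothetical preprocessing sketch: sample rows, then bin numeric columns."""
    if nrows is not None and nrows < len(frame):
        # Randomly sample nrows rows from the underlying data.
        frame = frame.sample(n=nrows, random_state=random_state)
    binned = frame.copy()
    for column in numeric_columns:
        # labels=False returns the integer index of the equal-width bin for each value.
        binned[column] = pd.cut(binned[column], bins=num_bins, labels=False)
    return binned

# Synthetic prices, not the retail data.
rng = np.random.default_rng(0)
prices = pd.DataFrame({"unit_price": rng.uniform(0, 100, size=1000)})

coarse = bin_and_sample(prices, ["unit_price"], num_bins=10)
fine = bin_and_sample(prices, ["unit_price"], num_bins=50, nrows=200)

print(coarse["unit_price"].nunique())  # at most 10 distinct bins
print(fine["unit_price"].nunique())    # at most 50 distinct bins
print(len(fine))                       # 200 sampled rows
```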
Now that you understand the parameters, you can explore changing the number of bins. Note that this only affects the numeric DataColumns quantity and unit_price. Increase the number of bins from 10 to 50, showing only the impacted columns.
[5]:
mi = dt.mutual_information()
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
[6]:
mi = dt.mutual_information(num_bins=50)
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]