Woodwork provides methods on your DataFrames that use the typing information Woodwork stores to help you better understand your data.
Follow along to learn how to use Woodwork’s statistical methods on a DataFrame of retail data and to see the full capabilities of these functions.
[1]:
import pandas as pd
from woodwork.demo import load_retail

df = load_retail()
df.ww
Use df.ww.describe() to calculate statistics for the columns in a DataFrame, returning the results in the format of a pandas DataFrame with the relevant calculations done for each column.
[2]:
df.ww.describe()
There are a few things to note in the DataFrame above:
The Woodwork index, order_product_id, is not included.
Each column’s typing information is provided according to Woodwork’s typing system.
Any statistic that can’t be calculated for a column, such as num_false on a Datetime column, is filled with NaN.
Null values are not counted in any of the calculations other than nunique.
Use df.ww.value_counts() to calculate the most frequent values for each column that has category as a standard tag. This returns a dictionary that maps each column name to a list of dictionaries sorted by descending count. Each of those dictionaries contains a value and its count.
[3]:
df.ww.value_counts()
{'order_product_id': [{'value': 401603, 'count': 1}, {'value': 133859, 'count': 1}, {'value': 133861, 'count': 1}, {'value': 133862, 'count': 1}, {'value': 133863, 'count': 1}, {'value': 133864, 'count': 1}, {'value': 133865, 'count': 1}, {'value': 133866, 'count': 1}, {'value': 133867, 'count': 1}, {'value': 133868, 'count': 1}], 'order_id': [{'value': '576339', 'count': 542}, {'value': '579196', 'count': 533}, {'value': '580727', 'count': 529}, {'value': '578270', 'count': 442}, {'value': '573576', 'count': 435}, {'value': '567656', 'count': 421}, {'value': '567183', 'count': 392}, {'value': '575607', 'count': 377}, {'value': '571441', 'count': 364}, {'value': '570488', 'count': 353}], 'product_id': [{'value': '85123A', 'count': 2065}, {'value': '22423', 'count': 1894}, {'value': '85099B', 'count': 1659}, {'value': '47566', 'count': 1409}, {'value': '84879', 'count': 1405}, {'value': '20725', 'count': 1346}, {'value': '22720', 'count': 1224}, {'value': 'POST', 'count': 1196}, {'value': '22197', 'count': 1110}, {'value': '23203', 'count': 1108}], 'customer_name': [{'value': 'Mary Dalton', 'count': 7812}, {'value': 'Dalton Grant', 'count': 5898}, {'value': 'Jeremy Woods', 'count': 5128}, {'value': 'Jasmine Salazar', 'count': 4459}, {'value': 'James Robinson', 'count': 2759}, {'value': 'Bryce Stewart', 'count': 2478}, {'value': 'Vanessa Sanchez', 'count': 2085}, {'value': 'Laura Church', 'count': 1853}, {'value': 'Kelly Alvarado', 'count': 1667}, {'value': 'Ashley Meyer', 'count': 1640}], 'country': [{'value': 'United Kingdom', 'count': 356728}, {'value': 'Germany', 'count': 9480}, {'value': 'France', 'count': 8475}, {'value': 'EIRE', 'count': 7475}, {'value': 'Spain', 'count': 2528}, {'value': 'Netherlands', 'count': 2371}, {'value': 'Belgium', 'count': 2069}, {'value': 'Switzerland', 'count': 1877}, {'value': 'Portugal', 'count': 1471}, {'value': 'Australia', 'count': 1258}]}
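The shape of this return value can be reproduced on a toy column with the standard library alone. The sketch below is only an illustration of the output format, not Woodwork’s implementation; the helper name top_value_counts and its top_n parameter are hypothetical, with top_n chosen to mirror the ten entries per column shown above.

```python
from collections import Counter


def top_value_counts(values, top_n=10):
    """Return the most frequent values as a list of {'value', 'count'} dicts.

    Hypothetical helper: it mimics the output format only.
    """
    counts = Counter(values)
    # most_common() sorts by descending count
    return [
        {"value": value, "count": count}
        for value, count in counts.most_common(top_n)
    ]


countries = [
    "United Kingdom", "Germany", "United Kingdom",
    "France", "Germany", "United Kingdom",
]

# One entry per column name, as in the dictionary above
result = {"country": top_value_counts(countries, top_n=2)}
print(result)
```

Each column name maps to its own list, so a single column’s counts can be read with ordinary dictionary indexing, for example result["country"].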
df.ww.mutual_information calculates the mutual information between all pairs of relevant columns. Certain types, like strings, can’t have mutual information calculated.
The mutual information between columns A and B can be understood as the amount of knowledge you gain about column A from knowing the values of column B. The more mutual information there is between A and B, the less uncertainty remains about A once B is known, and vice versa.
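To make this definition concrete, here is a minimal sketch of the textbook mutual information formula for two discrete columns, using only the Python standard library. It is not Woodwork’s implementation, and the function name mutual_information_nats is hypothetical; the scores Woodwork reports may be scaled or normalized differently.

```python
import math
from collections import Counter


def mutual_information_nats(xs, ys):
    """Textbook mutual information (in nats) between two discrete sequences.

    Hypothetical helper for illustration, not part of Woodwork.
    """
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))

    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        # Each observed pair contributes p(x, y) * log(p(x, y) / (p(x) * p(y)))
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


a = [0, 0, 1, 1]

# Knowing b tells you everything about a: the score equals the entropy of a
identical = mutual_information_nats(a, [0, 0, 1, 1])

# Knowing b tells you nothing about a: the score is zero
independent = mutual_information_nats(a, [0, 1, 0, 1])

print(identical, independent)
```

The two toy cases bracket the behavior described above: a column paired with a copy of itself leaves no uncertainty, while a column paired with an unrelated one shares no information.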
[4]:
df.ww.mutual_information()
df.ww.mutual_information provides various parameters for tuning the mutual information calculation.
num_bins - To calculate mutual information on continuous data, Woodwork bins numeric data into categories. This parameter sets the number of bins used to categorize the data.
Defaults to 10 bins.
The more bins there are, the more distinct categories a numeric column is split into. The number of bins used should accurately portray the spread of the data.
nrows - If nrows is set to a value below the number of rows in the DataFrame, that many rows are randomly sampled from the underlying data.
Defaults to using all the available rows.
Decreasing the number of rows can speed up the mutual information calculation on a DataFrame with many rows, but make sure the sample is large enough to accurately portray the data.
include_index - If set to True and an index is defined with a logical type that is valid for mutual information, the index column is included in the mutual information output.
Defaults to False.
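The effect of num_bins and nrows can be illustrated on a toy column with the standard library. The sketch below assumes equal-width bins and a seeded random sample; both are assumptions made for illustration, and Woodwork’s internal binning and sampling strategy may differ. The helper names equal_width_bins and sample_rows are hypothetical.

```python
import random


def equal_width_bins(values, num_bins=10):
    """Assign each numeric value to one of num_bins equal-width bins.

    Hypothetical helper; equal-width binning is an assumption here.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # A constant column falls into a single bin
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum value would land in bin num_bins, so cap it at the last bin
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


def sample_rows(rows, nrows=None, seed=0):
    """Randomly sample nrows rows, or keep every row if nrows is not smaller."""
    if nrows is None or nrows >= len(rows):
        return list(rows)
    return random.Random(seed).sample(rows, nrows)


prices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

coarse = equal_width_bins(prices, num_bins=2)  # two broad categories
fine = equal_width_bins(prices, num_bins=5)    # five narrower categories
subset = sample_rows(prices, nrows=4)          # fewer rows, faster to process

print(coarse)
print(fine)
print(len(subset))
```

With more bins, the same ten prices are split into more categories, which is why the bin count changes the mutual information score for numeric columns.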
Now that you understand the parameters, explore changing the number of bins. This only affects the numeric columns, quantity and unit_price. Increase the number of bins from 10 to 50, showing only the affected columns.
[5]:
# Baseline with the default 10 bins; keep only pairs involving the numeric columns
mi = df.ww.mutual_information()
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
[6]:
# Same calculation with 50 bins for comparison
mi = df.ww.mutual_information(num_bins=50)
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
To include the index column in the mutual information output, run the calculation with include_index=True.
[7]:
mi = df.ww.mutual_information(include_index=True)
mi[mi['column_1'].isin(['order_product_id']) | mi['column_2'].isin(['order_product_id'])]