Gain Statistical Insights into Your DataTable

Woodwork provides methods on DataTable that use the typing information stored in a DataTable to help users better understand their data.

Let’s walk through how to use describe, value_counts, and get_mutual_information on a retail DataTable so that we can see the full capabilities of these functions.

[1]:
import pandas as pd
from woodwork import DataTable
from woodwork.demo import load_retail

dt = load_retail()
dt.types
[1]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id category Categorical {index}
order_id category Categorical {category}
product_id category Categorical {category}
description string NaturalLanguage {}
quantity Int64 WholeNumber {numeric}
order_date datetime64[ns] Datetime {time_index}
unit_price float64 Double {numeric}
customer_name category Categorical {category}
country category Categorical {category}
total float64 Double {numeric}
cancelled boolean Boolean {}

DataTable.describe

We use dt.describe() to calculate statistics for the Data Columns in a DataTable. The results are returned as a pandas DataFrame with the relevant calculations for each Data Column.

[2]:
dt.describe()
[2]:
order_id product_id description quantity order_date unit_price customer_name country total cancelled
physical_type category category string Int64 datetime64[ns] float64 category category float64 boolean
logical_type Categorical Categorical NaturalLanguage WholeNumber Datetime Double Categorical Categorical Double Boolean
semantic_tags {category} {category} {} {numeric} {time_index} {numeric} {category} {category} {numeric} {}
count 401604 401604 401604 401604 401604 401604 401604 401604 401604 401604
nunique 22190 3684 NaN 436 20460 620 4372 37 3952 NaN
nan_count 0 0 0 0 0 0 0 0 0 0
mean NaN NaN NaN 12.1833 2011-07-10 12:08:23.848567552 5.73221 NaN NaN 34.0125 NaN
mode 576339 85123A WHITE HANGING HEART T-LIGHT HOLDER 1 2011-11-14 15:27:00 2.0625 Mary Dalton United Kingdom 24.75 False
std NaN NaN NaN 250.283 NaN 115.111 NaN NaN 710.081 NaN
min NaN NaN NaN -80995 2010-12-01 08:26:00 0 NaN NaN -277975 NaN
first_quartile NaN NaN NaN 2 NaN 2.0625 NaN NaN 7.0125 NaN
second_quartile NaN NaN NaN 5 NaN 3.2175 NaN NaN 19.305 NaN
third_quartile NaN NaN NaN 12 NaN 6.1875 NaN NaN 32.67 NaN
max NaN NaN NaN 80995 2011-12-09 12:50:00 64300.5 NaN NaN 277975 NaN
num_true NaN NaN NaN NaN NaN NaN NaN NaN NaN 8872
num_false NaN NaN NaN NaN NaN NaN NaN NaN NaN 392732

There are a few things to note in the above dataframe:

  • The DataTable’s index, order_product_id, is not included

  • We provide each Data Column’s typing information according to Woodwork’s typing system

  • Any statistic that cannot be calculated for a Data Column, say num_false on a Datetime, will be filled with NaN

  • Null values are not counted in any of the calculations other than nan_count

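Most of these statistics mirror what pandas can compute directly. As a rough sketch of how the entries for a single numeric column could be derived with plain pandas (illustrative only, not Woodwork's actual implementation):

```python
import pandas as pd

# Illustrative sketch: describe-style statistics for one numeric column,
# computed with plain pandas. Note how the null value is excluded from
# count and the quartiles but is captured by nan_count.
s = pd.Series([2, 5, 12, None, 5], dtype="Int64")

stats = {
    "count": int(s.count()),           # non-null values only
    "nunique": s.nunique(),            # distinct non-null values
    "nan_count": int(s.isna().sum()),  # null values counted separately
    "mode": s.mode().iloc[0],
    "first_quartile": s.quantile(0.25),
    "second_quartile": s.quantile(0.5),
    "third_quartile": s.quantile(0.75),
}
print(stats)
```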

DataTable.value_counts

We can use dt.value_counts() to calculate the most frequent values for each Data Column that has category as a standard tag. This returns a dictionary where each Data Column is associated with a sorted list of dictionaries, each containing a value and a count.

[3]:
dt.value_counts()
[3]:
{'order_product_id': [{'value': 401603, 'count': 1},
  {'value': 133859, 'count': 1},
  {'value': 133861, 'count': 1},
  {'value': 133862, 'count': 1},
  {'value': 133863, 'count': 1},
  {'value': 133864, 'count': 1},
  {'value': 133865, 'count': 1},
  {'value': 133866, 'count': 1},
  {'value': 133867, 'count': 1},
  {'value': 133868, 'count': 1}],
 'order_id': [{'value': '576339', 'count': 542},
  {'value': '579196', 'count': 533},
  {'value': '580727', 'count': 529},
  {'value': '578270', 'count': 442},
  {'value': '573576', 'count': 435},
  {'value': '567656', 'count': 421},
  {'value': '567183', 'count': 392},
  {'value': '575607', 'count': 377},
  {'value': '571441', 'count': 364},
  {'value': '570488', 'count': 353}],
 'product_id': [{'value': '85123A', 'count': 2065},
  {'value': '22423', 'count': 1894},
  {'value': '85099B', 'count': 1659},
  {'value': '47566', 'count': 1409},
  {'value': '84879', 'count': 1405},
  {'value': '20725', 'count': 1346},
  {'value': '22720', 'count': 1224},
  {'value': 'POST', 'count': 1196},
  {'value': '22197', 'count': 1110},
  {'value': '23203', 'count': 1108}],
 'customer_name': [{'value': 'Mary Dalton', 'count': 7812},
  {'value': 'Dalton Grant', 'count': 5898},
  {'value': 'Jeremy Woods', 'count': 5128},
  {'value': 'Jasmine Salazar', 'count': 4459},
  {'value': 'James Robinson', 'count': 2759},
  {'value': 'Bryce Stewart', 'count': 2478},
  {'value': 'Vanessa Sanchez', 'count': 2085},
  {'value': 'Laura Church', 'count': 1853},
  {'value': 'Kelly Alvarado', 'count': 1667},
  {'value': 'Ashley Meyer', 'count': 1640}],
 'country': [{'value': 'United Kingdom', 'count': 356728},
  {'value': 'Germany', 'count': 9480},
  {'value': 'France', 'count': 8475},
  {'value': 'EIRE', 'count': 7475},
  {'value': 'Spain', 'count': 2528},
  {'value': 'Netherlands', 'count': 2371},
  {'value': 'Belgium', 'count': 2069},
  {'value': 'Switzerland', 'count': 1877},
  {'value': 'Portugal', 'count': 1471},
  {'value': 'Australia', 'count': 1258}]}
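The list-of-dicts structure above can be reproduced for a single column with plain pandas. A minimal sketch (illustrative only, not Woodwork's implementation):

```python
import pandas as pd

# Illustrative sketch: build a value_counts-style list of
# {'value': ..., 'count': ...} dicts, sorted by descending count,
# for one categorical column.
df = pd.DataFrame({"country": ["UK", "UK", "France", "UK", "France", "Spain"]})

counts = [
    {"value": value, "count": int(count)}
    for value, count in df["country"].value_counts().items()
]
print(counts)
```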

DataTable.get_mutual_information

dt.get_mutual_information will calculate the mutual information between all pairs of relevant Data Columns. Columns of certain types, such as datetimes or strings, are excluded because mutual information cannot be calculated for them.

The mutual information between columns A and B can be understood as the amount of knowledge we gain about column A by knowing the values of column B. The more mutual information there is between A and B, the less uncertainty there is about A given B, or vice versa.
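To make the intuition concrete, mutual information can be computed from the joint and marginal frequencies of two categorical columns. A conceptual sketch with pandas and numpy (not Woodwork's implementation, and without Woodwork's normalization):

```python
import numpy as np
import pandas as pd

# Conceptual sketch: mutual information (in bits) between two categorical
# columns, from their joint and marginal probability tables.
def mutual_info(a: pd.Series, b: pd.Series) -> float:
    joint = pd.crosstab(a, b, normalize=True)  # P(A, B)
    p_a = joint.sum(axis=1)                    # P(A)
    p_b = joint.sum(axis=0)                    # P(B)
    mi = 0.0
    for x in joint.index:
        for y in joint.columns:
            p_xy = joint.loc[x, y]
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_a[x] * p_b[y]))
    return mi

a = pd.Series(["r", "r", "g", "g"])
b = pd.Series(["hot", "hot", "cold", "cold"])  # fully determined by a
print(mutual_info(a, b))  # 1.0 bit: knowing a removes all uncertainty about b
```

With independent columns the same function returns 0, matching the intuition that knowing one column tells us nothing about the other.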

If we call dt.get_mutual_information(), we’ll see that order_date will be excluded from the resulting dataframe.

[4]:
dt.get_mutual_information()
[4]:
column_1 column_2 mutual_info
3 order_id customer_name 0.886411
0 order_id product_id 0.475745
8 product_id unit_price 0.426383
9 product_id customer_name 0.361855
16 quantity total 0.184497
22 customer_name country 0.155593
11 product_id total 0.152183
5 order_id total 0.129882
4 order_id country 0.126048
1 order_id quantity 0.114714
20 unit_price total 0.103210
23 customer_name total 0.099530
7 product_id quantity 0.088663
14 quantity customer_name 0.085515
13 quantity unit_price 0.082515
2 order_id unit_price 0.077681
27 total cancelled 0.044032
18 unit_price customer_name 0.041308
17 quantity cancelled 0.035528
10 product_id country 0.028569
25 country total 0.025071
6 order_id cancelled 0.022204
15 quantity country 0.021515
24 customer_name cancelled 0.006456
12 product_id cancelled 0.003769
26 country cancelled 0.003607
19 unit_price country 0.002603
21 unit_price cancelled 0.001677

Available Parameters

dt.get_mutual_information provides two parameters for tuning the mutual information calculation.

  • num_bins - In order to calculate mutual information on continuous data, we bin numeric data into categories. This parameter allows users to choose the number of bins with which to categorize data.

    • Defaults to using 10 bins

    • The more bins there are, the more distinct values a binned column can take. The number of bins used should accurately portray the spread of the data.

  • nrows - If nrows is set to a value below the number of rows in the DataTable, that number of rows will be randomly sampled from the underlying data

    • Defaults to using all the available rows.

    • Decreasing the number of rows can speed up the mutual information calculation on a DataTable with many rows, though care should be taken that the number being sampled is large enough to accurately portray the data.
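The kind of discretization that num_bins controls can be illustrated with pandas. This sketch assumes equal-width binning via pd.cut for illustration; it is not necessarily the binning strategy Woodwork uses internally:

```python
import pandas as pd

# Illustrative sketch: binning a continuous column before computing
# mutual information. More bins preserve finer detail about the values.
values = pd.Series(range(100), dtype=float)

coarse = pd.cut(values, bins=2, labels=False)   # few bins: coarse categories
fine = pd.cut(values, bins=10, labels=False)    # more bins: finer categories

print(coarse.nunique(), fine.nunique())  # 2 10
```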

Now we’ll explore changing the number of bins. Note that num_bins only affects the numeric Data Columns (quantity, unit_price, and total). We’re going to increase the number of bins from 10 to 50, showing only the rows that involve quantity or unit_price.

[5]:
mi = dt.get_mutual_information()
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
[5]:
column_1 column_2 mutual_info
8 product_id unit_price 0.426383
16 quantity total 0.184497
1 order_id quantity 0.114714
20 unit_price total 0.103210
7 product_id quantity 0.088663
14 quantity customer_name 0.085515
13 quantity unit_price 0.082515
2 order_id unit_price 0.077681
18 unit_price customer_name 0.041308
17 quantity cancelled 0.035528
15 quantity country 0.021515
19 unit_price country 0.002603
21 unit_price cancelled 0.001677
[6]:
mi = dt.get_mutual_information(num_bins=50)
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
[6]:
column_1 column_2 mutual_info
8 product_id unit_price 0.528865
20 unit_price total 0.405555
16 quantity total 0.349243
1 order_id quantity 0.157188
7 product_id quantity 0.143938
2 order_id unit_price 0.140257
14 quantity customer_name 0.113431
13 quantity unit_price 0.105052
17 quantity cancelled 0.081334
18 unit_price customer_name 0.078942
15 quantity country 0.023758
19 unit_price country 0.006311
21 unit_price cancelled 0.001671