Gain Statistical Insights into Your Data¶
Woodwork provides methods on your DataFrames that use the typing information it stores to help you better understand your data.
Follow along to learn how to use Woodwork's statistical methods on a DataFrame of retail data, demonstrating the full capabilities of the functions.
[1]:
import pandas as pd
import numpy as np
from woodwork.demo import load_retail
df = load_retail()
df.ww
[1]:
| Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| order_product_id | category | Categorical | ['index'] |
| order_id | category | Categorical | ['category'] |
| product_id | category | Categorical | ['category'] |
| description | string | NaturalLanguage | [] |
| quantity | int64 | Integer | ['numeric'] |
| order_date | datetime64[ns] | Datetime | ['time_index'] |
| unit_price | float64 | Double | ['numeric'] |
| customer_name | category | Categorical | ['category'] |
| country | category | Categorical | ['category'] |
| total | float64 | Double | ['numeric'] |
| cancelled | bool | Boolean | [] |
DataFrame.ww.describe¶
Use `df.ww.describe()` to calculate statistics for the columns in a DataFrame, returning the results as a pandas DataFrame with the relevant calculations done for each column.
[2]:
df.ww.describe()
[2]:
| | order_id | product_id | description | quantity | order_date | unit_price | customer_name | country | total | cancelled |
|---|---|---|---|---|---|---|---|---|---|---|
| physical_type | category | category | string | int64 | datetime64[ns] | float64 | category | category | float64 | bool |
| logical_type | Categorical | Categorical | NaturalLanguage | Integer | Datetime | Double | Categorical | Categorical | Double | Boolean |
| semantic_tags | {category} | {category} | {} | {numeric} | {time_index} | {numeric} | {category} | {category} | {numeric} | {} |
| count | 401604 | 401604 | 401604 | 401604.0 | 401604 | 401604.0 | 401604 | 401604 | 401604.0 | 401604 |
| nunique | 22190 | 3684 | NaN | 436.0 | 20460 | 620.0 | 4372 | 37 | 3946.0 | NaN |
| nan_count | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| mean | NaN | NaN | NaN | 12.183273 | 2011-07-10 12:08:23.848567552 | 5.732205 | NaN | NaN | 34.012502 | NaN |
| mode | 576339 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 1 | 2011-11-14 15:27:00 | 2.0625 | Mary Dalton | United Kingdom | 24.75 | False |
| std | NaN | NaN | NaN | 250.283037 | NaN | 115.110658 | NaN | NaN | 710.081161 | NaN |
| min | NaN | NaN | NaN | -80995.0 | 2010-12-01 08:26:00 | 0.0 | NaN | NaN | -277974.84 | NaN |
| first_quartile | NaN | NaN | NaN | 2.0 | NaN | 2.0625 | NaN | NaN | 7.0125 | NaN |
| second_quartile | NaN | NaN | NaN | 5.0 | NaN | 3.2175 | NaN | NaN | 19.305 | NaN |
| third_quartile | NaN | NaN | NaN | 12.0 | NaN | 6.1875 | NaN | NaN | 32.67 | NaN |
| max | NaN | NaN | NaN | 80995.0 | 2011-12-09 12:50:00 | 64300.5 | NaN | NaN | 277974.84 | NaN |
| num_true | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8872 |
| num_false | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 392732 |
There are a couple of things to note in the above DataFrame:

- The Woodwork index, `order_product_id`, is not included.
- We provide each column's typing information according to Woodwork's typing system.
- Any statistics that can't be calculated for a column, such as `num_false` on a `Datetime`, are filled with `NaN`.
- Null values do not get counted in any of the calculations other than `nunique`.
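Because `df.ww.describe()` returns a plain pandas DataFrame indexed by statistic name, ordinary pandas selection works on the result. A small sketch, using only the column and statistic names shown above:

```python
stats = df.ww.describe()

# One statistic across every column
stats.loc['nunique']

# A few statistics for a single column
stats.loc[['mean', 'std', 'max'], 'total']
```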
DataFrame.ww.value_counts¶
Use `df.ww.value_counts()` to calculate the most frequent values for each column that has `category` as a standard tag. This returns a dictionary where each column is associated with a sorted list of dictionaries. Each dictionary contains `value` and `count`.
[3]:
df.ww.value_counts()
[3]:
{'order_product_id': [{'value': 0, 'count': 1},
{'value': 267744, 'count': 1},
{'value': 267742, 'count': 1},
{'value': 267741, 'count': 1},
{'value': 267740, 'count': 1},
{'value': 267739, 'count': 1},
{'value': 267738, 'count': 1},
{'value': 267737, 'count': 1},
{'value': 267736, 'count': 1},
{'value': 267735, 'count': 1}],
'order_id': [{'value': '576339', 'count': 542},
{'value': '579196', 'count': 533},
{'value': '580727', 'count': 529},
{'value': '578270', 'count': 442},
{'value': '573576', 'count': 435},
{'value': '567656', 'count': 421},
{'value': '567183', 'count': 392},
{'value': '575607', 'count': 377},
{'value': '571441', 'count': 364},
{'value': '570488', 'count': 353}],
'product_id': [{'value': '85123A', 'count': 2065},
{'value': '22423', 'count': 1894},
{'value': '85099B', 'count': 1659},
{'value': '47566', 'count': 1409},
{'value': '84879', 'count': 1405},
{'value': '20725', 'count': 1346},
{'value': '22720', 'count': 1224},
{'value': 'POST', 'count': 1196},
{'value': '22197', 'count': 1110},
{'value': '23203', 'count': 1108}],
'customer_name': [{'value': 'Mary Dalton', 'count': 7812},
{'value': 'Dalton Grant', 'count': 5898},
{'value': 'Jeremy Woods', 'count': 5128},
{'value': 'Jasmine Salazar', 'count': 4459},
{'value': 'James Robinson', 'count': 2759},
{'value': 'Bryce Stewart', 'count': 2478},
{'value': 'Vanessa Sanchez', 'count': 2085},
{'value': 'Laura Church', 'count': 1853},
{'value': 'Kelly Alvarado', 'count': 1667},
{'value': 'Ashley Meyer', 'count': 1640}],
'country': [{'value': 'United Kingdom', 'count': 356728},
{'value': 'Germany', 'count': 9480},
{'value': 'France', 'count': 8475},
{'value': 'EIRE', 'count': 7475},
{'value': 'Spain', 'count': 2528},
{'value': 'Netherlands', 'count': 2371},
{'value': 'Belgium', 'count': 2069},
{'value': 'Switzerland', 'count': 1877},
{'value': 'Portugal', 'count': 1471},
{'value': 'Australia', 'count': 1258}]}
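Since each entry in the returned dictionary is a list of `{'value': ..., 'count': ...}` records, it converts naturally into pandas for further inspection or plotting. A minimal sketch:

```python
counts = df.ww.value_counts()

# Build a DataFrame of the most frequent countries from the records above
country_counts = pd.DataFrame(counts['country'])
print(country_counts.head())
```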
DataFrame.ww.mutual_information¶
`df.ww.mutual_information` calculates the mutual information between all pairs of relevant columns. Certain types, like strings, can't have mutual information calculated.

The mutual information between columns `A` and `B` can be understood as the amount of knowledge you can have about column `A` if you have the values of column `B`. The more mutual information there is between `A` and `B`, the less uncertainty there is in `A` knowing `B`, and vice versa.
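To make the idea concrete, here is a toy mutual information calculation for two binary columns, computed directly from an assumed joint probability table. This is only an illustration of the underlying quantity, not Woodwork's internal implementation:

```python
# I(A; B) = sum over a, b of p(a, b) * log(p(a, b) / (p(a) * p(b)))
joint = np.array([[0.4, 0.1],   # p(A=0, B=0), p(A=0, B=1)
                  [0.1, 0.4]])  # p(A=1, B=0), p(A=1, B=1)
p_a = joint.sum(axis=1)  # marginal distribution of A
p_b = joint.sum(axis=0)  # marginal distribution of B

mi = sum(
    joint[i, j] * np.log(joint[i, j] / (p_a[i] * p_b[j]))
    for i in range(2)
    for j in range(2)
)
print(mi)  # ~0.19; the value would be 0 if A and B were independent
```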
[4]:
df.ww.mutual_information()
[4]:
| | column_1 | column_2 | mutual_info |
|---|---|---|---|
| 0 | order_id | customer_name | 0.886411 |
| 1 | order_id | product_id | 0.475745 |
| 2 | product_id | unit_price | 0.426383 |
| 3 | order_id | order_date | 0.391906 |
| 4 | product_id | customer_name | 0.361855 |
| 5 | order_date | customer_name | 0.187982 |
| 6 | quantity | total | 0.184497 |
| 7 | customer_name | country | 0.155593 |
| 8 | product_id | total | 0.152183 |
| 9 | order_id | total | 0.129882 |
| 10 | order_id | country | 0.126048 |
| 11 | order_id | quantity | 0.114714 |
| 12 | unit_price | total | 0.103210 |
| 13 | customer_name | total | 0.099530 |
| 14 | product_id | quantity | 0.088663 |
| 15 | quantity | customer_name | 0.085515 |
| 16 | quantity | unit_price | 0.082515 |
| 17 | order_id | unit_price | 0.077681 |
| 18 | product_id | order_date | 0.057175 |
| 19 | total | cancelled | 0.044032 |
| 20 | unit_price | customer_name | 0.041308 |
| 21 | quantity | cancelled | 0.035528 |
| 22 | product_id | country | 0.028569 |
| 23 | country | total | 0.025071 |
| 24 | order_id | cancelled | 0.022204 |
| 25 | quantity | country | 0.021515 |
| 26 | order_date | country | 0.010361 |
| 27 | customer_name | cancelled | 0.006456 |
| 28 | product_id | cancelled | 0.003769 |
| 29 | country | cancelled | 0.003607 |
| 30 | order_date | unit_price | 0.003180 |
| 31 | order_date | total | 0.002625 |
| 32 | unit_price | country | 0.002603 |
| 33 | quantity | order_date | 0.002146 |
| 34 | unit_price | cancelled | 0.001677 |
| 35 | order_date | cancelled | 0.000199 |
Available Parameters¶
`df.ww.mutual_information` provides various parameters for tuning the mutual information calculation.

- `num_bins` - In order to calculate mutual information on continuous data, Woodwork bins numeric data into categories. This parameter allows you to choose the number of bins with which to categorize data.
  - Defaults to using 10 bins.
  - The more bins there are, the more variety a column will have. The number of bins used should accurately portray the spread of the data.
- `nrows` - If `nrows` is set at a value below the number of rows in the DataFrame, that number of rows is randomly sampled from the underlying data.
  - Defaults to using all the available rows.
  - Decreasing the number of rows can speed up the mutual information calculation on a DataFrame with many rows, but you should be careful that the number being sampled is large enough to accurately portray the data.
- `include_index` - If set to `True` and an index is defined with a logical type that is valid for mutual information, the index column will be included in the mutual information output.
  - Defaults to `False`.
Now that you understand the parameters, you can explore changing the number of bins. Note that this only affects the numeric columns `quantity` and `unit_price`. Increase the number of bins from 10 to 50, only showing the impacted columns.
[5]:
mi = df.ww.mutual_information()
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
[5]:
| | column_1 | column_2 | mutual_info |
|---|---|---|---|
| 2 | product_id | unit_price | 0.426383 |
| 6 | quantity | total | 0.184497 |
| 11 | order_id | quantity | 0.114714 |
| 12 | unit_price | total | 0.103210 |
| 14 | product_id | quantity | 0.088663 |
| 15 | quantity | customer_name | 0.085515 |
| 16 | quantity | unit_price | 0.082515 |
| 17 | order_id | unit_price | 0.077681 |
| 20 | unit_price | customer_name | 0.041308 |
| 21 | quantity | cancelled | 0.035528 |
| 25 | quantity | country | 0.021515 |
| 30 | order_date | unit_price | 0.003180 |
| 32 | unit_price | country | 0.002603 |
| 33 | quantity | order_date | 0.002146 |
| 34 | unit_price | cancelled | 0.001677 |
[6]:
mi = df.ww.mutual_information(num_bins=50)
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
[6]:
| | column_1 | column_2 | mutual_info |
|---|---|---|---|
| 2 | product_id | unit_price | 0.528865 |
| 4 | unit_price | total | 0.405555 |
| 7 | quantity | total | 0.349243 |
| 10 | order_id | quantity | 0.157188 |
| 13 | product_id | quantity | 0.143938 |
| 14 | order_id | unit_price | 0.140257 |
| 16 | quantity | customer_name | 0.113431 |
| 17 | quantity | unit_price | 0.105052 |
| 18 | quantity | cancelled | 0.081334 |
| 19 | unit_price | customer_name | 0.078942 |
| 24 | quantity | country | 0.023758 |
| 27 | order_date | unit_price | 0.011905 |
| 30 | unit_price | country | 0.006311 |
| 31 | quantity | order_date | 0.004170 |
| 34 | unit_price | cancelled | 0.001671 |
In order to include the index column in the mutual information output, run the calculation with `include_index=True`.
[7]:
mi = df.ww.mutual_information(include_index=True)
mi[mi['column_1'].isin(['order_product_id']) | mi['column_2'].isin(['order_product_id'])]
[7]:
| | column_1 | column_2 | mutual_info |
|---|---|---|---|
| 1 | order_product_id | order_id | 0.845419 |
| 2 | order_product_id | customer_name | 0.736457 |
| 3 | order_product_id | product_id | 0.732680 |
| 8 | order_product_id | order_date | 0.302856 |
| 9 | order_product_id | total | 0.302435 |
| 10 | order_product_id | unit_price | 0.299486 |
| 11 | order_product_id | quantity | 0.266489 |
| 21 | order_product_id | country | 0.093880 |
| 34 | order_product_id | cancelled | 0.016307 |
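The `nrows` parameter described above can be exercised in the same way. A sketch (the resulting values will differ slightly from the full-data results, since rows are sampled at random):

```python
# Sample 100,000 of the ~400K rows before calculating mutual information,
# trading a little accuracy for speed
mi_sampled = df.ww.mutual_information(nrows=100000)
mi_sampled.head()
```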
Outlier Detection with Series.ww.box_plot_dict¶
Woodwork allows for univariate outlier detection using the IQR, or interquartile range, method. This can be done on a by-column basis using the `series.ww.box_plot_dict` method, which identifies outliers and includes the statistical data necessary for building a box and whisker plot.
[8]:
total = df.ww['total']
box_plot_dict = total.ww.box_plot_dict()
print('high_bound: ', box_plot_dict['high_bound'])
print('low_bound: ', box_plot_dict['low_bound'])
print('quantiles: ', box_plot_dict['quantiles'])
print('number of low outliers: ', len(box_plot_dict['low_values']))
print('number of high outliers: ', len(box_plot_dict['high_values']))
high_bound:  71.15625
low_bound: -31.473750000000003
quantiles: {0.0: -277974.84, 0.25: 7.0124999999999975, 0.5: 19.305, 0.75: 32.669999999999995, 1.0: 277974.84}
number of low outliers: 1922
number of high outliers: 31016
We can see that the `total` column in the retail dataset has many outliers, and they are skewed more towards the top of the dataset. There are around 400K rows in the DataFrame, so about 8% of all values are outliers.
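The bounds returned by `box_plot_dict` can also be used directly to filter the outliers out of the column. A minimal sketch using the keys shown above:

```python
# Keep only the values that fall within the IQR-based bounds
within_bounds = total[
    (total >= box_plot_dict['low_bound']) & (total <= box_plot_dict['high_bound'])
]

# The number of rows dropped should match the outlier counts above
print(len(total) - len(within_bounds))
```

Let's also look at a normally distributed column of data of the same length and see what the statistics generated for it look like.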
[9]:
rnd = np.random.RandomState(33)
s = pd.Series(rnd.normal(50, 10, 401604))
s.ww.init()
box_plot_dict = s.ww.box_plot_dict()
print('high_bound: ', box_plot_dict['high_bound'])
print('low_bound: ', box_plot_dict['low_bound'])
print('quantiles: ', box_plot_dict['quantiles'])
print('number of low outliers: ', len(box_plot_dict['low_values']))
print('number of high outliers: ', len(box_plot_dict['high_values']))
high_bound:  77.04098129791424
low_bound: 22.897953681834856
quantiles: {0.0: 4.519658918840335, 0.25: 43.20158903786463, 0.5: 49.988236390934304, 0.75: 56.73734594188448, 1.0: 95.28094989391388}
number of low outliers: 1460
number of high outliers: 1381
With the normally distributed set of data, the likelihood of outliers is closer to what we'd expect, around 0.7%.
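That percentage comes straight from the counts in the returned dictionary. For example, for the normally distributed series:

```python
# Fraction of values flagged as outliers for the normal series
n_outliers = len(box_plot_dict['low_values']) + len(box_plot_dict['high_values'])
print(n_outliers / len(s))  # ~0.007, i.e. about 0.7% of the values
```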