Gain Statistical Insights into Your Data¶
Woodwork provides methods on your DataFrames to allow you to use the typing information stored by Woodwork to better understand your data.
Follow along to learn how to use Woodwork’s statistical methods on a DataFrame of retail data while demonstrating the full capabilities of the functions.
[1]:
import pandas as pd
import numpy as np
from woodwork.demo import load_retail
df = load_retail()
df.ww
[1]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
order_product_id | category | Categorical | ['index'] |
order_id | category | Categorical | ['category'] |
product_id | category | Categorical | ['category'] |
description | string | NaturalLanguage | [] |
quantity | int64 | Integer | ['numeric'] |
order_date | datetime64[ns] | Datetime | ['time_index'] |
unit_price | float64 | Double | ['numeric'] |
customer_name | category | Categorical | ['category'] |
country | category | Categorical | ['category'] |
total | float64 | Double | ['numeric'] |
cancelled | bool | Boolean | [] |
DataFrame.ww.describe¶
Use df.ww.describe()
to calculate statistics for the columns in a DataFrame, returning the results in the format of a pandas DataFrame with the relevant calculations done for each column. Note, that both nan
and (nan, nan)
values contribute to nan_count
for LatLong logical types
[2]:
df.ww.describe()
[2]:
order_id | product_id | description | quantity | order_date | unit_price | customer_name | country | total | cancelled | |
---|---|---|---|---|---|---|---|---|---|---|
physical_type | category | category | string | int64 | datetime64[ns] | float64 | category | category | float64 | bool |
logical_type | Categorical | Categorical | NaturalLanguage | Integer | Datetime | Double | Categorical | Categorical | Double | Boolean |
semantic_tags | {category} | {category} | {} | {numeric} | {time_index} | {numeric} | {category} | {category} | {numeric} | {} |
count | 401604 | 401604 | 401604 | 401604.0 | 401604 | 401604.0 | 401604 | 401604 | 401604.0 | 401604 |
nunique | 22190 | 3684 | NaN | 436.0 | 20460 | 620.0 | 4372 | 37 | 3946.0 | NaN |
nan_count | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
mean | NaN | NaN | NaN | 12.183273 | 2011-07-10 12:08:23.848567552 | 5.732205 | NaN | NaN | 34.012502 | NaN |
mode | 576339 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 1 | 2011-11-14 15:27:00 | 2.0625 | Mary Dalton | United Kingdom | 24.75 | False |
std | NaN | NaN | NaN | 250.283037 | NaN | 115.110658 | NaN | NaN | 710.081161 | NaN |
min | NaN | NaN | NaN | -80995.0 | 2010-12-01 08:26:00 | 0.0 | NaN | NaN | -277974.84 | NaN |
first_quartile | NaN | NaN | NaN | 2.0 | NaN | 2.0625 | NaN | NaN | 7.0125 | NaN |
second_quartile | NaN | NaN | NaN | 5.0 | NaN | 3.2175 | NaN | NaN | 19.305 | NaN |
third_quartile | NaN | NaN | NaN | 12.0 | NaN | 6.1875 | NaN | NaN | 32.67 | NaN |
max | NaN | NaN | NaN | 80995.0 | 2011-12-09 12:50:00 | 64300.5 | NaN | NaN | 277974.84 | NaN |
num_true | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8872 |
num_false | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 392732 |
There are a couple things to note in the above dataframe:
The Woodwork index,
order_product_id
, is not includedWe provide each column’s typing information according to Woodwork’s typing system
Any statistics that can’t be calculated for a column, such as
num_false
on aDatetime
are filled withNaN
.Null values do not get counted in any of the calculations other than
nunique
DataFrame.ww.value_counts¶
Use df.ww.value_counts()
to calculate the most frequent values for each column that has category
as a standard tag. This returns a dictionary where each column is associated with a sorted list of dictionaries. Each dictionary contains value
and count
.
[3]:
df.ww.value_counts()
[3]:
{'order_product_id': [{'value': 0, 'count': 1},
{'value': 267744, 'count': 1},
{'value': 267742, 'count': 1},
{'value': 267741, 'count': 1},
{'value': 267740, 'count': 1},
{'value': 267739, 'count': 1},
{'value': 267738, 'count': 1},
{'value': 267737, 'count': 1},
{'value': 267736, 'count': 1},
{'value': 267735, 'count': 1}],
'order_id': [{'value': '576339', 'count': 542},
{'value': '579196', 'count': 533},
{'value': '580727', 'count': 529},
{'value': '578270', 'count': 442},
{'value': '573576', 'count': 435},
{'value': '567656', 'count': 421},
{'value': '567183', 'count': 392},
{'value': '575607', 'count': 377},
{'value': '571441', 'count': 364},
{'value': '570488', 'count': 353}],
'product_id': [{'value': '85123A', 'count': 2065},
{'value': '22423', 'count': 1894},
{'value': '85099B', 'count': 1659},
{'value': '47566', 'count': 1409},
{'value': '84879', 'count': 1405},
{'value': '20725', 'count': 1346},
{'value': '22720', 'count': 1224},
{'value': 'POST', 'count': 1196},
{'value': '22197', 'count': 1110},
{'value': '23203', 'count': 1108}],
'customer_name': [{'value': 'Mary Dalton', 'count': 7812},
{'value': 'Dalton Grant', 'count': 5898},
{'value': 'Jeremy Woods', 'count': 5128},
{'value': 'Jasmine Salazar', 'count': 4459},
{'value': 'James Robinson', 'count': 2759},
{'value': 'Bryce Stewart', 'count': 2478},
{'value': 'Vanessa Sanchez', 'count': 2085},
{'value': 'Laura Church', 'count': 1853},
{'value': 'Kelly Alvarado', 'count': 1667},
{'value': 'Ashley Meyer', 'count': 1640}],
'country': [{'value': 'United Kingdom', 'count': 356728},
{'value': 'Germany', 'count': 9480},
{'value': 'France', 'count': 8475},
{'value': 'EIRE', 'count': 7475},
{'value': 'Spain', 'count': 2528},
{'value': 'Netherlands', 'count': 2371},
{'value': 'Belgium', 'count': 2069},
{'value': 'Switzerland', 'count': 1877},
{'value': 'Portugal', 'count': 1471},
{'value': 'Australia', 'count': 1258}]}
DataFrame.ww.mutual_information¶
df.ww.mutual_information
calculates the mutual information between all pairs of relevant columns. Certain types, like strings, can’t have mutual information calculated.
The mutual information between columns A
and B
can be understood as the amount of knowledge you can have about column A
if you have the values of column B
. The more mutual information there is between A
and B
, the less uncertainty there is in A
knowing B
, and vice versa.
[4]:
df.ww.mutual_information()
[4]:
column_1 | column_2 | mutual_info | |
---|---|---|---|
0 | order_id | customer_name | 0.886411 |
1 | order_id | product_id | 0.475745 |
2 | product_id | unit_price | 0.426383 |
3 | order_id | order_date | 0.391906 |
4 | product_id | customer_name | 0.361855 |
5 | order_date | customer_name | 0.187982 |
6 | quantity | total | 0.184497 |
7 | customer_name | country | 0.155593 |
8 | product_id | total | 0.152183 |
9 | order_id | total | 0.129882 |
10 | order_id | country | 0.126048 |
11 | order_id | quantity | 0.114714 |
12 | unit_price | total | 0.103210 |
13 | customer_name | total | 0.099530 |
14 | product_id | quantity | 0.088663 |
15 | quantity | customer_name | 0.085515 |
16 | quantity | unit_price | 0.082515 |
17 | order_id | unit_price | 0.077681 |
18 | product_id | order_date | 0.057175 |
19 | total | cancelled | 0.044032 |
20 | unit_price | customer_name | 0.041308 |
21 | quantity | cancelled | 0.035528 |
22 | product_id | country | 0.028569 |
23 | country | total | 0.025071 |
24 | order_id | cancelled | 0.022204 |
25 | quantity | country | 0.021515 |
26 | order_date | country | 0.010361 |
27 | customer_name | cancelled | 0.006456 |
28 | product_id | cancelled | 0.003769 |
29 | country | cancelled | 0.003607 |
30 | order_date | unit_price | 0.003180 |
31 | order_date | total | 0.002625 |
32 | unit_price | country | 0.002603 |
33 | quantity | order_date | 0.002146 |
34 | unit_price | cancelled | 0.001677 |
35 | order_date | cancelled | 0.000199 |
Available Parameters¶
df.ww.mutual_information
provides various parameters for tuning the mutual information calculation.
num_bins
- In order to calculate mutual information on continuous data, Woodwork bins numeric data into categories. This parameter allows you to choose the number of bins with which to categorize data.Defaults to using 10 bins
The more bins there are, the more variety a column will have. The number of bins used should accurately portray the spread of the data.
nrows
- Ifnrows
is set at a value below the number of rows in the DataFrame, that number of rows is randomly sampled from the underlying dataDefaults to using all the available rows.
Decreasing the number of rows can speed up the mutual information calculation on a DataFrame with many rows, but you should be careful that the number being sampled is large enough to accurately portray the data.
include_index
- If set toTrue
and an index is defined with a logical type that is valid for mutual information, the index column will be included in the mutual information output.Defaults to
False
Now that you understand the parameters, you can explore changing the number of bins. Note—this only affects numeric columns quantity
and unit_price
. Increase the number of bins from 10 to 50, only showing the impacted columns.
[5]:
mi = df.ww. mutual_information()
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
[5]:
column_1 | column_2 | mutual_info | |
---|---|---|---|
2 | product_id | unit_price | 0.426383 |
6 | quantity | total | 0.184497 |
11 | order_id | quantity | 0.114714 |
12 | unit_price | total | 0.103210 |
14 | product_id | quantity | 0.088663 |
15 | quantity | customer_name | 0.085515 |
16 | quantity | unit_price | 0.082515 |
17 | order_id | unit_price | 0.077681 |
20 | unit_price | customer_name | 0.041308 |
21 | quantity | cancelled | 0.035528 |
25 | quantity | country | 0.021515 |
30 | order_date | unit_price | 0.003180 |
32 | unit_price | country | 0.002603 |
33 | quantity | order_date | 0.002146 |
34 | unit_price | cancelled | 0.001677 |
[6]:
mi = df.ww.mutual_information(num_bins = 50)
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
[6]:
column_1 | column_2 | mutual_info | |
---|---|---|---|
2 | product_id | unit_price | 0.528865 |
4 | unit_price | total | 0.405555 |
7 | quantity | total | 0.349243 |
10 | order_id | quantity | 0.157188 |
13 | product_id | quantity | 0.143938 |
14 | order_id | unit_price | 0.140257 |
16 | quantity | customer_name | 0.113431 |
17 | quantity | unit_price | 0.105052 |
18 | quantity | cancelled | 0.081334 |
19 | unit_price | customer_name | 0.078942 |
24 | quantity | country | 0.023758 |
27 | order_date | unit_price | 0.011905 |
30 | unit_price | country | 0.006311 |
31 | quantity | order_date | 0.004170 |
34 | unit_price | cancelled | 0.001671 |
In order to include the index column in the mutual information output, run the calculation with include_index=True
.
[7]:
mi = df.ww.mutual_information(include_index=True)
mi[mi['column_1'].isin(['order_product_id']) | mi['column_2'].isin(['order_product_id'])]
[7]:
column_1 | column_2 | mutual_info | |
---|---|---|---|
1 | order_product_id | order_id | 0.845419 |
2 | order_product_id | customer_name | 0.736457 |
3 | order_product_id | product_id | 0.732680 |
8 | order_product_id | order_date | 0.302856 |
9 | order_product_id | total | 0.302435 |
10 | order_product_id | unit_price | 0.299486 |
11 | order_product_id | quantity | 0.266489 |
21 | order_product_id | country | 0.093880 |
34 | order_product_id | cancelled | 0.016307 |
Outlier Detection with Series.ww.box_plot_dict¶
Woodwork allows for univariate outlier detection using the IQR, or interquartile range, method. This can be done on a by-column basis using the series.ww.box_plot_dict
method that identifies outliers and includes the statistical data necessary for building a box and whisker plot.
[8]:
total = df.ww['total']
box_plot_dict = total.ww.box_plot_dict()
print('high bound: ', box_plot_dict['high_bound'])
print('low_bound: ', box_plot_dict['low_bound'])
print('quantiles: ', box_plot_dict['quantiles'])
print('number of low outliers: ', len(box_plot_dict['low_values']))
print('number of high outliers: ', len(box_plot_dict['high_values']))
high bound: 71.15625
low_bound: -31.473750000000003
quantiles: {0.0: -277974.84, 0.25: 7.0124999999999975, 0.5: 19.305, 0.75: 32.669999999999995, 1.0: 277974.84}
number of low outliers: 1922
number of high outliers: 31016
We can see that the total
column in the retail dataset has many outliers, and they are skewed more towards the top of the dataset. There are around 400K rows in the dataframe, so about 8% of all values are outliers. Let’s also look at a normally distributed column of data of the same length and see what the statistics generated for it look like.
[9]:
rnd = np.random.RandomState(33)
s = pd.Series(rnd.normal(50, 10, 401604))
s.ww.init()
box_plot_dict = s.ww.box_plot_dict()
print('high bound: ', box_plot_dict['high_bound'])
print('low_bound: ', box_plot_dict['low_bound'])
print('quantiles: ', box_plot_dict['quantiles'])
print('number of low outliers: ', len(box_plot_dict['low_values']))
print('number of high outliers: ', len(box_plot_dict['high_values']))
high bound: 77.04098129791424
low_bound: 22.897953681834856
quantiles: {0.0: 4.519658918840335, 0.25: 43.20158903786463, 0.5: 49.988236390934304, 0.75: 56.73734594188448, 1.0: 95.28094989391388}
number of low outliers: 1460
number of high outliers: 1381
With the normally distributed set of data, the likelyhood of outliers is closer to what we’d expect, around .7%.