Gain Statistical Insights into Your DataTable

Woodwork provides methods on DataTable to allow users to utilize the typing information inherent in a DataTable to better understand their data.

Let’s walk through how to use describe and get_mutual_information on a retail DataTable so that we can see the full capabilities of the functions.

[1]:
import pandas as pd
from woodwork import DataTable
from woodwork.demo import load_retail

dt = load_retail()
dt.types
[1]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id category Categorical {index}
order_id category Categorical {category}
product_id category Categorical {category}
description string NaturalLanguage {}
quantity Int64 WholeNumber {numeric}
order_date datetime64[ns] Datetime {time_index}
unit_price float64 Double {numeric}
customer_name category Categorical {category}
country category Categorical {category}
total float64 Double {numeric}
cancelled boolean Boolean {}

DataTable.describe

We use dt.describe() to calculate statistics for the Data Columns in a DataTable in the format of a Pandas DataFrame with the relevant calculations done for each Data Column.

[2]:
dt.describe()
[2]:
order_id product_id description quantity order_date unit_price customer_name country total cancelled
physical_type category category string Int64 datetime64[ns] float64 category category float64 boolean
logical_type Categorical Categorical NaturalLanguage WholeNumber Datetime Double Categorical Categorical Double Boolean
semantic_tags {category} {category} {} {numeric} {time_index} {numeric} {category} {category} {numeric} {}
count 401604 401604 401604 401604 401604 401604 401604 401604 401604 401604
nunique 22190 3684 NaN 436 20460 620 4372 37 3952 NaN
nan_count 0 0 0 0 0 0 0 0 0 0
mean NaN NaN NaN 12.1833 2011-07-10 12:08:23.848567552 5.73221 NaN NaN 34.0125 NaN
mode 576339 85123A WHITE HANGING HEART T-LIGHT HOLDER 1 2011-11-14 15:27:00 2.0625 Mary Dalton United Kingdom 24.75 False
std NaN NaN NaN 250.283 NaN 115.111 NaN NaN 710.081 NaN
min NaN NaN NaN -80995 2010-12-01 08:26:00 0 NaN NaN -277975 NaN
first_quartile NaN NaN NaN 2 NaN 2.0625 NaN NaN 7.0125 NaN
second_quartile NaN NaN NaN 5 NaN 3.2175 NaN NaN 19.305 NaN
third_quartile NaN NaN NaN 12 NaN 6.1875 NaN NaN 32.67 NaN
max NaN NaN NaN 80995 2011-12-09 12:50:00 64300.5 NaN NaN 277975 NaN
num_true NaN NaN NaN NaN NaN NaN NaN NaN NaN 8872
num_false NaN NaN NaN NaN NaN NaN NaN NaN NaN 392732

There are a couple things to note in the above dataframe:

  • The DataTable’s index, order_product_id, is not included

  • We provide each Data Column’s typing information according to Woodwork’s typing system

  • Any statistic that cannot be calculated for a Data Column, say num_false on a Datetime, will be filled with NaN.

  • Null values would not get counted in any of the calculations other than nunique

DataTable.get_mutual_information()

dt.get_mutual_information will calculate the mutual information between all pairs of relevant Data Columns. Certain types such as datetimes or strings cannot have mutual information calculated.

The mutual information between columns A and B can be understood as the amount of knowlege we can have about column A if we have the values of column B. The more mutual information there is between A and B, the less uncertainty there is in A knowing B or vice versa.

If we call dt.get_mutual_information(), we’ll see that order_date will be excluded from the resulting dataframe.

[3]:
dt.get_mutual_information()
[3]:
column_1 column_2 mutual_info
16 order_id customer_name 0.886411
7 product_id order_id 0.475745
0 unit_price product_id 0.426383
11 product_id customer_name 0.361855
22 quantity total 0.184497
27 customer_name country 0.155593
10 product_id total 0.152183
15 order_id total 0.129882
17 order_id country 0.126048
14 order_id quantity 0.114714
4 unit_price total 0.103210
25 total customer_name 0.099530
9 product_id quantity 0.088663
23 quantity customer_name 0.085515
3 unit_price quantity 0.082515
1 unit_price order_id 0.077681
19 cancelled total 0.044032
5 unit_price customer_name 0.041308
18 cancelled quantity 0.035528
12 product_id country 0.028569
26 total country 0.025071
13 order_id cancelled 0.022204
24 quantity country 0.021515
20 cancelled customer_name 0.006456
8 product_id cancelled 0.003769
21 cancelled country 0.003607
6 unit_price country 0.002603
2 unit_price cancelled 0.001677

Available Parameters

dt.get_mutual_information provides two parameters for tuning the mutual information calculation.

  • num_bins - In order to calculate mutual information on continuous data, we bin numeric data into categories. This parameter allows users to choose the number of bins with which to categorize data.

    • Defaults to using 10 bins

    • The more bins there are, the more variety a column will have. The number of bins used should accurately portray the spread of the data.

  • nrows - If nrows is set at a value below the number of rows in the DataTable, that number of rows will be randomly sampled from the underlying data

    • Defaults to using all the available rows.

    • Decreasing the number of rows can speed up the mutual information calculation on a DataTable with many rows, though care should be taken that the number being sampled is large enough to accurately portray the data.

Now we’ll explore changing the number of bins. Note that this will only impact numeric Data Columns quantity and unit_price. We’re going to increase the number of bins from 10 to 50, only showing the impacted columns.

[4]:
mi = dt.get_mutual_information()
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
[4]:
column_1 column_2 mutual_info
0 unit_price product_id 0.426383
22 quantity total 0.184497
14 order_id quantity 0.114714
4 unit_price total 0.103210
9 product_id quantity 0.088663
23 quantity customer_name 0.085515
3 unit_price quantity 0.082515
1 unit_price order_id 0.077681
5 unit_price customer_name 0.041308
18 cancelled quantity 0.035528
24 quantity country 0.021515
6 unit_price country 0.002603
2 unit_price cancelled 0.001677
[5]:
mi = dt.get_mutual_information(num_bins = 50)
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
[5]:
column_1 column_2 mutual_info
0 unit_price product_id 0.528865
4 unit_price total 0.405555
22 quantity total 0.349243
14 order_id quantity 0.157188
9 product_id quantity 0.143938
1 unit_price order_id 0.140257
23 quantity customer_name 0.113431
3 unit_price quantity 0.105052
18 cancelled quantity 0.081334
5 unit_price customer_name 0.078942
24 quantity country 0.023758
6 unit_price country 0.006311
2 unit_price cancelled 0.001671