Gain Statistical Insights into Your DataTable

Woodwork provides methods on DataTable to allow you to use the typing information inherent in a DataTable to better understand your data.

Follow along to learn how to use describe and mutual_information on a retail DataTable so that you can see the full capabilities of the functions.

[1]:
import pandas as pd
from woodwork import DataTable
from woodwork.demo import load_retail

dt = load_retail()
dt
[1]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id category Categorical ['index']
order_id category Categorical ['category']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity Int64 Integer ['numeric']
order_date datetime64[ns] Datetime ['time_index']
unit_price float64 Double ['numeric']
customer_name category Categorical ['category']
country category Categorical ['category']
total float64 Double ['numeric']
cancelled boolean Boolean []

DataTable.describe

Use dt.describe() to calculate statistics for the DataColumns in a DataTable in the format of a pandas DataFrame with the relevant calculations done for each DataColumn.

[2]:
dt.describe()
[2]:
order_id product_id description quantity order_date unit_price customer_name country total cancelled
physical_type category category string Int64 datetime64[ns] float64 category category float64 boolean
logical_type Categorical Categorical NaturalLanguage Integer Datetime Double Categorical Categorical Double Boolean
semantic_tags {category} {category} {} {numeric} {time_index} {numeric} {category} {category} {numeric} {}
count 401604 401604 401604 401604 401604 401604 401604 401604 401604 401604
nunique 22190 3684 NaN 436 20460 620 4372 37 3952 NaN
nan_count 0 0 0 0 0 0 0 0 0 0
mean NaN NaN NaN 12.1833 2011-07-10 12:08:23.848567552 5.73221 NaN NaN 34.0125 NaN
mode 576339 85123A WHITE HANGING HEART T-LIGHT HOLDER 1 2011-11-14 15:27:00 2.0625 Mary Dalton United Kingdom 24.75 False
std NaN NaN NaN 250.283 NaN 115.111 NaN NaN 710.081 NaN
min NaN NaN NaN -80995 2010-12-01 08:26:00 0 NaN NaN -277975 NaN
first_quartile NaN NaN NaN 2 NaN 2.0625 NaN NaN 7.0125 NaN
second_quartile NaN NaN NaN 5 NaN 3.2175 NaN NaN 19.305 NaN
third_quartile NaN NaN NaN 12 NaN 6.1875 NaN NaN 32.67 NaN
max NaN NaN NaN 80995 2011-12-09 12:50:00 64300.5 NaN NaN 277975 NaN
num_true NaN NaN NaN NaN NaN NaN NaN NaN NaN 8872
num_false NaN NaN NaN NaN NaN NaN NaN NaN NaN 392732

There are a couple things to note in the above dataframe:

  • The DataTable’s index, order_product_id, is not included

  • We provide each DataColumn’s typing information according to Woodwork’s typing system

  • Any statistics that can’t be calculated for a DataColumn, say num_false on a Datetime are filled with NaN.

  • Null values do not get counted in any of the calculations other than nunique

DataTable.value_counts

Use dt.value_counts() to calculate the most frequent values for each Data Columns that has category as a standard tag. This returns a dictionary where each DataColumn is associated with a sorted list of dictionaries. Each dictionary contains value and count.

[3]:
dt.value_counts()
[3]:
{'order_product_id': [{'value': 401603, 'count': 1},
  {'value': 133859, 'count': 1},
  {'value': 133861, 'count': 1},
  {'value': 133862, 'count': 1},
  {'value': 133863, 'count': 1},
  {'value': 133864, 'count': 1},
  {'value': 133865, 'count': 1},
  {'value': 133866, 'count': 1},
  {'value': 133867, 'count': 1},
  {'value': 133868, 'count': 1}],
 'order_id': [{'value': '576339', 'count': 542},
  {'value': '579196', 'count': 533},
  {'value': '580727', 'count': 529},
  {'value': '578270', 'count': 442},
  {'value': '573576', 'count': 435},
  {'value': '567656', 'count': 421},
  {'value': '567183', 'count': 392},
  {'value': '575607', 'count': 377},
  {'value': '571441', 'count': 364},
  {'value': '570488', 'count': 353}],
 'product_id': [{'value': '85123A', 'count': 2065},
  {'value': '22423', 'count': 1894},
  {'value': '85099B', 'count': 1659},
  {'value': '47566', 'count': 1409},
  {'value': '84879', 'count': 1405},
  {'value': '20725', 'count': 1346},
  {'value': '22720', 'count': 1224},
  {'value': 'POST', 'count': 1196},
  {'value': '22197', 'count': 1110},
  {'value': '23203', 'count': 1108}],
 'customer_name': [{'value': 'Mary Dalton', 'count': 7812},
  {'value': 'Dalton Grant', 'count': 5898},
  {'value': 'Jeremy Woods', 'count': 5128},
  {'value': 'Jasmine Salazar', 'count': 4459},
  {'value': 'James Robinson', 'count': 2759},
  {'value': 'Bryce Stewart', 'count': 2478},
  {'value': 'Vanessa Sanchez', 'count': 2085},
  {'value': 'Laura Church', 'count': 1853},
  {'value': 'Kelly Alvarado', 'count': 1667},
  {'value': 'Ashley Meyer', 'count': 1640}],
 'country': [{'value': 'United Kingdom', 'count': 356728},
  {'value': 'Germany', 'count': 9480},
  {'value': 'France', 'count': 8475},
  {'value': 'EIRE', 'count': 7475},
  {'value': 'Spain', 'count': 2528},
  {'value': 'Netherlands', 'count': 2371},
  {'value': 'Belgium', 'count': 2069},
  {'value': 'Switzerland', 'count': 1877},
  {'value': 'Portugal', 'count': 1471},
  {'value': 'Australia', 'count': 1258}]}

DataTable.mutual_information

dt.mutual_information calculates the mutual information between all pairs of relevant DataColumns. Certain types, like strings, can’t have mutual information calculated.

The mutual information between columns A and B can be understood as the amount of knowledge you can have about column A if you have the values of column B. The more mutual information there is between A and B, the less uncertainty there is in A knowing B, and vice versa.

[4]:
dt.mutual_information()
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:992: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name], num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:995: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name].astype('int64'), num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:992: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name], num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:992: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name], num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
[4]:
column_1 column_2 mutual_info
0 order_id customer_name 0.886411
1 order_id product_id 0.475745
2 product_id unit_price 0.426383
3 order_id order_date 0.391906
4 product_id customer_name 0.361855
5 order_date customer_name 0.187982
6 quantity total 0.184497
7 customer_name country 0.155593
8 product_id total 0.152183
9 order_id total 0.129882
10 order_id country 0.126048
11 order_id quantity 0.114714
12 unit_price total 0.103210
13 customer_name total 0.099530
14 product_id quantity 0.088663
15 quantity customer_name 0.085515
16 quantity unit_price 0.082515
17 order_id unit_price 0.077681
18 product_id order_date 0.057175
19 total cancelled 0.044032
20 unit_price customer_name 0.041308
21 quantity cancelled 0.035528
22 product_id country 0.028569
23 country total 0.025071
24 order_id cancelled 0.022204
25 quantity country 0.021515
26 order_date country 0.010361
27 customer_name cancelled 0.006456
28 product_id cancelled 0.003769
29 country cancelled 0.003607
30 order_date unit_price 0.003180
31 order_date total 0.002625
32 unit_price country 0.002603
33 quantity order_date 0.002146
34 unit_price cancelled 0.001677
35 order_date cancelled 0.000199

Available Parameters

dt.mutual_information provides two parameters for tuning the mutual information calculation.

  • num_bins - In order to calculate mutual information on continuous data, Woodwork bins numeric data into categories. This parameter allows you to choose the number of bins with which to categorize data.

    • Defaults to using 10 bins

    • The more bins there are, the more variety a column will have. The number of bins used should accurately portray the spread of the data.

  • nrows - If nrows is set at a value below the number of rows in the DataTable, that number of rows is randomly sampled from the underlying data

    • Defaults to using all the available rows.

    • Decreasing the number of rows can speed up the mutual information calculation on a DataTable with many rows, but you should be careful that the number being sampled is large enough to accurately portray the data.

Now that you understand the parameters, you can explore changing the number of bins. Note—this only affects numeric Data Columns quantity and unit_price. Increase the number of bins from 10 to 50, only showing the impacted columns.

[5]:
mi = dt.mutual_information()
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:992: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name], num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:995: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name].astype('int64'), num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:992: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name], num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:992: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name], num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
[5]:
column_1 column_2 mutual_info
2 product_id unit_price 0.426383
6 quantity total 0.184497
11 order_id quantity 0.114714
12 unit_price total 0.103210
14 product_id quantity 0.088663
15 quantity customer_name 0.085515
16 quantity unit_price 0.082515
17 order_id unit_price 0.077681
20 unit_price customer_name 0.041308
21 quantity cancelled 0.035528
25 quantity country 0.021515
30 order_date unit_price 0.003180
32 unit_price country 0.002603
33 quantity order_date 0.002146
34 unit_price cancelled 0.001677
[6]:
mi = dt.mutual_information(num_bins = 50)
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:992: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name], num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:995: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name].astype('int64'), num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:992: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name], num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:992: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = pd.qcut(data[col_name], num_bins, duplicates="drop")
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.7/site-packages/woodwork/datatable.py:1000: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col_name] = new_col.cat.codes
[6]:
column_1 column_2 mutual_info
2 product_id unit_price 0.528865
4 unit_price total 0.405555
7 quantity total 0.349243
10 order_id quantity 0.157188
13 product_id quantity 0.143938
14 order_id unit_price 0.140257
16 quantity customer_name 0.113431
17 quantity unit_price 0.105052
18 quantity cancelled 0.081334
19 unit_price customer_name 0.078942
24 quantity country 0.023758
27 order_date unit_price 0.011905
30 unit_price country 0.006311
31 quantity order_date 0.004170
34 unit_price cancelled 0.001671