Using Woodwork with Dask DataFrames

Woodwork enables DataTables to be created from Dask DataFrames when working with datasets that are too large to fit easily in memory. Creating a DataTable from a Dask DataFrame follows the same process as creating one from a pandas DataFrame, but there are a few limitations to be aware of. This guide provides a brief overview of creating a DataTable from a Dask DataFrame and outlines several key items to keep in mind when using a Dask DataFrame as input.

Creating a DataTable from a Dask DataFrame requires the installation of the Dask library, which can be done with the following command:

python -m pip install "woodwork[dask]"

First, we create a Dask DataFrame to use in our example. Normally you would create the Dask DataFrame by reading the data directly from saved files, but here we create it from a demo pandas DataFrame.

[1]:
import dask.dataframe as dd
import woodwork as ww

df_pandas = ww.demo.load_retail(nrows=1000, return_dataframe=True)
df_dask = dd.from_pandas(df_pandas, npartitions=10)
df_dask
[1]:
Dask DataFrame Structure:
order_product_id order_id product_id description quantity order_date unit_price customer_name country total cancelled
npartitions=10
0 int64 object object object int64 datetime64[ns] float64 object object float64 bool
100 ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ...
900 ... ... ... ... ... ... ... ... ... ... ...
999 ... ... ... ... ... ... ... ... ... ... ...
Dask Name: from_pandas, 10 tasks

Now that we have a Dask DataFrame, we can use it to create a Woodwork DataTable, just as we would with a pandas DataFrame:

[2]:
dt = ww.DataTable(df_dask, index='order_product_id')
dt.types
[2]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id Int64 WholeNumber {index}
order_id category Categorical {category}
product_id category Categorical {category}
description string NaturalLanguage {}
quantity Int64 WholeNumber {numeric}
order_date datetime64[ns] Datetime {}
unit_price float64 Double {numeric}
customer_name string NaturalLanguage {}
country string NaturalLanguage {}
total float64 Double {numeric}
cancelled boolean Boolean {}

As you can see from the output above, the DataTable was created successfully, and logical type inference was performed for all of the columns. However, this brings us to one of the key issues in working with Dask DataFrames.

In order to perform logical type inference, Woodwork needs to bring the data into memory so it can be analyzed. Currently, Woodwork reads data from the first partition only and uses that data for type inference. Depending on the complexity of the data, this can be a time-consuming operation. Additionally, if the first partition is not representative of the entire dataset, the logical types for some columns may be inferred incorrectly.

If this process takes too much time, or if the logical types are not inferred correctly, users have the ability to manually specify the logical types for each column. If the logical type for a column is specified, type inference for that column will be skipped. If logical types are specified for all columns, logical type inference will be skipped completely and Woodwork will not need to bring any of the data into memory when creating the DataTable.

To skip logical type inference completely, or to correct type inference issues, define a logical types dictionary with the correct logical type for each column in the dataframe, then pass that dictionary to the call that creates the DataTable, as shown below:

[3]:
logical_types = {
    'order_product_id': 'WholeNumber',
    'order_id': 'Categorical',
    'product_id': 'Categorical',
    'description': 'NaturalLanguage',
    'quantity': 'WholeNumber',
    'order_date': 'Datetime',
    'unit_price': 'Double',
    'customer_name': 'FullName',
    'country': 'Categorical',
    'total': 'Double',
    'cancelled': 'Boolean',
}

dt = ww.DataTable(df_dask, index='order_product_id', logical_types=logical_types)
dt.types
[3]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id Int64 WholeNumber {index}
order_id category Categorical {category}
product_id category Categorical {category}
description string NaturalLanguage {}
quantity Int64 WholeNumber {numeric}
order_date datetime64[ns] Datetime {}
unit_price float64 Double {numeric}
customer_name string FullName {}
country category Categorical {category}
total float64 Double {numeric}
cancelled boolean Boolean {}

Three DataTable methods also require bringing the underlying Dask DataFrame into memory: describe, value_counts, and get_mutual_information. When called, these methods trigger a compute operation on the DataFrame associated with the DataTable in order to calculate the requested information. This may be problematic for datasets that cannot fit in memory, so exercise caution when using these methods.

[4]:
dt.describe(include=['numeric'])
[4]:
quantity unit_price total
physical_type Int64 float64 float64
logical_type WholeNumber Double Double
semantic_tags {numeric} {numeric} {numeric}
count 1000 1000 1000
nunique 43 61 232
nan_count 0 0 0
mean 12.735 5.00366 40.3905
mode 1 2.0625 24.75
std 38.4016 9.73817 123.994
min -24 0.165 -68.31
first_quartile 2 2.0625 5.709
second_quartile 4 3.34125 17.325
third_quartile 12 6.1875 33.165
max 600 272.25 2684.88
num_true NaN NaN NaN
num_false NaN NaN NaN
[5]:
dt.value_counts()
[5]:
{'order_id': [{'value': '536464', 'count': 81},
  {'value': '536520', 'count': 71},
  {'value': '536412', 'count': 68},
  {'value': '536401', 'count': 64},
  {'value': '536415', 'count': 59},
  {'value': '536409', 'count': 54},
  {'value': '536408', 'count': 48},
  {'value': '536381', 'count': 35},
  {'value': '536488', 'count': 34},
  {'value': '536446', 'count': 31}],
 'product_id': [{'value': '22632', 'count': 11},
  {'value': '85123A', 'count': 10},
  {'value': '22633', 'count': 10},
  {'value': '84029E', 'count': 9},
  {'value': '22961', 'count': 9},
  {'value': '22960', 'count': 7},
  {'value': '84879', 'count': 7},
  {'value': '22866', 'count': 7},
  {'value': '21212', 'count': 7},
  {'value': '22197', 'count': 7}],
 'country': [{'value': 'United Kingdom', 'count': 964},
  {'value': 'France', 'count': 20},
  {'value': 'Australia', 'count': 14},
  {'value': 'Netherlands', 'count': 2}]}
[6]:
dt.get_mutual_information().head()
[6]:
column_1 column_2 mutual_info
0 order_id product_id 0.595564
7 product_id unit_price 0.517738
9 product_id total 0.433166
6 product_id quantity 0.386756
1 order_id quantity 0.281075

Data Validation Limitations

When creating a DataTable, several validation checks are performed to confirm that the data in the underlying dataframe is appropriate for the specified parameters. Because some of these validation steps require pulling the underlying data into memory, they are skipped when creating a DataTable from a Dask DataFrame. This section provides an overview of the validation checks that are performed with pandas input but skipped with Dask input.

Index Uniqueness

Normally, a check is performed to verify that any column specified as the index contains no duplicate values. This check is skipped for Dask input, so users must manually verify that any column specified as the index contains unique values.
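As a minimal sketch of how this might be verified manually with the example above (assuming order_product_id is the index column), the number of unique values can be compared to the number of rows. Note that both operations trigger a compute on the Dask DataFrame, so use caution with very large datasets:

# Manually confirm the index column contains no duplicates.
# Both of these operations trigger a compute on the Dask DataFrame.
n_unique = df_dask['order_product_id'].nunique().compute()
n_rows = len(df_dask)  # len() also computes for Dask DataFrames

if n_unique != n_rows:
    raise ValueError("Index column 'order_product_id' contains duplicate values")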

Data Consistency with LogicalType

If users manually define the LogicalType for a column when creating the DataTable, a check is performed to verify that the data in that column is appropriate for the specified LogicalType. For example, with pandas input, if the user specifies a LogicalType of Double for a column that contains letters such as ['a', 'b', 'c'], an error is raised because the letters cannot be converted into numeric values with the float dtype associated with the Double LogicalType.

With Dask input, no such error is raised when the DataTable is created. Behind the scenes, Woodwork still attempts to convert the column's physical type to float, but this conversion is simply added to the Dask task graph without raising an error. The error surfaces only when a compute operation is called on the underlying DataFrame and Dask attempts to execute the conversion step. Take extra care with Dask input to make sure any specified logical types are consistent with the data in the columns in order to avoid this type of error.
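To illustrate, here is a hedged sketch of how this deferred error can arise. The small DataFrame, the to_dataframe accessor, and the commented-out compute call are illustrative assumptions rather than part of the example dataset above:

import dask.dataframe as dd
import pandas as pd
import woodwork as ww

# A column of letters that cannot be cast to the float dtype used by Double
bad_df = dd.from_pandas(
    pd.DataFrame({'id': [0, 1, 2], 'letters': ['a', 'b', 'c']}),
    npartitions=1,
)

# With Dask input, no error is raised here even though 'letters' cannot be
# converted to Double; the cast is only added to the Dask task graph.
dt_bad = ww.DataTable(
    bad_df,
    index='id',
    logical_types={'id': 'WholeNumber', 'letters': 'Double'},
)

# The error only surfaces once the underlying DataFrame is computed, e.g.:
# dt_bad.to_dataframe().compute()  # raises when Dask executes the cast to float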

Similarly, for the Ordinal LogicalType, a check is normally performed to make sure that the column does not contain any values missing from the defined order. This check is not performed with Dask input, so users should manually verify that the defined order values are complete to avoid unexpected results.
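As a hedged sketch of one way to perform such a check manually, the column's unique values can be compared against the defined order. The priority column and the order values below are hypothetical and not part of the demo dataset:

import dask.dataframe as dd
import pandas as pd

# Hypothetical ordinal column and order values, used for illustration only
ordinal_df = dd.from_pandas(
    pd.DataFrame({'priority': ['low', 'high', 'medium', 'urgent']}),
    npartitions=1,
)
defined_order = ['low', 'medium', 'high']

# Compare the column's unique values against the defined order (triggers a compute)
unexpected = set(ordinal_df['priority'].unique().compute()) - set(defined_order)
if unexpected:
    print(f"Values missing from the defined order: {unexpected}")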

Other Limitations

Reading from CSV Files

Woodwork provides the ability to read data directly from a CSV file into a DataTable. During this process, Woodwork creates the underlying dataframe so that the user does not have to. However, the helper function used for this, woodwork.read_csv, currently reads the data into a pandas DataFrame only. At some point, we hope to remove this limitation and also allow data to be read into a Dask DataFrame, but for now only pandas DataFrames can be created by this function.
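In the meantime, a simple workaround is to read the CSV into a Dask DataFrame yourself and then create the DataTable from the result, as in this sketch (the file path is illustrative):

import dask.dataframe as dd
import woodwork as ww

# Read the CSV into a Dask DataFrame directly, then create the DataTable.
# The path below is illustrative.
df = dd.read_csv('retail_data.csv')
dt = ww.DataTable(df, index='order_product_id')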

Equality of Dask DataTables

In order to avoid bringing a Dask DataFrame into memory, Woodwork does not consider the equality of the underlying data when checking whether a DataTable made from a Dask DataFrame is equal to another DataTable. This means that two DataTables with identical names, columns, indices, semantic tags, and LogicalTypes but different underlying data will be considered equal if at least one of them uses Dask.