Using Woodwork with Dask and Koalas DataFrames — Woodwork 0.0.11 documentation

Dask DataTable Example¶

Create a Dask DataFrame to use in our example. Normally you create the DataFrame directly by reading in the data from saved files, but you will create it from a demo pandas DataFrame.

[1]:

import dask.dataframe as dd
import woodwork as ww

df_pandas = ww.demo.load_retail(nrows=1000, return_dataframe=True)
df_dask = dd.from_pandas(df_pandas, npartitions=10)
df_dask

[1]:

Dask DataFrame Structure:

	order_product_id	order_id	product_id	description	quantity	order_date	unit_price	customer_name	country	total	cancelled
npartitions=10
0	int64	object	object	object	int64	datetime64[ns]	float64	object	object	float64	bool
100	...	...	...	...	...	...	...	...	...	...	...
...	...	...	...	...	...	...	...	...	...	...	...
900	...	...	...	...	...	...	...	...	...	...	...
999	...	...	...	...	...	...	...	...	...	...	...

Dask Name: from_pandas, 10 tasks

Now that you have a Dask DataFrame, you can use it to create a Woodwork DataTable, just as you would with a pandas DataFrame:

[2]:

dt = ww.DataTable(df_dask, index='order_product_id')
dt

[2]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	Int64	Integer	['index']
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	string	NaturalLanguage	[]
quantity	Int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	string	NaturalLanguage	[]
country	string	NaturalLanguage	[]
total	float64	Double	['numeric']
cancelled	boolean	Boolean	[]

As you can see from the output above, the DataTable has been created successfully, and logical type inference was performed for all of the columns.

However, that illustrates one of the key issues in working with Dask DataFrames. In order to perform logical type inference, Woodwork needs to bring the data into memory so it can be analyzed. Currently, Woodwork reads data from the first partition of data only, and then uses this data for type inference. Depending on the complexity of the data, this could be a time consuming operation. Additionally, if the first partition is not representative of the entire dataset, the logical types for some columns may be inferred incorrectly.

Skipping or Overriding Type Inference¶

If this process takes too much time, or if the logical types are not inferred correctly, you can manually specify the logical types for each column. If the logical type for a column is specified, type inference for that column will be skipped. If logical types are specified for all columns, logical type inference will be skipped completely and Woodwork will not need to bring any of the data into memory when creating the DataTable.

To skip logical type inference completely or to correct type inference issues, define a logical types dictionary with the correct logical type defined for each column in the dataframe, then pass that dictionary to the call to create the DataTable.

[3]:

logical_types = {
    'order_product_id': 'Integer',
    'order_id': 'Categorical',
    'product_id': 'Categorical',
    'description': 'NaturalLanguage',
    'quantity': 'Integer',
    'order_date': 'Datetime',
    'unit_price': 'Double',
    'customer_name': 'FullName',
    'country': 'Categorical',
    'total': 'Double',
    'cancelled': 'Boolean',
}

dt = ww.DataTable(df_dask, index='order_product_id', logical_types=logical_types)
dt

[3]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	Int64	Integer	['index']
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	string	NaturalLanguage	[]
quantity	Int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	string	FullName	[]
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	boolean	Boolean	[]

Analyzing Underlying Data¶

There are three DataTable methods that require bringing the underlying Dask DataFrame into memory: describe, value_counts and mutual_information. When called, these methods will call a compute operation on the DataFrame associated with the DataTable to calculate the desired information. This might be problematic for datasets that cannot fit in memory, so exercise caution when using these methods.

[4]:

dt.describe(include=['numeric'])

[4]:

	quantity	unit_price	total
physical_type	Int64	float64	float64
logical_type	Integer	Double	Double
semantic_tags	{numeric}	{numeric}	{numeric}
count	1000	1000	1000
nunique	43	61	232
nan_count	0	0	0
mean	12.735	5.00366	40.3905
mode	1	2.0625	24.75
std	38.4016	9.73817	123.994
min	-24	0.165	-68.31
first_quartile	2	2.0625	5.709
second_quartile	4	3.34125	17.325
third_quartile	12	6.1875	33.165
max	600	272.25	2684.88
num_true	NaN	NaN	NaN
num_false	NaN	NaN	NaN

[5]:

dt.value_counts()

[5]:

{'order_id': [{'value': '536464', 'count': 81},
  {'value': '536520', 'count': 71},
  {'value': '536412', 'count': 68},
  {'value': '536401', 'count': 64},
  {'value': '536415', 'count': 59},
  {'value': '536409', 'count': 54},
  {'value': '536408', 'count': 48},
  {'value': '536381', 'count': 35},
  {'value': '536488', 'count': 34},
  {'value': '536446', 'count': 31}],
 'product_id': [{'value': '22632', 'count': 11},
  {'value': '85123A', 'count': 10},
  {'value': '22633', 'count': 10},
  {'value': '84029E', 'count': 9},
  {'value': '22961', 'count': 9},
  {'value': '22960', 'count': 7},
  {'value': '84879', 'count': 7},
  {'value': '22866', 'count': 7},
  {'value': '21212', 'count': 7},
  {'value': '22197', 'count': 7}],
 'country': [{'value': 'United Kingdom', 'count': 964},
  {'value': 'France', 'count': 20},
  {'value': 'Australia', 'count': 14},
  {'value': 'Netherlands', 'count': 2}]}

[6]:

dt.mutual_information().head()

[6]:

	column_1	column_2	mutual_info
0	order_id	order_date	0.777905
1	order_id	product_id	0.595564
2	product_id	unit_price	0.517738
3	product_id	total	0.433166
4	product_id	order_date	0.404885

Koalas DataTable Example¶

As above, first create a Koalas DataFrame to use in our example. Normally you create the DataFrame directly by reading in the data from saved files, but here you create it from a demo pandas DataFrame.

[7]:

import databricks.koalas as ks

df_koalas = ks.from_pandas(df_pandas)
df_koalas.head()

[7]:

	order_product_id	order_id	product_id	description	quantity	order_date	unit_price	customer_name	country	total	cancelled
0	0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	4.2075	Andrea Brown	United Kingdom	25.245	False
1	1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False
2	2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	4.5375	Andrea Brown	United Kingdom	36.300	False
3	3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False
4	4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False

Now that you have a Koalas DataFrame, you can use it to create a Woodwork DataTable, just as you would with a pandas DataFrame:

[8]:

dt = ww.DataTable(df_koalas, index='order_product_id')
dt

[8]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	int64	Integer	['index']
order_id	object	Categorical	['category']
product_id	object	Categorical	['category']
description	object	NaturalLanguage	[]
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	object	NaturalLanguage	[]
country	object	NaturalLanguage	[]
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]

As you can see from the output above, the DataTable has been created successfully, and logical type inference was performed for all of the columns.

Notes on Koalas Dtype Conversions¶

In the types table above, one important thing to notice is that the physical types for the Koalas DataTable are quite different than the physical types for the Dask DataTable. The reason for this is that several of the physical types that can be used with pandas and Dask can’t be used with Koalas.

When a Woodwork DataTable is created, the dtype of the underlying dataframe columns are converted to a set of standard dtypes, defined by the LogicalType pandas_dtype property. Woodwork uses several nullable dtypes such as Int64, string and boolean which are not compatible with Koalas. In addition the category dtype cannot be used with Koalas DataFrames.

For LogicalTypes that have pandas_dtype properties that are not compatible with Koalas, Woodwork will try to convert the column dtype, but will be unsuccessful. At that point, Woodwork will use a backup dtype that is compatible with Koalas. The implication of this is that creating a DataTable from a Koalas DataFrame may result in dtype values that are different than the values you would get when creating the DataTable from an identical pandas DataFrame.

The backup datatypes used are as follows: Int64: int64 boolean: bool string: object category: object

Since Koalas does not support the category dtype, any column that is inferred or specified with a logical type of Categorical will have its values converted to strings and stored with a dtype of object. This means that a categorical column containing numeric values, will be converted into the equivalent string values when the DataTable is created.

As Koalas does not support the nullable boolean dtype, Woodwook will fall back to use the compatible bool dtype for columns with a Boolean LogicalType. If the corresponding column contains any missing values, these missing values will be converted to False values when the column dtype is updated.

Finally, Koalas does not support the timedelta64[ns] dtype. For this, there is not a clean backup dtype, so the use of Timedelta LogicalType is not supported in Koalas DataTables.

Skipping or Overriding Type Inference¶

As with Dask, Woodwork must bring the data into memory so it can be analyzed for type inference. Currently, Woodwork reads the first 100,000 rows of data to use for type inference when using a Koalas DataFrame as input. If the first 100,000 rows are not representative of the entire dataset, the logical types for some columns might be inferred incorrectly.

To skip logical type inference completely or to correct type inference issues, define a logical types dictionary with the correct logical type defined for each column in the dataframe.

[9]:

logical_types = {
    'order_product_id': 'Integer',
    'order_id': 'Categorical',
    'product_id': 'Categorical',
    'description': 'NaturalLanguage',
    'quantity': 'Integer',
    'order_date': 'Datetime',
    'unit_price': 'Double',
    'customer_name': 'FullName',
    'country': 'Categorical',
    'total': 'Double',
    'cancelled': 'Boolean',
}

dt = ww.DataTable(df_koalas, index='order_product_id', logical_types=logical_types)
dt

[9]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	int64	Integer	['index']
order_id	object	Categorical	['category']
product_id	object	Categorical	['category']
description	object	NaturalLanguage	[]
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	object	FullName	[]
country	object	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]

Analyzing Underlying Data¶

As with Dask, running describe, value_counts or mutual_information requires bringing the data into memory to perform the analysis. When called, these methods will call a to_pandas operation on the DataFrame associated with the DataTable to calculate the desired information. This may be problematic for very large datasets, so exercise caution when using these methods.

[10]:

dt.describe(include=['numeric'])

[10]:

	quantity	unit_price	total
physical_type	int64	float64	float64
logical_type	Integer	Double	Double
semantic_tags	{numeric}	{numeric}	{numeric}
count	1000	1000	1000
nunique	43	61	232
nan_count	0	0	0
mean	12.735	5.00366	40.3905
mode	1	2.0625	24.75
std	38.4016	9.73817	123.994
min	-24	0.165	-68.31
first_quartile	2	2.0625	5.709
second_quartile	4	3.34125	17.325
third_quartile	12	6.1875	33.165
max	600	272.25	2684.88
num_true	NaN	NaN	NaN
num_false	NaN	NaN	NaN

[11]:

dt.value_counts()

[11]:

{'order_id': [{'value': '536464', 'count': 81},
  {'value': '536520', 'count': 71},
  {'value': '536412', 'count': 68},
  {'value': '536401', 'count': 64},
  {'value': '536415', 'count': 59},
  {'value': '536409', 'count': 54},
  {'value': '536408', 'count': 48},
  {'value': '536381', 'count': 35},
  {'value': '536488', 'count': 34},
  {'value': '536446', 'count': 31}],
 'product_id': [{'value': '22632', 'count': 11},
  {'value': '22633', 'count': 10},
  {'value': '85123A', 'count': 10},
  {'value': '22961', 'count': 9},
  {'value': '84029E', 'count': 9},
  {'value': '22960', 'count': 7},
  {'value': '84879', 'count': 7},
  {'value': '22866', 'count': 7},
  {'value': '21212', 'count': 7},
  {'value': '22197', 'count': 7}],
 'country': [{'value': 'United Kingdom', 'count': 964},
  {'value': 'France', 'count': 20},
  {'value': 'Australia', 'count': 14},
  {'value': 'Netherlands', 'count': 2}]}

[12]:

dt.mutual_information().head()

[12]:

	column_1	column_2	mutual_info
0	order_id	order_date	0.777905
1	order_id	product_id	0.595564
2	product_id	unit_price	0.517738
3	product_id	total	0.433166
4	product_id	order_date	0.404885

Data Validation Limitations¶

When creating a DataTable several validation checks are performed to confirm that the data in the underlying dataframe is appropriate for the specified parameters. Because some of these validation steps require pulling the underlying data into memory, they are skipped when creating a DataTable from a Dask or Koalas DataFrame. This section provides an overview of the validation checks that are performed with pandas input but skipped with Dask or Koalas input.

Index Uniqueness¶

Normally a check is performed to verify that any column specified as the index contains no duplicate values. With Dask or Koalas input, this check is skipped and you must manually verify that any column specified as an index column contains unique values.

Data Consistency with LogicalType (Dask Only)¶

If you manually define the LogicalType for a column when creating the DataTable, a check is performed to verify that the data in that column is appropriate for the specified LogicalType. For example, with pandas input if you specify a LogicalType of Double for a column that contains letters such as ['a', 'b', 'c'], an error is raised because it is not possible to convert the letters into numeric values with the float dtype associated with the Double LogicalType.

With Dask input, no such error appears at the time of DataTable creation. However, behind the scenes, Woodwork attempts to convert the column physical type to float, and this conversion is added to the Dask task graph, without raising an error. However, an error is raised if a compute operation is called on the underlying DataFrame once Dask attempts to execute the conversion step. Extra care should be taken when using Dask input to make sure any specified logical types are consistent with the data in the columns to avoid this type of error.

Ordinal Order Values Check¶

For the Ordinal LogicalType, a check is typically performed to make sure that the data column does not contain any values that are not present in the defined order values. This check will not be performed with Dask or Koalas input. Users should manually verify that the defined order values are complete to avoid unexpected results.

Other Limitations¶

Reading from CSV Files¶

Woodwork provides the ability to read data directly from a CSV file into a DataTable. During that process, Woodwork creates the underlying dataframe so the user does not have to do so. The helper function used for this, woodwork.read_csv, currently only reads the data into a pandas DataFrame. At some point, this limitation may be removed, allowing data to be read into a Dask or Koalas DataFrame. For now, only pandas DataFrames can be created with this function.

Sorting DataFrame on Time Index¶

When creating a DataTable with a time index, Woodwork, by default, will sort the input dataframe first on the time index and then on the index, if specified. Because sorting a distributed DataFrame is a computationally expensive operation, this sorting is performed only when creating a DataTable from a pandas DataFrame. If a sorted DataFrame is needed when using a Dask or Koalas, the user should first sort the DataFrame before creating the DataTable.

Equality of DataTables¶

In order to avoid bringing a Dask DataFrame into memory, Woodwoork does not consider the equality of the underlying data when checking whether a DataTable made from a Dask or Koalas DataFrame is equal to another DataTable. This means that two DataTables with identical names, columns, indices, semantic tags, and LogicalTypes but different underlying data will return True if at least one of them uses Dask or Koalas.

LatLong Columns¶

When working with the LatLong logical type, Woodwork converts all LatLong columns to a standard format of a tuple of floats for Dask dataframes and a list of floats for Koalas dataframes. In order to do this, the data is read into memory, which may be problematic for large datatsets.

Integer Column Names¶

Woodwork generally allows column names of any format that is supported by the underlying DataFrame. However, Dask DataTables do not currently support integer column names.

Using Woodwork with Dask and Koalas DataFrames¶

Dask DataTable Example¶

Skipping or Overriding Type Inference¶

Analyzing Underlying Data¶

Koalas DataTable Example¶

Notes on Koalas Dtype Conversions¶

Skipping or Overriding Type Inference¶

Analyzing Underlying Data¶

Data Validation Limitations¶

Index Uniqueness¶

Data Consistency with LogicalType (Dask Only)¶

Ordinal Order Values Check¶

Other Limitations¶

Reading from CSV Files¶

Sorting DataFrame on Time Index¶

Equality of DataTables¶

LatLong Columns¶

Integer Column Names¶