Saving and Loading DataFrames¶
In this guide, you will learn how to save and load Woodwork DataFrames.
Saving a Woodwork DataFrame¶
After defining a Woodwork DataFrame with the proper logical types and semantic tags, you can save the DataFrame and its typing information by using DataFrame.ww.to_disk. This method creates a directory that contains a data folder and a woodwork_typing_info.json file. To illustrate, we will use this retail DataFrame, which already comes configured with Woodwork typing information.
[1]:
from woodwork.demo import load_retail
df = load_retail(nrows=100)
df.ww.schema
[1]:
Column | Logical Type | Semantic Tag(s) |
---|---|---|
order_product_id | Categorical | ['index'] |
order_id | Categorical | ['category'] |
product_id | Categorical | ['category'] |
description | NaturalLanguage | [] |
quantity | Integer | ['numeric'] |
order_date | Datetime | ['time_index'] |
unit_price | Double | ['numeric'] |
customer_name | Categorical | ['category'] |
country | Categorical | ['category'] |
total | Double | ['numeric'] |
cancelled | Boolean | [] |
[2]:
df.head()
[2]:
 | order_product_id | order_id | product_id | description | quantity | order_date | unit_price | customer_name | country | total | cancelled |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 4.2075 | Andrea Brown | United Kingdom | 25.245 | False |
1 | 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 5.5935 | Andrea Brown | United Kingdom | 33.561 | False |
2 | 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 4.5375 | Andrea Brown | United Kingdom | 36.300 | False |
3 | 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 5.5935 | Andrea Brown | United Kingdom | 33.561 | False |
4 | 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 5.5935 | Andrea Brown | United Kingdom | 33.561 | False |
From the ww accessor, use to_disk to save the Woodwork DataFrame.
[3]:
df.ww.to_disk('retail')
You should see a new directory that contains the data and typing information.
retail
├── data
│ └── demo_retail_data.csv
└── woodwork_typing_info.json
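To confirm that a saved directory has the layout shown above, a short check with pathlib is enough. The following is a minimal sketch: the helper name missing_entries is hypothetical (it is not part of Woodwork), and the temporary mock directory exists only so the example runs on its own.

```python
import tempfile
from pathlib import Path


def missing_entries(table_dir, data_subdirectory="data",
                    typing_info_filename="woodwork_typing_info.json"):
    """List the expected entries that are absent from table_dir.

    Hypothetical helper for illustration; not part of Woodwork.
    """
    table_dir = Path(table_dir)
    missing = []
    if not (table_dir / data_subdirectory).is_dir():
        missing.append(data_subdirectory)
    if not (table_dir / typing_info_filename).is_file():
        missing.append(typing_info_filename)
    return missing


# Build a mock layout so the sketch runs without a real saved table.
with tempfile.TemporaryDirectory() as tmp:
    retail = Path(tmp, "retail")
    (retail / "data").mkdir(parents=True)
    before = missing_entries(retail)  # typing file not written yet
    (retail / "woodwork_typing_info.json").write_text("{}")
    after = missing_entries(retail)   # both entries now present

print(before)
print(after)
```

Against the directory created earlier, the call would be missing_entries('retail'), and an empty list means both expected entries are present.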
Data Directory¶
The data directory contains the underlying data written in the specified format. The method derives the filename from DataFrame.ww.name and uses CSV as the default format. You can change the format by setting the method’s format parameter to any of the following formats:
csv (default)
pickle
parquet
Typing Information¶
In woodwork_typing_info.json, you can see all of the typing information and metadata associated with the DataFrame. This information includes:
the version of the schema at the time of saving the DataFrame
the DataFrame name specified by DataFrame.ww.name
the column names for the index and time index
the column typing information, which contains the logical types with their parameters and semantic tags for each column
the loading information required for the DataFrame type and file format
the table metadata provided by DataFrame.ww.metadata (must be JSON serializable)
{
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [...],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {
            "compression": null,
            "sep": ",",
            "encoding": "utf-8",
            "engine": "python",
            "index": false
        }
    },
    "table_metadata": {}
}
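Because the typing information is plain JSON, it can be inspected with Python's standard json module without loading the data. The sketch below is illustrative only. It writes a trimmed copy of the example above to a temporary directory, with an empty column_typing_info list standing in for the elided entries, and reads a few fields back. The helper name summarize_typing_info is hypothetical and not part of Woodwork, and treating location as relative to the saved directory is an assumption based on the directory tree shown earlier.

```python
import json
import tempfile
from pathlib import Path


def summarize_typing_info(table_dir):
    """Return a few top-level fields from a saved table's typing file.

    Hypothetical helper for illustration; not part of Woodwork.
    """
    table_dir = Path(table_dir)
    with open(table_dir / "woodwork_typing_info.json") as f:
        info = json.load(f)
    loading = info["loading_info"]
    return {
        "name": info["name"],
        "index": info["index"],
        "time_index": info["time_index"],
        "format": loading["type"],
        # Assumption: "location" is relative to the saved directory.
        "data_path": table_dir / loading["location"],
    }


# Stand-in for a saved directory; values are copied from the example above.
sample = {
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [],  # elided in the example above
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
    },
    "table_metadata": {},
}

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "woodwork_typing_info.json").write_text(json.dumps(sample))
    summary = summarize_typing_info(tmp)

print(summary["name"], summary["format"], summary["data_path"].name)
```

For the directory saved earlier, the call would be summarize_typing_info('retail').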
Loading a Woodwork DataFrame¶
After saving a Woodwork DataFrame, you can load the DataFrame and typing information by using woodwork.deserialize.read_woodwork_table. This function uses the typing information stored in the specified directory to recreate the Woodwork DataFrame.
[4]:
from woodwork.deserialize import read_woodwork_table
df = read_woodwork_table('retail')
df.ww.schema
[4]:
Column | Logical Type | Semantic Tag(s) |
---|---|---|
order_product_id | Categorical | ['index'] |
order_id | Categorical | ['category'] |
product_id | Categorical | ['category'] |
description | NaturalLanguage | [] |
quantity | Integer | ['numeric'] |
order_date | Datetime | ['time_index'] |
unit_price | Double | ['numeric'] |
customer_name | Categorical | ['category'] |
country | Categorical | ['category'] |
total | Double | ['numeric'] |
cancelled | Boolean | [] |
Loading the DataFrame and typing information separately¶
You can also use woodwork.read_file to load a Woodwork DataFrame without the typing information. This approach is helpful if you want to get started quickly and let Woodwork infer the typing information from the underlying data. To illustrate, let’s read the CSV file from the previous example directly into a Woodwork DataFrame.
[5]:
from woodwork import read_file
df = read_file('retail/data/demo_retail_data.csv')
df.ww
[5]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
order_product_id | int64 | Integer | ['numeric'] |
order_id | int64 | Integer | ['numeric'] |
product_id | string | Unknown | [] |
description | string | NaturalLanguage | [] |
quantity | int64 | Integer | ['numeric'] |
order_date | datetime64[ns] | Datetime | [] |
unit_price | float64 | Double | ['numeric'] |
customer_name | category | Categorical | ['category'] |
country | category | Categorical | ['category'] |
total | float64 | Double | ['numeric'] |
cancelled | bool | Boolean | [] |
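Comparing this inferred schema with the one restored by read_woodwork_table shows what inference alone does not recover. The sketch below uses plain dictionaries whose values are copied by hand from the two schema tables in this guide. It does not call Woodwork, and the variable names are only illustrative.

```python
# Logical types copied from the schema restored by read_woodwork_table.
saved = {
    "order_product_id": "Categorical",
    "order_id": "Categorical",
    "product_id": "Categorical",
    "description": "NaturalLanguage",
    "quantity": "Integer",
    "order_date": "Datetime",
    "unit_price": "Double",
    "customer_name": "Categorical",
    "country": "Categorical",
    "total": "Double",
    "cancelled": "Boolean",
}

# Logical types copied from the schema inferred by read_file.
inferred = {
    "order_product_id": "Integer",
    "order_id": "Integer",
    "product_id": "Unknown",
    "description": "NaturalLanguage",
    "quantity": "Integer",
    "order_date": "Datetime",
    "unit_price": "Double",
    "customer_name": "Categorical",
    "country": "Categorical",
    "total": "Double",
    "cancelled": "Boolean",
}

# Keep only the columns whose logical type differs between the two schemas.
changed = {
    column: (saved[column], inferred[column])
    for column in saved
    if saved[column] != inferred[column]
}

for column, (before, after) in changed.items():
    print(f"{column}: saved as {before}, inferred as {after}")
```

Three columns differ in logical type. The inferred schema also has no index or time index tags, which is why supplying the typing information explicitly is useful.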
Typing information is optional in read_file, but you can still pass the typing information parameters to control how Woodwork is initialized. To illustrate, we will read data files in different formats directly into a Woodwork DataFrame using the following typing information.
[6]:
typing_information = {
    'index': 'order_product_id',
    'time_index': 'order_date',
    'logical_types': {
        'order_product_id': 'Categorical',
        'order_id': 'Categorical',
        'product_id': 'Categorical',
        'description': 'NaturalLanguage',
        'quantity': 'Integer',
        'order_date': 'Datetime',
        'unit_price': 'Double',
        'customer_name': 'Categorical',
        'country': 'Categorical',
        'total': 'Double',
        'cancelled': 'Boolean',
    },
    'semantic_tags': {
        'order_id': {'category'},
        'product_id': {'category'},
        'quantity': {'numeric'},
        'unit_price': {'numeric'},
        'customer_name': {'category'},
        'country': {'category'},
        'total': {'numeric'},
    },
}
First, let’s create the data files in different formats from a pandas DataFrame.
[7]:
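import pandas as pd

# Re-read the CSV saved earlier, then write it back out in three formats.
pandas_df = pd.read_csv('retail/data/demo_retail_data.csv')
# index=False keeps the pandas row index from being written as an extra column.
pandas_df.to_csv('retail/data.csv', index=False)
pandas_df.to_parquet('retail/data.parquet')
pandas_df.to_feather('retail/data.feather')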
Now, you can use read_file to load the data directly into a Woodwork DataFrame based on your typing information. This function uses the content_type parameter to determine the file format. If content_type is not specified, it will try to infer the file format from the file extension.
[8]:
woodwork_df = read_file(
    filepath='retail/data.csv',
    content_type='text/csv',
    **typing_information,
)
woodwork_df = read_file(
    filepath='retail/data.parquet',
    content_type='application/parquet',
    **typing_information,
)
woodwork_df = read_file(
    filepath='retail/data.feather',
    content_type='application/feather',
    **typing_information,
)
woodwork_df.ww
[8]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
order_product_id | category | Categorical | ['index'] |
order_id | category | Categorical | ['category'] |
product_id | category | Categorical | ['category'] |
description | string | NaturalLanguage | [] |
quantity | int64 | Integer | ['numeric'] |
order_date | datetime64[ns] | Datetime | ['time_index'] |
unit_price | float64 | Double | ['numeric'] |
customer_name | category | Categorical | ['category'] |
country | category | Categorical | ['category'] |
total | float64 | Double | ['numeric'] |
cancelled | bool | Boolean | [] |
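The fallback from content_type to the file extension described above can be pictured with a small standalone sketch. This is not Woodwork's implementation: the helper guess_content_type and the lookup table are hypothetical, and the table covers only the three content types used in the calls above.

```python
from pathlib import Path

# Content types taken from the read_file calls above, keyed by file extension.
CONTENT_TYPE_BY_EXTENSION = {
    ".csv": "text/csv",
    ".parquet": "application/parquet",
    ".feather": "application/feather",
}


def guess_content_type(filepath, content_type=None):
    """Prefer an explicit content_type; otherwise fall back to the extension.

    Hypothetical helper for illustration; not Woodwork's own logic.
    """
    if content_type is not None:
        return content_type
    extension = Path(filepath).suffix.lower()
    try:
        return CONTENT_TYPE_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(
            f"cannot infer a content type from {filepath!r}"
        ) from None


for path in ("retail/data.csv", "retail/data.parquet", "retail/data.feather"):
    print(path, "->", guess_content_type(path))
```

Passing content_type explicitly, as the calls above do, avoids relying on the filename at all, which matters for files with a missing or unconventional extension.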