Saving and Loading DataFrames
In this guide, you will learn how to save and load Woodwork DataFrames.
Saving a Woodwork DataFrame
After defining a Woodwork DataFrame with the proper logical types and semantic tags, you can save the DataFrame and its typing information by using DataFrame.ww.to_disk. By default, this method creates a directory that contains a data folder and a woodwork_typing_info.json file, but you can specify different values if needed. Refer to the API Guide for more information on the parameters that can be specified when using the to_disk method.
To illustrate, we will use the retail demo DataFrame, which already comes configured with Woodwork typing information.
[1]:
from woodwork.demo import load_retail
df = load_retail(nrows=100)
df.ww.schema
[1]:
Column | Logical Type | Semantic Tag(s) |
---|---|---|
order_product_id | Categorical | ['index'] |
order_id | Categorical | ['category'] |
product_id | Categorical | ['category'] |
description | NaturalLanguage | [] |
quantity | Integer | ['numeric'] |
order_date | Datetime | ['time_index'] |
unit_price | Double | ['numeric'] |
customer_name | Categorical | ['category'] |
country | Categorical | ['category'] |
total | Double | ['numeric'] |
cancelled | Boolean | [] |
[2]:
df.head()
[2]:
order_product_id | order_id | product_id | description | quantity | order_date | unit_price | customer_name | country | total | cancelled | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 4.2075 | Andrea Brown | United Kingdom | 25.245 | False |
1 | 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 5.5935 | Andrea Brown | United Kingdom | 33.561 | False |
2 | 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 4.5375 | Andrea Brown | United Kingdom | 36.300 | False |
3 | 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 5.5935 | Andrea Brown | United Kingdom | 33.561 | False |
4 | 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 5.5935 | Andrea Brown | United Kingdom | 33.561 | False |
From the ww accessor, use to_disk to save the Woodwork DataFrame.
[3]:
df.ww.to_disk("retail")
You should see a new directory that contains the data and typing information.
retail
├── data
│ └── demo_retail_data.csv
└── woodwork_typing_info.json
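To confirm this layout programmatically, a small helper can walk the output directory. This is a minimal sketch that uses only the standard library; `list_saved_files` is a hypothetical helper name and is not part of Woodwork.

```python
from pathlib import Path


def list_saved_files(root):
    """Return the files under `root` as sorted POSIX-style relative paths."""
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )
```

For the directory shown above, `list_saved_files("retail")` should return the data file under `data/` followed by `woodwork_typing_info.json`.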
Data Directory
The data directory contains the underlying data written in the specified format. If the user does not specify a filename, the method derives the filename from DataFrame.ww.name and uses CSV as the default format. You can change the format by setting the method’s format parameter to any of the following formats:
csv (default)
pickle
parquet
arrow
feather
orc
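As an illustration, the format can be chosen at save time through the `format` parameter. The sketch below wraps `DataFrame.ww.to_disk` in a hypothetical helper, `save_in_format`, that checks the requested format against the list above before saving. It assumes `df` is a DataFrame with Woodwork already initialized, and that any optional dependency required by the chosen format (such as pyarrow for parquet) is installed.

```python
# Formats listed in this guide for the `format` parameter of to_disk.
SUPPORTED_FORMATS = ("csv", "pickle", "parquet", "arrow", "feather", "orc")


def save_in_format(df, path, file_format="csv"):
    """Save a Woodwork-initialized DataFrame to `path` in `file_format`.

    Hypothetical convenience wrapper; it only forwards to DataFrame.ww.to_disk.
    """
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {file_format!r}")
    df.ww.to_disk(path, format=file_format)
    return path
```

For example, `save_in_format(df, "retail_parquet", "parquet")` would write the retail data as a parquet file instead of a CSV file.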
Typing Information
In woodwork_typing_info.json, you can see all of the typing information and metadata associated with the DataFrame. This information includes:
the version of the schema at the time of saving the DataFrame
the DataFrame name specified by DataFrame.ww.name
the column names for the index and time index
the column typing information, which contains the logical types with their parameters and semantic tags for each column
the loading information required for the DataFrame type and file format
the table metadata provided by DataFrame.ww.metadata (must be JSON serializable)
{
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [...],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {
            "compression": null,
            "sep": ",",
            "encoding": "utf-8",
            "engine": "python",
            "index": false
        }
    },
    "table_metadata": {}
}
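Because the typing information is plain JSON, it can be inspected without loading the data. The following is a minimal sketch using the standard library; `summarize_typing_info` is a hypothetical helper, and the keys it reads are the ones shown in the example file above.

```python
import json
from pathlib import Path


def summarize_typing_info(directory, typing_info_filename="woodwork_typing_info.json"):
    """Read the typing information file and return a few top-level fields."""
    typing_path = Path(directory) / typing_info_filename
    info = json.loads(typing_path.read_text(encoding="utf-8"))
    loading_info = info["loading_info"]
    return {
        "schema_version": info["schema_version"],
        "name": info.get("name"),
        "index": info.get("index"),
        "time_index": info.get("time_index"),
        "format": loading_info["type"],
        "data_location": loading_info["location"],
    }
```

Calling `summarize_typing_info("retail")` on the directory saved earlier would report, among other fields, the file format and the relative location of the data file.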
Loading a Woodwork DataFrame
After saving a Woodwork DataFrame, you can load the DataFrame and typing information by using woodwork.deserialize.from_disk. This function will use the stored typing information in the specified directory to recreate the Woodwork DataFrame.
If you have modified any of the default values for the filename, data subdirectory or typing information file, you will need to specify those when calling from_disk. Since we did not change any of the defaults for this example, we do not need to specify them here.
[4]:
from woodwork.deserialize import from_disk
df = from_disk("retail")
df.ww.schema
[4]:
Column | Logical Type | Semantic Tag(s) |
---|---|---|
order_product_id | Categorical | ['index'] |
order_id | Categorical | ['category'] |
product_id | Categorical | ['category'] |
description | NaturalLanguage | [] |
quantity | Integer | ['numeric'] |
order_date | Datetime | ['time_index'] |
unit_price | Double | ['numeric'] |
customer_name | Categorical | ['category'] |
country | Categorical | ['category'] |
total | Double | ['numeric'] |
cancelled | Boolean | [] |
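If the defaults had been changed when saving, the same values would have to be passed to from_disk. One way to keep the two calls in sync is to store the custom values once and reuse them. The sketch below assumes the parameter names `filename`, `data_subdirectory`, and `typing_info_filename` (check the API Guide for your installed version); the values, the helper names, and the `loader` argument are made up for illustration.

```python
# Custom layout shared by the save and load calls (illustrative values).
CUSTOM_LAYOUT = {
    "filename": "retail_orders.csv",
    "data_subdirectory": "raw",
    "typing_info_filename": "typing.json",
}


def save_with_layout(df, path, layout=CUSTOM_LAYOUT):
    """Save a Woodwork-initialized DataFrame using non-default names."""
    df.ww.to_disk(path, **layout)


def load_with_layout(path, layout=CUSTOM_LAYOUT, loader=None):
    """Load a DataFrame saved with `save_with_layout`.

    `loader` defaults to woodwork.deserialize.from_disk; it is a parameter
    only so the helper can be exercised without touching the disk.
    """
    if loader is None:
        from woodwork.deserialize import from_disk as loader
    return loader(path, **layout)
```

Keeping the layout in one dictionary means a renamed data file or subdirectory cannot drift between the saving and loading code.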
Loading the DataFrame and typing information separately
You can also use woodwork.read_file to load a Woodwork DataFrame without the typing information. This approach is helpful if you want to quickly get started and let Woodwork infer the typing information based on the underlying data. To illustrate, let’s read the CSV file from the previous example directly into a Woodwork DataFrame.
[5]:
from woodwork import read_file
df = read_file("retail/data/demo_retail_data.csv")
df.ww
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/latest/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
[5]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
order_product_id | int64 | Integer | ['numeric'] |
order_id | int64 | Integer | ['numeric'] |
product_id | string | Unknown | [] |
description | string | NaturalLanguage | [] |
quantity | int64 | Integer | ['numeric'] |
order_date | datetime64[ns] | Datetime | [] |
unit_price | float64 | Double | ['numeric'] |
customer_name | category | Categorical | ['category'] |
country | category | Categorical | ['category'] |
total | float64 | Double | ['numeric'] |
cancelled | bool | Boolean | [] |
The typing information is optional in read_file, but you can still pass the typing information parameters to control how Woodwork is initialized. To illustrate, we will read data files in different formats directly into a Woodwork DataFrame using the following typing information.
[6]:
typing_information = {
    "index": "order_product_id",
    "time_index": "order_date",
    "logical_types": {
        "order_product_id": "Categorical",
        "order_id": "Categorical",
        "product_id": "Categorical",
        "description": "NaturalLanguage",
        "quantity": "Integer",
        "order_date": "Datetime",
        "unit_price": "Double",
        "customer_name": "Categorical",
        "country": "Categorical",
        "total": "Double",
        "cancelled": "Boolean",
    },
    "semantic_tags": {
        "order_id": {"category"},
        "product_id": {"category"},
        "quantity": {"numeric"},
        "unit_price": {"numeric"},
        "customer_name": {"category"},
        "country": {"category"},
        "total": {"numeric"},
    },
}
First, let’s create the data files in different formats from a pandas DataFrame.
[7]:
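import pandas as pd

# Read the CSV written by to_disk, then write copies in other formats.
pandas_df = pd.read_csv("retail/data/demo_retail_data.csv")
# index=False keeps the pandas index from being written as an extra unnamed column.
pandas_df.to_csv("retail/data.csv", index=False)
pandas_df.to_parquet("retail/data.parquet")
pandas_df.to_feather("retail/data.feather")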
Now, you can use read_file to load the data directly into a Woodwork DataFrame based on your typing information. This function uses the content_type parameter to determine the file format. If content_type is not specified, it will try to infer the file format from the file extension.
[8]:
woodwork_df = read_file(
    filepath="retail/data.csv",
    content_type="text/csv",
    **typing_information,
)
woodwork_df = read_file(
    filepath="retail/data.parquet",
    content_type="application/parquet",
    **typing_information,
)
woodwork_df = read_file(
    filepath="retail/data.feather",
    content_type="application/feather",
    **typing_information,
)
woodwork_df.ww
[8]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
order_product_id | category | Categorical | ['index'] |
order_id | category | Categorical | ['category'] |
product_id | category | Categorical | ['category'] |
description | string | NaturalLanguage | [] |
quantity | int64 | Integer | ['numeric'] |
order_date | datetime64[ns] | Datetime | ['time_index'] |
unit_price | float64 | Double | ['numeric'] |
customer_name | category | Categorical | ['category'] |
country | category | Categorical | ['category'] |
total | float64 | Double | ['numeric'] |
cancelled | bool | Boolean | [] |
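The extension fallback described above can be pictured with a small sketch. This is not Woodwork's implementation; `guess_content_type` and the mapping are hypothetical, and the content type strings are the ones used in the calls in this guide.

```python
from pathlib import Path

# Illustrative mapping only; limited to the three formats used in this guide.
EXTENSION_TO_CONTENT_TYPE = {
    ".csv": "text/csv",
    ".parquet": "application/parquet",
    ".feather": "application/feather",
}


def guess_content_type(filepath, content_type=None):
    """Return the explicit content type, or fall back to the file extension."""
    if content_type is not None:
        return content_type
    suffix = Path(filepath).suffix.lower()
    if suffix not in EXTENSION_TO_CONTENT_TYPE:
        raise ValueError(f"Cannot infer a content type from {filepath!r}")
    return EXTENSION_TO_CONTENT_TYPE[suffix]
```

An explicit `content_type` always wins in this sketch, which mirrors the order described in the text: the parameter first, the extension only when the parameter is missing.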