Saving and Loading DataFrames

In this guide, you will learn how to save and load Woodwork DataFrames.

Saving a Woodwork DataFrame

After defining a Woodwork DataFrame with the proper logical types and semantic tags, you can save the DataFrame and its typing information with DataFrame.ww.to_disk. By default, this method creates a directory that contains a data folder and a woodwork_typing_info.json file; you can specify different values if needed. Refer to the API Guide for the full list of parameters that to_disk accepts.

To illustrate, we will use the retail demo DataFrame, which comes preconfigured with Woodwork typing information.

[1]:

from woodwork.demo import load_retail
df = load_retail(nrows=100)
df.ww.schema

[1]:

	Logical Type	Semantic Tag(s)
Column
order_product_id	Categorical	['index']
order_id	Categorical	['category']
product_id	Categorical	['category']
description	NaturalLanguage	[]
quantity	Integer	['numeric']
order_date	Datetime	['time_index']
unit_price	Double	['numeric']
customer_name	Categorical	['category']
country	Categorical	['category']
total	Double	['numeric']
cancelled	Boolean	[]

[2]:

df.head()

[2]:

	order_product_id	order_id	product_id	description	quantity	order_date	unit_price	customer_name	country	total	cancelled
0	0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	4.2075	Andrea Brown	United Kingdom	25.245	False
1	1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False
2	2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	4.5375	Andrea Brown	United Kingdom	36.300	False
3	3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False
4	4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False

From the ww accessor, use to_disk to save the Woodwork DataFrame.

[3]:

df.ww.to_disk('retail')

You should see a new directory that contains the data and typing information.

retail
├── data
│   └── demo_retail_data.csv
└── woodwork_typing_info.json

Data Directory

The data directory contains the underlying data written in the specified format. If you do not specify a filename, the method derives it from DataFrame.ww.name and uses CSV as the default format. You can change the format by setting the method’s format parameter to any of the following values:

csv (default)
pickle
parquet
arrow
feather
orc

Typing Information

In woodwork_typing_info.json, you can see all of the typing information and metadata associated with the DataFrame. This information includes:

the version of the schema at the time of saving the DataFrame
the DataFrame name specified by DataFrame.ww.name
the column names for the index and time index
the column typing information, which contains the logical types with their parameters and semantic tags for each column
the loading information required for the DataFrame type and file format
the table metadata provided by DataFrame.ww.metadata (must be JSON serializable)

{
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [...],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {
            "compression": null,
            "sep": ",",
            "encoding": "utf-8",
            "engine": "python",
            "index": false
        }
    },
    "table_metadata": {}
}
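
Because the typing information is plain JSON, it can be inspected without Woodwork. The sketch below writes an abridged copy of the listing above to a temporary directory and reads a few fields back with the standard library. The helper summarize_typing_info is a hypothetical name introduced here, and column_typing_info is left empty because the listing elides it.

```python
import json
import tempfile
from pathlib import Path

# Abridged copy of the typing information shown above.
sample = {
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [],  # elided in the original listing
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {"compression": None, "sep": ",", "encoding": "utf-8",
                   "engine": "python", "index": False},
    },
    "table_metadata": {},
}


def summarize_typing_info(directory):
    """Return (name, index, time_index, data path, format) for a saved directory."""
    directory = Path(directory)
    with open(directory / "woodwork_typing_info.json") as f:
        info = json.load(f)
    loading = info["loading_info"]
    # "location" is stored relative to the saved directory.
    data_path = directory / loading["location"]
    return info["name"], info["index"], info["time_index"], data_path, loading["type"]


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "woodwork_typing_info.json").write_text(json.dumps(sample))
    name, index, time_index, data_path, file_format = summarize_typing_info(tmp)
    relative_data_path = data_path.relative_to(tmp).as_posix()

print(name, index, time_index, relative_data_path, file_format)
```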

Loading a Woodwork DataFrame

After saving a Woodwork DataFrame, you can load the DataFrame and typing information by using woodwork.deserialize.from_disk. This function will use the stored typing information in the specified directory to recreate the Woodwork DataFrame.

If you have modified any of the default values for the filename, data subdirectory or typing information file, you will need to specify those when calling from_disk. Since we did not change any of the defaults for this example, we do not need to specify them here.

[4]:

from woodwork.deserialize import from_disk
df = from_disk('retail')
df.ww.schema

[4]:

	Logical Type	Semantic Tag(s)
Column
order_product_id	Categorical	['index']
order_id	Categorical	['category']
product_id	Categorical	['category']
description	NaturalLanguage	[]
quantity	Integer	['numeric']
order_date	Datetime	['time_index']
unit_price	Double	['numeric']
customer_name	Categorical	['category']
country	Categorical	['category']
total	Double	['numeric']
cancelled	Boolean	[]

Loading the DataFrame and typing information separately

You can also use woodwork.read_file to load a Woodwork DataFrame without the typing information. This approach is helpful if you want to quickly get started and let Woodwork infer the typing information based on the underlying data. To illustrate, let’s read the CSV file from the previous example directly into a Woodwork DataFrame.

[5]:

from woodwork import read_file

df = read_file('retail/data/demo_retail_data.csv')
df.ww

[5]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_product_id	int64	Integer	['numeric']
order_id	int64	Integer	['numeric']
product_id	string[pyarrow]	Unknown	[]
description	string[pyarrow]	NaturalLanguage	[]
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]

The typing information parameters of read_file are optional, so you can still pass them to control how Woodwork is initialized. To illustrate, we will read data files in different formats directly into a Woodwork DataFrame using the following typing information.

[6]:

typing_information = {
    'index': 'order_product_id',
    'time_index': 'order_date',
    'logical_types': {
        'order_product_id': 'Categorical',
        'order_id': 'Categorical',
        'product_id': 'Categorical',
        'description': 'NaturalLanguage',
        'quantity': 'Integer',
        'order_date': 'Datetime',
        'unit_price': 'Double',
        'customer_name': 'Categorical',
        'country': 'Categorical',
        'total': 'Double',
        'cancelled': 'Boolean',
    },
    'semantic_tags': {
        'order_id': {'category'},
        'product_id': {'category'},
        'quantity': {'numeric'},
        'unit_price': {'numeric'},
        'customer_name': {'category'},
        'country': {'category'},
        'total': {'numeric'},
    },
}

First, let’s create the data files in different formats from a pandas DataFrame.

[7]:

import pandas as pd

pandas_df = pd.read_csv('retail/data/demo_retail_data.csv')
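# index=False keeps the pandas row index out of the CSV, so no extra
# "Unnamed: 0" column appears when the file is read back.
pandas_df.to_csv('retail/data.csv', index=False)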
pandas_df.to_parquet('retail/data.parquet')
pandas_df.to_feather('retail/data.feather')

Now, you can use read_file to load the data directly into a Woodwork DataFrame based on your typing information. This function uses the content_type parameter to determine the file format. If content_type is not specified, it will try to infer the file format from the file extension.

[8]:

woodwork_df = read_file(
    filepath='retail/data.csv',
    content_type='text/csv',
    **typing_information,
)

woodwork_df = read_file(
    filepath='retail/data.parquet',
    content_type='application/parquet',
    **typing_information,
)

woodwork_df = read_file(
    filepath='retail/data.feather',
    content_type='application/feather',
    **typing_information,
)

woodwork_df.ww

[8]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_product_id	category	Categorical	['index']
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	string[pyarrow]	NaturalLanguage	[]
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	['time_index']
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
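
The fallback from content_type to the file extension can be pictured with the sketch below. This is not Woodwork's actual implementation: guess_content_type and the lookup table are hypothetical, and the content types listed are simply the ones used in the read_file calls above.

```python
from pathlib import Path

# Hypothetical lookup table for this sketch only.
EXTENSION_TO_CONTENT_TYPE = {
    ".csv": "text/csv",
    ".parquet": "application/parquet",
    ".feather": "application/feather",
}


def guess_content_type(filepath, content_type=None):
    """Return the explicit content_type, or fall back to the file extension."""
    if content_type is not None:
        return content_type
    suffix = Path(filepath).suffix.lower()
    try:
        return EXTENSION_TO_CONTENT_TYPE[suffix]
    except KeyError:
        raise ValueError(f"cannot infer a content type for {filepath!r}") from None


print(guess_content_type("retail/data.parquet"))  # prints application/parquet
```

Passing content_type explicitly, as in the cells above, avoids relying on the extension when a file is named unconventionally.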
