Saving and Loading DataFrames

In this guide, you will learn how to save and load Woodwork DataFrames.

Saving a Woodwork DataFrame

After defining a Woodwork DataFrame with the proper logical types and semantic tags, you can save the DataFrame and its typing information with DataFrame.ww.to_disk. By default, this method creates a directory that contains a data folder and a woodwork_typing_info.json file, but you can specify different values if needed. Refer to the API Guide for details on the parameters that to_disk accepts.

To illustrate, we will use the retail demo DataFrame, which already comes configured with Woodwork typing information.

[1]:
from woodwork.demo import load_retail
df = load_retail(nrows=100)
df.ww.schema
[1]:
Column            Logical Type     Semantic Tag(s)
order_product_id  Categorical      ['index']
order_id          Categorical      ['category']
product_id        Categorical      ['category']
description       NaturalLanguage  []
quantity          Integer          ['numeric']
order_date        Datetime         ['time_index']
unit_price        Double           ['numeric']
customer_name     Categorical      ['category']
country           Categorical      ['category']
total             Double           ['numeric']
cancelled         Boolean          []
[2]:
df.head()
[2]:
   order_product_id  order_id  product_id  description                          quantity  order_date           unit_price  customer_name  country         total   cancelled
0  0                 536365    85123A      WHITE HANGING HEART T-LIGHT HOLDER   6         2010-12-01 08:26:00  4.2075      Andrea Brown   United Kingdom  25.245  False
1  1                 536365    71053       WHITE METAL LANTERN                  6         2010-12-01 08:26:00  5.5935      Andrea Brown   United Kingdom  33.561  False
2  2                 536365    84406B      CREAM CUPID HEARTS COAT HANGER       8         2010-12-01 08:26:00  4.5375      Andrea Brown   United Kingdom  36.300  False
3  3                 536365    84029G      KNITTED UNION FLAG HOT WATER BOTTLE  6         2010-12-01 08:26:00  5.5935      Andrea Brown   United Kingdom  33.561  False
4  4                 536365    84029E      RED WOOLLY HOTTIE WHITE HEART.       6         2010-12-01 08:26:00  5.5935      Andrea Brown   United Kingdom  33.561  False

From the ww accessor, use to_disk to save the Woodwork DataFrame.

[3]:
df.ww.to_disk('retail')

You should see a new directory that contains the data and typing information.

retail
├── data
│   └── demo_retail_data.csv
└── woodwork_typing_info.json

Data Directory

The data directory contains the underlying data written in the specified format. If you do not specify a filename, the method derives it from DataFrame.ww.name. CSV is the default format; you can change it by setting the method’s format parameter to any of the following:

  • csv (default)

  • pickle

  • parquet

  • arrow

  • feather

  • orc

Typing Information

In woodwork_typing_info.json, you can see all of the typing information and metadata associated with the DataFrame. This information includes:

  • the version of the schema at the time of saving the DataFrame

  • the DataFrame name specified by DataFrame.ww.name

  • the column names for the index and time index

  • the column typing information, which contains the logical types with their parameters and semantic tags for each column

  • the loading information required for the DataFrame type and file format

  • the table metadata provided by DataFrame.ww.metadata (must be JSON serializable)

{
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [...],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {
            "compression": null,
            "sep": ",",
            "encoding": "utf-8",
            "engine": "python",
            "index": false
        }
    },
    "table_metadata": {}
}
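Since the typing information is plain JSON, it can be inspected with the standard library alone. The sketch below writes a placeholder file that mirrors the fields shown above (the elided column_typing_info list is left empty rather than guessed) and then reads back where the data file lives. The temporary directory is an assumption for the sake of a self-contained example.

```python
import json
import tempfile
from pathlib import Path

# Placeholder typing info mirroring the fields shown above.
# "column_typing_info" is elided in the guide, so it is left empty here.
sample_typing_info = {
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {"compression": None, "sep": ",", "encoding": "utf-8"},
    },
    "table_metadata": {},
}

# Write the placeholder file into a temporary stand-in for the "retail" directory.
root = Path(tempfile.mkdtemp()) / "retail"
root.mkdir()
typing_info_path = root / "woodwork_typing_info.json"
typing_info_path.write_text(json.dumps(sample_typing_info, indent=4))

# Read it back and resolve the data file location relative to the directory.
typing_info = json.loads(typing_info_path.read_text())
loading_info = typing_info["loading_info"]
data_path = root / loading_info["location"]

print("schema version:", typing_info["schema_version"])
print("index / time index:", typing_info["index"], "/", typing_info["time_index"])
print("data file:", data_path.name, "format:", loading_info["type"])
```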

Loading a Woodwork DataFrame

After saving a Woodwork DataFrame, you can load the DataFrame and typing information by using woodwork.deserialize.from_disk. This function will use the stored typing information in the specified directory to recreate the Woodwork DataFrame.

If you have modified any of the default values for the filename, data subdirectory or typing information file, you will need to specify those when calling from_disk. Since we did not change any of the defaults for this example, we do not need to specify them here.

[4]:
from woodwork.deserialize import from_disk
df = from_disk('retail')
df.ww.schema
[4]:
Column            Logical Type     Semantic Tag(s)
order_product_id  Categorical      ['index']
order_id          Categorical      ['category']
product_id        Categorical      ['category']
description       NaturalLanguage  []
quantity          Integer          ['numeric']
order_date        Datetime         ['time_index']
unit_price        Double           ['numeric']
customer_name     Categorical      ['category']
country           Categorical      ['category']
total             Double           ['numeric']
cancelled         Boolean          []

Loading the DataFrame and typing information separately

You can also use woodwork.read_file to load a Woodwork DataFrame without the typing information. This approach is helpful if you want to quickly get started and let Woodwork infer the typing information based on the underlying data. To illustrate, let’s read the CSV file from the previous example directly into a Woodwork DataFrame.

[5]:
from woodwork import read_file

df = read_file('retail/data/demo_retail_data.csv')
df.ww
[5]:
Column            Physical Type    Logical Type     Semantic Tag(s)
order_product_id  int64            Integer          ['numeric']
order_id          int64            Integer          ['numeric']
product_id        string[pyarrow]  Unknown          []
description       string[pyarrow]  NaturalLanguage  []
quantity          int64            Integer          ['numeric']
order_date        datetime64[ns]   Datetime         []
unit_price        float64          Double           ['numeric']
customer_name     category         Categorical      ['category']
country           category         Categorical      ['category']
total             float64          Double           ['numeric']
cancelled         bool             Boolean          []

The typing information is optional in read_file, so you can still pass the typing information parameters to control how Woodwork is initialized. To illustrate, we will read data files in different formats directly into a Woodwork DataFrame using the following typing information.

[6]:
typing_information = {
    'index': 'order_product_id',
    'time_index': 'order_date',
    'logical_types': {
        'order_product_id': 'Categorical',
        'order_id': 'Categorical',
        'product_id': 'Categorical',
        'description': 'NaturalLanguage',
        'quantity': 'Integer',
        'order_date': 'Datetime',
        'unit_price': 'Double',
        'customer_name': 'Categorical',
        'country': 'Categorical',
        'total': 'Double',
        'cancelled': 'Boolean',
    },
    'semantic_tags': {
        'order_id': {'category'},
        'product_id': {'category'},
        'quantity': {'numeric'},
        'unit_price': {'numeric'},
        'customer_name': {'category'},
        'country': {'category'},
        'total': {'numeric'},
    },
}

First, let’s create the data files in different formats from a pandas DataFrame.

[7]:
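import pandas as pd

pandas_df = pd.read_csv('retail/data/demo_retail_data.csv')
# index=False keeps the pandas row index out of the CSV, so the file has
# the same columns as the parquet and feather copies.
pandas_df.to_csv('retail/data.csv', index=False)
pandas_df.to_parquet('retail/data.parquet')
pandas_df.to_feather('retail/data.feather')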

Now, you can use read_file to load the data directly into a Woodwork DataFrame based on your typing information. This function uses the content_type parameter to determine the file format. If content_type is not specified, it will try to infer the file format from the file extension.

[8]:
woodwork_df = read_file(
    filepath='retail/data.csv',
    content_type='text/csv',
    **typing_information,
)

woodwork_df = read_file(
    filepath='retail/data.parquet',
    content_type='application/parquet',
    **typing_information,
)

woodwork_df = read_file(
    filepath='retail/data.feather',
    content_type='application/feather',
    **typing_information,
)

woodwork_df.ww
[8]:
Column            Physical Type    Logical Type     Semantic Tag(s)
order_product_id  category         Categorical      ['index']
order_id          category         Categorical      ['category']
product_id        category         Categorical      ['category']
description       string[pyarrow]  NaturalLanguage  []
quantity          int64            Integer          ['numeric']
order_date        datetime64[ns]   Datetime         ['time_index']
unit_price        float64          Double           ['numeric']
customer_name     category         Categorical      ['category']
country           category         Categorical      ['category']
total             float64          Double           ['numeric']
cancelled         bool             Boolean          []