In this guide, you will learn how to save and load Woodwork DataFrames.
After defining a Woodwork DataFrame with the proper logical types and semantic tags, you can save the DataFrame and its typing information by using DataFrame.ww.to_disk. This method creates a directory that contains a data folder and a woodwork_typing_info.json file. To illustrate, we will use the retail demo DataFrame, which already comes configured with Woodwork typing information.
[1]:
from woodwork.demo import load_retail

df = load_retail(nrows=100)
df.ww.schema
[2]:
df.head()
From the ww accessor, use to_disk to save the Woodwork DataFrame.
[3]:
df.ww.to_disk('retail')
You should see a new directory that contains the data and typing information.
retail
├── data
│   └── demo_retail_data.csv
└── woodwork_typing_info.json
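To confirm this layout programmatically, a small standard-library sketch such as the following may help. The helper name `list_saved_files` is hypothetical, and the snippet builds an empty stand-in directory so that it runs without Woodwork installed; in practice you would point it at the directory written by to_disk.

```python
import tempfile
from pathlib import Path


def list_saved_files(root):
    """Return the relative paths of all files under a saved table directory."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Stand-in for the output of df.ww.to_disk('retail'): empty files that only
# mimic the directory layout described above.
with tempfile.TemporaryDirectory() as tmp:
    retail = Path(tmp) / "retail"
    (retail / "data").mkdir(parents=True)
    (retail / "data" / "demo_retail_data.csv").touch()
    (retail / "woodwork_typing_info.json").touch()
    saved_files = list_saved_files(retail)

print(saved_files)
```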
The data directory contains the underlying data written in the specified format. The method derives the filename from DataFrame.ww.name and uses CSV as the default format. You can change the format by setting the method’s format parameter to any of the following formats:
csv (default)
pickle
parquet
In woodwork_typing_info.json, you can see all of the typing information and metadata associated with the DataFrame. This information includes:
the version of the schema at the time of saving the DataFrame
the DataFrame name specified by DataFrame.ww.name
the column names for the index and time index
the column typing information, which contains the logical types with their parameters and semantic tags for each column
the loading information required for the DataFrame type and file format
the table metadata provided by DataFrame.ww.metadata (must be JSON serializable)
{
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [...],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {
            "compression": null,
            "sep": ",",
            "encoding": "utf-8",
            "engine": "python",
            "index": false
        }
    },
    "table_metadata": {}
}
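Because the table metadata is written into this JSON file, it has to be JSON serializable. One way to check a metadata dictionary before saving is sketched below using only the standard library. The helper `is_json_serializable` is hypothetical and is not part of Woodwork, whose own validation may differ.

```python
import json
from datetime import datetime


def is_json_serializable(metadata):
    """Return True if json.dumps can encode the metadata dictionary."""
    try:
        json.dumps(metadata)
    except (TypeError, ValueError):
        return False
    return True


# Plain strings and numbers encode without problems.
ok_metadata = {"source": "demo", "row_count": 100}
# datetime objects are not encodable by the default JSON encoder.
bad_metadata = {"created": datetime(2021, 1, 1)}

ok_result = is_json_serializable(ok_metadata)
bad_result = is_json_serializable(bad_metadata)
print(ok_result, bad_result)
```

Values that fail the check can usually be converted to strings or numbers before being stored in the metadata.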
After saving a Woodwork DataFrame, you can load the DataFrame and typing information by using woodwork.deserialize.read_woodwork_table. This function will use the stored typing information in the specified directory to recreate the Woodwork DataFrame.
[4]:
from woodwork.deserialize import read_woodwork_table

df = read_woodwork_table('retail')
df.ww.schema
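To get a feel for what the loader has to work with, the stored typing information can also be inspected directly with the standard library. The sketch below is an illustration only: `describe_saved_table` is a hypothetical helper, the JSON is an abbreviated copy of the example shown earlier (column entries omitted), and it is written to a temporary directory so the snippet runs without Woodwork installed.

```python
import json
import tempfile
from pathlib import Path

# Abbreviated copy of the typing information from this guide; the
# per-column entries are omitted here.
typing_info = {
    "schema_version": "10.0.2",
    "name": "demo_retail_data",
    "index": "order_product_id",
    "time_index": "order_date",
    "column_typing_info": [],
    "loading_info": {
        "table_type": "pandas",
        "location": "data/demo_retail_data.csv",
        "type": "csv",
        "params": {"sep": ",", "encoding": "utf-8"},
    },
    "table_metadata": {},
}


def describe_saved_table(directory):
    """Read woodwork_typing_info.json and summarize how the data is stored."""
    with open(Path(directory) / "woodwork_typing_info.json") as f:
        info = json.load(f)
    loading = info["loading_info"]
    return {
        "name": info["name"],
        "index": info["index"],
        "time_index": info["time_index"],
        "format": loading["type"],
        "data_file": loading["location"],  # relative to the saved directory
    }


with tempfile.TemporaryDirectory() as tmp:
    with open(Path(tmp) / "woodwork_typing_info.json", "w") as f:
        json.dump(typing_info, f)
    summary = describe_saved_table(tmp)

print(summary)
```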
You can also use woodwork.read_file to load a Woodwork DataFrame without the typing information. This approach is helpful if you want to quickly get started and let Woodwork infer the typing information based on the underlying data. To illustrate, let’s read the CSV file from the previous example directly into a Woodwork DataFrame.
[5]:
from woodwork import read_file

df = read_file('retail/data/demo_retail_data.csv')
df.ww
Typing information is optional in read_file, so you can still pass the typing parameters to control how Woodwork is initialized. To illustrate, we will read data files in different formats directly into a Woodwork DataFrame using the following typing information.
[6]:
typing_information = {
    'index': 'order_product_id',
    'time_index': 'order_date',
    'logical_types': {
        'order_product_id': 'Categorical',
        'order_id': 'Categorical',
        'product_id': 'Categorical',
        'description': 'NaturalLanguage',
        'quantity': 'Integer',
        'order_date': 'Datetime',
        'unit_price': 'Double',
        'customer_name': 'Categorical',
        'country': 'Categorical',
        'total': 'Double',
        'cancelled': 'Boolean',
    },
    'semantic_tags': {
        'order_id': {'category'},
        'product_id': {'category'},
        'quantity': {'numeric'},
        'unit_price': {'numeric'},
        'customer_name': {'category'},
        'country': {'category'},
        'total': {'numeric'},
    },
}
First, let’s create the data files in different formats from a pandas DataFrame.
[7]:
import pandas as pd

pandas_df = pd.read_csv('retail/data/demo_retail_data.csv')

# index=False keeps pandas from writing its row index as an extra unnamed column
pandas_df.to_csv('retail/data.csv', index=False)
pandas_df.to_parquet('retail/data.parquet')
pandas_df.to_feather('retail/data.feather')
Now, you can use read_file to load the data directly into a Woodwork DataFrame based on your typing information. This function uses the content_type parameter to determine the file format. If content_type is not specified, it will try to infer the file format from the file extension.
[8]:
woodwork_df = read_file(
    filepath='retail/data.csv',
    content_type='text/csv',
    **typing_information,
)

woodwork_df = read_file(
    filepath='retail/data.parquet',
    content_type='application/parquet',
    **typing_information,
)

woodwork_df = read_file(
    filepath='retail/data.feather',
    content_type='application/feather',
    **typing_information,
)

woodwork_df.ww
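The fallback from content_type to the file extension can be pictured with a short sketch. This is not Woodwork's implementation: `guess_content_type` and the mapping are hypothetical, and the mapping lists only the three content types used in this guide.

```python
from pathlib import Path

# Hypothetical mapping for illustration; Woodwork's internal table may differ
# and may cover more formats.
EXTENSION_TO_CONTENT_TYPE = {
    ".csv": "text/csv",
    ".parquet": "application/parquet",
    ".feather": "application/feather",
}


def guess_content_type(filepath, content_type=None):
    """Prefer an explicit content_type, else fall back to the file extension."""
    if content_type is not None:
        return content_type
    suffix = Path(filepath).suffix.lower()
    try:
        return EXTENSION_TO_CONTENT_TYPE[suffix]
    except KeyError:
        raise ValueError(
            f"Cannot infer content type from extension {suffix!r}"
        ) from None


print(guess_content_type("retail/data.parquet"))
print(guess_content_type("retail/data.bin", content_type="text/csv"))
```

Passing content_type explicitly remains the safer choice when a file has a missing or unconventional extension.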