Home

Woodwork

Woodwork is a library that helps with data typing of 2-dimensional tabular data structures.

It provides a special namespace on your DataFrame, ww, which contains the physical, logical, and semantic data types. It can be used with Featuretools, EvalML, and general machine learning applications where logical and semantic typing information is important.

Woodwork provides simple interfaces for adding and updating logical and semantic typing information, as well as selecting data columns based on the types.

Quick Start

Below is an example of using Woodwork to automatically infer the Logical Types for a DataFrame and select columns with specific types.

[1]:
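import woodwork as ww

# Load the first 100 rows of the demo retail dataset without initializing Woodwork
df = ww.demo.load_retail(nrows=100, init_woodwork=False)

# Infer the Logical Types and attach the typing information under the `ww` namespace
df.ww.init(name="retail")
df.ww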
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:40: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
[1]:
Column            Physical Type   Logical Type     Semantic Tag(s)
order_product_id  int64           Integer          ['numeric']
order_id          int64           Integer          ['numeric']
product_id        string          Unknown          []
description       string          NaturalLanguage  []
quantity          int64           Integer          ['numeric']
order_date        datetime64[ns]  Datetime         []
unit_price        float64         Double           ['numeric']
customer_name     category        Categorical      ['category']
country           category        Categorical      ['category']
total             float64         Double           ['numeric']
cancelled         bool            Boolean          []
[2]:
# Keep only columns with the "numeric" semantic tag or the Boolean Logical Type
filtered_df = df.ww.select(include=["numeric", "Boolean"])
filtered_df.head(5)
[2]:
   order_product_id  order_id  quantity  unit_price   total  cancelled
0                 0    536365         6      4.2075  25.245      False
1                 1    536365         6      5.5935  33.561      False
2                 2    536365         8      4.5375  36.300      False
3                 3    536365         6      5.5935  33.561      False
4                 4    536365         6      5.5935  33.561      False

Table of contents