Get Started

In this guide, you walk through examples where you initialize Woodwork on a DataFrame and on a Series. Along the way, you learn how to update and remove logical types and semantic tags. You also learn how to use typing information to select subsets of data.

Types and Tags

Woodwork relies heavily on the concepts of physical types, logical types and semantic tags. These concepts are covered in detail in Working with Types and Tags, but we provide brief definitions here for reference:

Physical Type: defines how the data is stored on disk or in memory.
Logical Type: defines how the data should be parsed or interpreted.
Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.

Start learning how to use Woodwork by reading in a dataframe that contains retail sales data.

[1]:

import pandas as pd

df = pd.read_csv("https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv")
df["order_product_id"] = range(df.shape[0])
df.head(5)

[1]:

	order_id	product_id	description	quantity	order_date	unit_price	customer_name	country	total	cancelled	order_product_id
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	4.2075	Andrea Brown	United Kingdom	25.245	False	0
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	1
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	4.5375	Andrea Brown	United Kingdom	36.300	False	2
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	3
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	4

As you can see, this is a dataframe containing several different data types, including dates, categorical values, numeric values, and natural language descriptions. Next, initialize Woodwork on this DataFrame.

Initializing Woodwork on a DataFrame

Importing Woodwork creates a special namespace on your DataFrames, DataFrame.ww, that can be used to set or update the typing information for the DataFrame. As long as Woodwork has been imported, initializing Woodwork on a DataFrame is as simple as calling .ww.init() on the DataFrame of interest. An optional name parameter can be specified to label the data.

[2]:

import woodwork as ww

df.ww.init(name="retail")
df.ww

/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(

[2]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
order_product_id	int64	Integer	['numeric']

With this single call, Woodwork inferred the logical types present in the data by analyzing the DataFrame dtypes as well as the information contained in the columns. Woodwork also added semantic tags to some of the columns based on the logical types that were inferred.

Warning

The Woodwork accessor holds only a weak reference to its DataFrame. Because of this, chaining a Woodwork call onto another call that creates a new DataFrame or Series object can fail: nothing else refers to the new object, so it may be garbage collected before the Woodwork call completes.

Instead of calling pd.DataFrame({'id':[1, 2, 3]}).ww.init(), first store the DataFrame in a new variable and then initialize Woodwork:

df = pd.DataFrame({'id':[1, 2, 3]})
df.ww.init()
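To see why chaining is risky, here is a small standard-library sketch of the weak-reference pattern. `Frame` and `Accessor` are hypothetical stand-ins, not the actual Woodwork or pandas classes, and the behavior shown relies on CPython's reference counting.

```python
import weakref


class Frame:
    """Hypothetical stand-in for a DataFrame."""


class Accessor:
    """Hypothetical accessor that holds only a weak reference to its frame."""

    def __init__(self, frame):
        self._frame_ref = weakref.ref(frame)

    @property
    def frame(self):
        # Returns None once the referenced object has been garbage collected.
        return self._frame_ref()


# Chained: the temporary Frame has no other strong reference,
# so it is collected as soon as the constructor call returns.
chained = Accessor(Frame())

# Stored: the variable `kept` keeps the Frame alive.
kept = Frame()
stored = Accessor(kept)

print(chained.frame is None)
print(stored.frame is kept)
```

Storing the object in a variable first, as recommended above, keeps a strong reference alive for as long as the accessor needs it.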

All Woodwork methods and properties can be accessed through the ww namespace on the DataFrame. DataFrame methods called from the Woodwork namespace will be passed to the DataFrame, and whenever possible, Woodwork will be initialized on the returned object, assuming it is a Series or a DataFrame.

As an example, use the head method to create a new DataFrame containing the first 5 rows of the original data, with Woodwork typing information retained.

[3]:

head_df = df.ww.head(5)
head_df.ww

[3]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
order_product_id	int64	Integer	['numeric']

[4]:

head_df

[4]:

	order_id	product_id	description	quantity	order_date	unit_price	customer_name	country	total	cancelled	order_product_id
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	4.2075	Andrea Brown	United Kingdom	25.245	False	0
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	1
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	4.5375	Andrea Brown	United Kingdom	36.300	False	2
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	3
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False	4

Note

Once Woodwork is initialized on a DataFrame, it is recommended to go through the ww namespace when performing DataFrame operations to avoid invalidating Woodwork’s typing information.

Updating Logical Types

If the initial inference is not what you want, you can change a column's logical type to a more appropriate one. To illustrate the process, set the logical type of the order_product_id and country columns to Categorical, and set customer_name to a logical type of PersonFullName.

[5]:

df.ww.set_types(
    logical_types={
        "customer_name": "PersonFullName",
        "country": "Categorical",
        "order_product_id": "Categorical",
    }
)
df.ww.types

[5]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	string	PersonFullName	[]
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
order_product_id	category	Categorical	['category']

Inspect the types output. The logical types for customer_name and order_product_id have been updated to the values you specified, and their physical types changed to match. The country column was already inferred as Categorical, so its row is unchanged.

Selecting Columns

Now that you’ve prepared logical types, you can select a subset of the columns based on their logical types. Select only the columns that have a logical type of Integer or Double.

[6]:

numeric_df = df.ww.select(["Integer", "Double"])
numeric_df.ww

[6]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
quantity	int64	Integer	['numeric']
unit_price	float64	Double	['numeric']
total	float64	Double	['numeric']

This selection process has returned a new Woodwork DataFrame containing only the columns that match the logical types you specified. After you have selected the columns you want, you can use the DataFrame containing just those columns as you normally would for any additional analysis.

[7]:

numeric_df

[7]:

	quantity	unit_price	total
0	6	4.2075	25.2450
1	6	5.5935	33.5610
2	8	4.5375	36.3000
3	6	5.5935	33.5610
4	6	5.5935	33.5610
...	...	...	...
401599	12	1.4025	16.8300
401600	6	3.4650	20.7900
401601	4	6.8475	27.3900
401602	4	6.8475	27.3900
401603	3	8.1675	24.5025

401604 rows × 3 columns

Adding Semantic Tags

Next, let’s add semantic tags to some of the columns. Add the tag of product_details to the description column, and tag the total column with currency.

[8]:

df.ww.set_types(semantic_tags={"description": "product_details", "total": "currency"})
df.ww

[8]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category', 'product_details']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	string	PersonFullName	[]
country	category	Categorical	['category']
total	float64	Double	['currency', 'numeric']
cancelled	bool	Boolean	[]
order_product_id	category	Categorical	['category']

Select columns based on a semantic tag. Only select the columns tagged with category.

[9]:

category_df = df.ww.select("category")
category_df.ww

[9]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category', 'product_details']
country	category	Categorical	['category']
order_product_id	category	Categorical	['category']

Select columns using multiple semantic tags or a mixture of semantic tags and logical types.

[10]:

category_numeric_df = df.ww.select(["numeric", "category"])
category_numeric_df.ww

[10]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category', 'product_details']
quantity	int64	Integer	['numeric']
unit_price	float64	Double	['numeric']
country	category	Categorical	['category']
total	float64	Double	['currency', 'numeric']
order_product_id	category	Categorical	['category']

[11]:

mixed_df = df.ww.select(["Boolean", "product_details"])
mixed_df.ww

[11]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
description	category	Categorical	['category', 'product_details']
cancelled	bool	Boolean	[]

To select an individual column, specify the column name. Woodwork will be initialized on the returned Series and you can use the Series for additional analysis as needed.

[12]:

total = df.ww["total"]
total.ww

[12]:

<Series: total (Physical Type = float64) (Logical Type = Double) (Semantic Tags = {'currency', 'numeric'})>

[13]:

total

[13]:

0         25.2450
1         33.5610
2         36.3000
3         33.5610
4         33.5610
           ...
401599    16.8300
401600    20.7900
401601    27.3900
401602    27.3900
401603    24.5025
Name: total, Length: 401604, dtype: float64

Select multiple columns by supplying a list of column names.

[14]:

multiple_cols_df = df.ww[["product_id", "total", "unit_price"]]
multiple_cols_df.ww

[14]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
product_id	category	Categorical	['category']
total	float64	Double	['currency', 'numeric']
unit_price	float64	Double	['numeric']

Removing Semantic Tags

Remove specific semantic tags from a column if they are no longer needed. In this example, remove the product_details tag from the description column.

[15]:

df.ww.remove_semantic_tags({"description": "product_details"})
df.ww

[15]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	string	PersonFullName	[]
country	category	Categorical	['category']
total	float64	Double	['currency', 'numeric']
cancelled	bool	Boolean	[]
order_product_id	category	Categorical	['category']

Notice how the product_details tag has been removed from the description column. If you want to remove all user-added semantic tags from all columns, you can do that, too.

[16]:

df.ww.reset_semantic_tags()
df.ww

[16]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	string	PersonFullName	[]
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
order_product_id	category	Categorical	['category']

Set Index and Time Index

At any point, you can designate certain columns as the Woodwork index or time_index with the methods set_index and set_time_index. These methods can be used to assign these columns for the first time or to change the column being used as the index or time index.

Index and time index columns contain index and time_index semantic tags, respectively.

[17]:

df.ww.set_index("order_product_id")
df.ww.index

[17]:

'order_product_id'

[18]:

df.ww.set_time_index("order_date")
df.ww.time_index

[18]:

'order_date'

[19]:

df.ww

[19]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	category	Categorical	['category']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	['time_index']
unit_price	float64	Double	['numeric']
customer_name	string	PersonFullName	[]
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]
order_product_id	category	Categorical	['index']

Using Woodwork with a Series

Woodwork can also be used to store typing information on a Series. There are two approaches for initializing Woodwork on a Series, depending on whether or not the Series dtype is the same as the physical type associated with the LogicalType. For more information on logical types and physical types, refer to Working with Types and Tags.

If your Series dtype matches the physical type associated with the specified or inferred LogicalType, Woodwork can be initialized through the ww namespace, just as with DataFrames.

[20]:

series = pd.Series([1, 2, 3], dtype="int64")
series.ww.init(logical_type="Integer")
series.ww

[20]:

<Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'numeric'})>

In the example above, we specified the Integer LogicalType for the Series. Because Integer has a physical type of int64 and this matches the dtype used to create the Series, no Series dtype conversion was needed and the initialization succeeds.

In cases where the LogicalType requires the Series dtype to change, a helper function ww.init_series must be used. This function will return a new Series object with Woodwork initialized and the dtype of the series changed to match the physical type of the LogicalType.

To demonstrate this case, first create a Series with a string dtype. Then initialize a Woodwork Series with a Categorical logical type using the init_series function. Because Categorical uses a physical type of category, the dtype of the Series must be changed, which is why init_series is required here.

The returned Series has Woodwork initialized, with the LogicalType set to Categorical and the dtype converted to category.

[21]:

string_series = pd.Series(["a", "b", "a"], dtype="string")
ww_series = ww.init_series(string_series, logical_type="Categorical")
ww_series.ww

[21]:

<Series: None (Physical Type = category) (Logical Type = Categorical) (Semantic Tags = {'category'})>

As with DataFrames, Woodwork provides several methods that can be used to update or change the typing information associated with the series. As an example, add a new semantic tag to the series.

[22]:

series.ww.add_semantic_tags("new_tag")
series.ww

[22]:

<Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'new_tag', 'numeric'})>

As you can see from the output above, the specified tag has been added to the semantic tags for the series.

You can also access Series properties and methods through the Woodwork namespace. When possible, Woodwork typing information will be retained on the value returned. As an example, you can access the Series shape property through Woodwork.

[23]:

series.ww.shape

[23]:

(3,)

You can also call Series methods such as sample. In this case, Woodwork typing information is retained on the Series returned by the sample method.

[24]:

sample_series = series.ww.sample(2)
sample_series.ww

[24]:

<Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'numeric', 'new_tag'})>

[25]:

sample_series

[25]:

1    2
2    3
dtype: int64

List Logical Types

Retrieve all the Logical Types present in Woodwork. These can be useful for understanding the Logical Types, as well as how they are interpreted.

[26]:

from woodwork.type_sys.utils import list_logical_types

list_logical_types()

[26]:

	name	type_string	description	physical_type	standard_tags	is_default_type	is_registered	parent_type
0	Address	address	Represents Logical Types that contain address ...	string	{}	True	True	None
1	Age	age	Represents Logical Types that contain whole nu...	int64	{numeric}	True	True	Integer
2	AgeFractional	age_fractional	Represents Logical Types that contain non-nega...	float64	{numeric}	True	True	Double
3	AgeNullable	age_nullable	Represents Logical Types that contain whole nu...	Int64	{numeric}	True	True	IntegerNullable
4	Boolean	boolean	Represents Logical Types that contain binary v...	bool	{}	True	True	BooleanNullable
5	BooleanNullable	boolean_nullable	Represents Logical Types that contain binary v...	boolean	{}	True	True	None
6	Categorical	categorical	Represents Logical Types that contain unordere...	category	{category}	True	True	None
7	CountryCode	country_code	Represents Logical Types that use the ISO-3166...	category	{category}	True	True	Categorical
8	CurrencyCode	currency_code	Represents Logical Types that use the ISO-4217...	category	{category}	True	True	Categorical
9	Datetime	datetime	Represents Logical Types that contain date and...	datetime64[ns]	{}	True	True	None
10	Double	double	Represents Logical Types that contain positive...	float64	{numeric}	True	True	None
11	EmailAddress	email_address	Represents Logical Types that contain email ad...	string	{}	True	True	Unknown
12	Filepath	filepath	Represents Logical Types that specify location...	string	{}	True	True	None
13	IPAddress	ip_address	Represents Logical Types that contain IP addre...	string	{}	True	True	Unknown
14	Integer	integer	Represents Logical Types that contain positive...	int64	{numeric}	True	True	IntegerNullable
15	IntegerNullable	integer_nullable	Represents Logical Types that contain positive...	Int64	{numeric}	True	True	None
16	LatLong	lat_long	Represents Logical Types that contain latitude...	object	{}	True	True	None
17	NaturalLanguage	natural_language	Represents Logical Types that contain text or ...	string	{}	True	True	None
18	Ordinal	ordinal	Represents Logical Types that contain ordered ...	category	{category}	True	True	Categorical
19	PersonFullName	person_full_name	Represents Logical Types that may contain firs...	string	{}	True	True	None
20	PhoneNumber	phone_number	Represents Logical Types that contain numeric ...	string	{}	True	True	Unknown
21	PostalCode	postal_code	Represents Logical Types that contain a series...	category	{category}	True	True	Categorical
22	SubRegionCode	sub_region_code	Represents Logical Types that use the ISO-3166...	category	{category}	True	True	Categorical
23	Timedelta	timedelta	Represents Logical Types that contain values s...	timedelta64[ns]	{}	True	True	Unknown
24	URL	url	Represents Logical Types that contain URLs, wh...	string	{}	True	True	Unknown
25	Unknown	unknown	Represents Logical Types that cannot be inferr...	string	{}	True	True	None