Get Started — Woodwork 0.0.11 documentation

Types and Tags

Woodwork relies heavily on the concepts of physical types, logical types and semantic tags. These concepts are covered in detail in Understanding Types and Tags, but we provide brief definitions here for reference:

Physical Type: defines how the data is stored on disk or in memory.
Logical Type: defines how the data should be parsed or interpreted.
Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.
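
As a rough illustration of how these three pieces of information relate, the sketch below models the typing information for a single column in plain Python. The `ColumnTyping` class and its field names are hypothetical and are not part of Woodwork. The values mirror what this guide shows for the `total` column.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnTyping:
    """Hypothetical container for one column's typing information (not a Woodwork class)."""
    name: str
    physical_type: str  # how the values are stored, e.g. a pandas dtype name
    logical_type: str   # how the values should be parsed or interpreted
    semantic_tags: set = field(default_factory=set)  # extra meaning or intended use


total = ColumnTyping(
    name="total",
    physical_type="float64",
    logical_type="Double",
    semantic_tags={"numeric"},
)

# Tagging the column as a currency amount changes how it should be used,
# not how it is stored or parsed.
total.semantic_tags.add("currency")
print(total.physical_type, total.logical_type, sorted(total.semantic_tags))
```

The point of the separation is that one physical type can back several logical types, and semantic tags can be layered on top without touching either.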

Start learning how to use Woodwork by creating a dataframe that contains retail sales data.

[1]:

import woodwork as ww

data = ww.demo.load_retail(nrows=100, return_dataframe=True)
data.head(5)

[1]:

	order_product_id	order_id	product_id	description	quantity	order_date	unit_price	customer_name	country	total	cancelled
0	0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	4.2075	Andrea Brown	United Kingdom	25.245	False
1	1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False
2	2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	4.5375	Andrea Brown	United Kingdom	36.300	False
3	3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False
4	4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False

As you can see, this is a dataframe containing several different data types, including dates, categorical values, numeric values, and natural language descriptions. Next, use Woodwork to create a DataTable from this data.

Creating a DataTable

To create a Woodwork DataTable, pass a dataframe containing the data of interest during initialization. You can specify an optional name parameter to label the DataTable.

[2]:

dt = ww.DataTable(data, name="retail")
dt

[2]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	Int64	Integer	['numeric']
order_id	Int64	Integer	['numeric']
product_id	category	Categorical	['category']
description	string	NaturalLanguage	[]
quantity	Int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	string	NaturalLanguage	[]
country	string	NaturalLanguage	[]
total	float64	Double	['numeric']
cancelled	boolean	Boolean	[]

With this single call, Woodwork inferred the logical types present in the data by analyzing the dataframe dtypes and the information contained in the columns. Woodwork also added semantic tags to some of the columns based on the inferred logical types.
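
To make the idea of inference followed by automatic tagging more concrete, here is a minimal sketch in plain Python. The inference rule is a deliberately simplified assumption and is not Woodwork's actual algorithm. The `infer_logical_type` and `describe` helpers are hypothetical, and the standard tags are taken from the logical type listing at the end of this guide.

```python
# Standard tags per logical type, as shown in the logical type listing in this guide.
STANDARD_TAGS = {
    "Integer": {"numeric"},
    "Double": {"numeric"},
    "Categorical": {"category"},
    "Boolean": set(),
    "NaturalLanguage": set(),
}


def _is_number(value, kinds):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, kinds) and not isinstance(value, bool)


def infer_logical_type(values):
    """Toy inference rule (an assumption, not Woodwork's algorithm)."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, bool) for v in present):
        return "Boolean"
    if present and all(_is_number(v, int) for v in present):
        return "Integer"
    if present and all(_is_number(v, (int, float)) for v in present):
        return "Double"
    return "NaturalLanguage"


def describe(values):
    """Return the inferred logical type and the standard tags that go with it."""
    logical_type = infer_logical_type(values)
    return logical_type, set(STANDARD_TAGS[logical_type])


print(describe([6, 6, 8]))
print(describe([4.2075, 5.5935]))
print(describe(["Andrea Brown", "United Kingdom"]))
```

The sketch separates the two steps the paragraph describes: first a logical type is chosen from the data, then the tags follow from that logical type alone.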

You can also view the typing information along with the first few rows of data.

[3]:

dt.head()

[3]:

Data Column	order_product_id	order_id	product_id	description	quantity	order_date	unit_price	customer_name	country	total	cancelled
Physical Type	Int64	Int64	category	string	Int64	datetime64[ns]	float64	string	string	float64	boolean
Logical Type	Integer	Integer	Categorical	NaturalLanguage	Integer	Datetime	Double	NaturalLanguage	NaturalLanguage	Double	Boolean
Semantic Tag(s)	['numeric']	['numeric']	['category']	[]	['numeric']	[]	['numeric']	[]	[]	['numeric']	[]
0	0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	4.2075	Andrea Brown	United Kingdom	25.245	False
1	1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False
2	2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	4.5375	Andrea Brown	United Kingdom	36.300	False
3	3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False
4	4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	5.5935	Andrea Brown	United Kingdom	33.561	False

Updating Logical Types

If the inferred logical types are not appropriate for your data, you can change them. To illustrate the process, set the logical type of the quantity, customer_name, and country columns to Categorical.

[4]:

dt = dt.set_types(logical_types={
    'quantity': 'Categorical',
    'customer_name': 'Categorical',
    'country': 'Categorical'
})
dt

[4]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	Int64	Integer	['numeric']
order_id	Int64	Integer	['numeric']
product_id	category	Categorical	['category']
description	string	NaturalLanguage	[]
quantity	category	Categorical	['category']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	boolean	Boolean	[]

Inspect the typing information in the output. The logical type of the three columns is now Categorical, and their physical type and semantic tags have changed to category to match.

Selecting Columns

Now that you’ve prepared logical types, you can select a subset of the columns based on their logical types. Select only the columns that have a logical type of Integer or Double.

[5]:

numeric_dt = dt.select(['Integer', 'Double'])
numeric_dt

[5]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	Int64	Integer	['numeric']
order_id	Int64	Integer	['numeric']
unit_price	float64	Double	['numeric']
total	float64	Double	['numeric']

This selection process has returned a new DataTable containing only the columns that match the logical types you specified. After you have selected the columns you want, you can also access a dataframe containing just those columns if you need it for additional analysis.

[6]:

numeric_dt.to_dataframe()

[6]:

	order_product_id	order_id	unit_price	total
0	0	536365	4.2075	25.245
1	1	536365	5.5935	33.561
2	2	536365	4.5375	36.300
3	3	536365	5.5935	33.561
4	4	536365	5.5935	33.561
...	...	...	...	...
95	95	536378	4.2075	25.245
96	96	536378	0.6930	83.160
97	97	536378	0.9075	21.780
98	98	536378	0.9075	21.780
99	99	536378	0.9075	21.780

100 rows × 4 columns

Note

Accessing the dataframe associated with a DataTable by using dt.to_dataframe() returns a reference to the dataframe. Modifications to the returned dataframe can cause unexpected results. If you need to modify the dataframe, you should use dt.to_dataframe().copy() to return a copy of the stored dataframe that can be safely modified without impacting the DataTable behavior.
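
The difference between a reference and a copy can be seen in a small, library-free sketch. `ToyTable` and `to_rows` are hypothetical names used only for illustration. They stand in for a table that stores data alongside metadata derived from it, and are not Woodwork classes or methods.

```python
import copy


class ToyTable:
    """Hypothetical stand-in for a table that stores data and derived metadata."""

    def __init__(self, rows):
        self._rows = rows
        # Metadata computed once at creation time, like typing information.
        self.column_names = sorted(rows[0])

    def to_rows(self):
        # Returns a reference to the stored data, not a copy.
        return self._rows


table = ToyTable([{"total": 25.245}, {"total": 33.561}])

# Mutating the returned reference changes the table's own data, and the
# metadata computed earlier no longer matches it.
rows = table.to_rows()
rows[0]["discount"] = 0.1
print(table.to_rows()[0])
print(table.column_names)

# Working on a deep copy leaves the stored data untouched.
safe_rows = copy.deepcopy(table.to_rows())
safe_rows[1]["total"] = 0.0
print(table.to_rows()[1]["total"])
```

After the first mutation the stored row has a key that the recorded column names do not mention, which is the kind of mismatch the note warns about.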

Adding Semantic Tags

Next, let’s add semantic tags to some of the columns. Add the tag of product_details to the description column, and tag the total column with currency.

[7]:

dt = dt.set_types(semantic_tags={'description': 'product_details', 'total': 'currency'})
dt

[7]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	Int64	Integer	['numeric']
order_id	Int64	Integer	['numeric']
product_id	category	Categorical	['category']
description	string	NaturalLanguage	['product_details']
quantity	category	Categorical	['category']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric', 'currency']
cancelled	boolean	Boolean	[]

Select columns based on a semantic tag. Only select the columns tagged with category.

[8]:

category_dt = dt.select('category')
category_dt

[8]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
product_id	category	Categorical	['category']
quantity	category	Categorical	['category']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']

Select columns using multiple semantic tags or a mixture of semantic tags and logical types.

[9]:

category_numeric_dt = dt.select(['numeric', 'category'])
category_numeric_dt

[9]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	Int64	Integer	['numeric']
order_id	Int64	Integer	['numeric']
product_id	category	Categorical	['category']
quantity	category	Categorical	['category']
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric', 'currency']

[10]:

mixed_dt = dt.select(['Boolean', 'product_details'])
mixed_dt

[10]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
description	string	NaturalLanguage	['product_details']
cancelled	boolean	Boolean	[]
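
The selection results above can be reproduced with a small model of the matching rule: a column is kept if its logical type or any of its semantic tags appears in the requested list. This is a sketch under that assumption, not Woodwork's implementation. The `COLUMNS` mapping and the standalone `select` function are hypothetical, with the typing information copied from the output of cell [7].

```python
# Logical type and semantic tags per column, copied from the output of cell [7].
COLUMNS = {
    "order_product_id": ("Integer", {"numeric"}),
    "order_id": ("Integer", {"numeric"}),
    "product_id": ("Categorical", {"category"}),
    "description": ("NaturalLanguage", {"product_details"}),
    "quantity": ("Categorical", {"category"}),
    "order_date": ("Datetime", set()),
    "unit_price": ("Double", {"numeric"}),
    "customer_name": ("Categorical", {"category"}),
    "country": ("Categorical", {"category"}),
    "total": ("Double", {"numeric", "currency"}),
    "cancelled": ("Boolean", set()),
}


def select(columns, include):
    """Return names of columns whose logical type or any tag is in `include`."""
    if isinstance(include, str):
        include = [include]
    wanted = set(include)
    return [
        name
        for name, (logical_type, tags) in columns.items()
        if logical_type in wanted or tags & wanted
    ]


print(select(COLUMNS, "category"))
print(select(COLUMNS, ["Boolean", "product_details"]))
```

Because the rule is a union over the requested values, adding more logical types or tags to the list can only widen the selection, never narrow it.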

To select an individual column, specify the column name. You can then access the data in the resulting DataColumn using the to_series method.

[11]:

dc = dt['total']
dc

[11]:

<DataColumn: total (Physical Type = float64) (Logical Type = Double) (Semantic Tags = {'numeric', 'currency'})>

[12]:

dc.to_series()

[12]:

0     25.245
1     33.561
2     36.300
3     33.561
4     33.561
       ...
95    25.245
96    83.160
97    21.780
98    21.780
99    21.780
Name: total, Length: 100, dtype: float64

Access multiple columns by supplying a list of column names.

[13]:

multiple_cols_dt = dt[['product_id', 'total', 'unit_price']]
multiple_cols_dt

[13]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
product_id	category	Categorical	['category']
total	float64	Double	['numeric', 'currency']
unit_price	float64	Double	['numeric']

Removing Semantic Tags

Remove specific semantic tags from a column if they are no longer needed. In this example, remove the product_details tag from the description column.

[14]:

dt = dt.remove_semantic_tags({'description': 'product_details'})
dt

[14]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	Int64	Integer	['numeric']
order_id	Int64	Integer	['numeric']
product_id	category	Categorical	['category']
description	string	NaturalLanguage	[]
quantity	category	Categorical	['category']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric', 'currency']
cancelled	boolean	Boolean	[]

Notice that the product_details tag has been removed from the description column. To remove all user-added semantic tags from every column at once, use reset_semantic_tags.

[15]:

dt = dt.reset_semantic_tags()
dt

[15]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	Int64	Integer	['numeric']
order_id	Int64	Integer	['numeric']
product_id	category	Categorical	['category']
description	string	NaturalLanguage	[]
quantity	category	Categorical	['category']
order_date	datetime64[ns]	Datetime	[]
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	boolean	Boolean	[]
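
The outputs above suggest that a column's tags are a combination of standard tags implied by its logical type and tags added by the user, and that a reset keeps only the former. The sketch below models that behavior in plain Python. `ToyColumn` and its methods are hypothetical and are not Woodwork's API.

```python
# Standard tags for the two logical types used here, as listed in this guide.
STANDARD_TAGS = {"Double": {"numeric"}, "NaturalLanguage": set()}


class ToyColumn:
    """Hypothetical column that tracks semantic tags (not a Woodwork class)."""

    def __init__(self, logical_type):
        self.logical_type = logical_type
        # A new column starts with the standard tags of its logical type.
        self.semantic_tags = set(STANDARD_TAGS[logical_type])

    def add_tags(self, *tags):
        self.semantic_tags |= set(tags)

    def remove_tags(self, *tags):
        self.semantic_tags -= set(tags)

    def reset_tags(self):
        # Drop user-added tags by returning to the standard tags.
        self.semantic_tags = set(STANDARD_TAGS[self.logical_type])


total = ToyColumn("Double")
total.add_tags("currency")
print(sorted(total.semantic_tags))

total.reset_tags()
print(sorted(total.semantic_tags))
```

In this model the reset needs no record of which tags the user added, because the standard tags can always be recomputed from the logical type.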

Set Index and Time Index

At any point, you can designate certain columns as the DataTable’s index or time_index with the methods set_index and set_time_index. These methods can be used to assign these columns for the first time or to change the column being used as the index or time index.

The index and time index columns are given the index and time_index semantic tags, respectively.

[16]:

dt = dt.set_index('order_product_id')
dt.index

[16]:

'order_product_id'

[17]:

dt = dt.set_time_index('order_date')
dt.time_index

[17]:

'order_date'

[18]:

dt

[18]:

	Physical Type	Logical Type	Semantic Tag(s)
Data Column
order_product_id	Int64	Integer	['index']
order_id	Int64	Integer	['numeric']
product_id	category	Categorical	['category']
description	string	NaturalLanguage	[]
quantity	category	Categorical	['category']
order_date	datetime64[ns]	Datetime	['time_index']
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	boolean	Boolean	[]

List Logical Types

List all the logical types available in Woodwork. The listing shows each type's physical type, standard tags, and parent type, which helps clarify how each logical type is interpreted.

[19]:

from woodwork.type_sys.utils import list_logical_types

list_logical_types()

[19]:

	name	type_string	description	physical_type	standard_tags	is_default_type	is_registered	parent_type
0	Boolean	boolean	Represents Logical Types that contain binary v...	boolean	{}	True	True	None
1	Categorical	categorical	Represents Logical Types that contain unordere...	category	{category}	True	True	None
2	CountryCode	country_code	Represents Logical Types that contain categori...	category	{category}	True	True	Categorical
3	Datetime	datetime	Represents Logical Types that contain date and...	datetime64[ns]	{}	True	True	None
4	Double	double	Represents Logical Types that contain positive...	float64	{numeric}	True	True	None
5	EmailAddress	email_address	Represents Logical Types that contain email ad...	string	{}	True	True	NaturalLanguage
6	Filepath	filepath	Represents Logical Types that specify location...	string	{}	True	True	NaturalLanguage
7	FullName	full_name	Represents Logical Types that may contain firs...	string	{}	True	True	NaturalLanguage
8	IPAddress	ip_address	Represents Logical Types that contain IP addre...	string	{}	True	True	NaturalLanguage
9	Integer	integer	Represents Logical Types that contain positive...	Int64	{numeric}	True	True	None
10	LatLong	lat_long	Represents Logical Types that contain latitude...	object	{}	True	True	None
11	NaturalLanguage	natural_language	Represents Logical Types that contain text or ...	string	{}	True	True	None
12	Ordinal	ordinal	Represents Logical Types that contain ordered ...	category	{category}	True	True	Categorical
13	PhoneNumber	phone_number	Represents Logical Types that contain numeric ...	string	{}	True	True	NaturalLanguage
14	SubRegionCode	sub_region_code	Represents Logical Types that contain codes re...	category	{category}	True	True	Categorical
15	Timedelta	timedelta	Represents Logical Types that contain values s...	timedelta64[ns]	{}	True	True	None
16	URL	url	Represents Logical Types that contain URLs, wh...	string	{}	True	True	NaturalLanguage
17	ZIPCode	zip_code	Represents Logical Types that contain a series...	category	{category}	True	True	Categorical