Start

In this guide, we walk through an example of creating a Woodwork DataTable and show how to update logical types and how to add and remove semantic tags. We also demonstrate how to use the typing information to select subsets of data.

Types and Tags

Woodwork relies heavily on the concepts of physical types, logical types and semantic tags. These concepts are covered in detail in Understanding Types and Tags, but brief definitions of each are provided here for reference:

  • Physical Type: defines how the data is stored on disk or in memory

  • Logical Type: defines how the data should be parsed or interpreted

  • Semantic Tag(s): provide additional information about the meaning of the data or how it should be used
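
To make the three concepts concrete, here is a minimal sketch of the typing information attached to a single column. The `ColumnTyping` class is a hypothetical stand-in written for this illustration; it is not part of Woodwork.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnTyping:
    """Hypothetical stand-in for the typing information of one column."""
    physical_type: str  # how the values are stored, e.g. "float64"
    logical_type: str   # how the values should be interpreted, e.g. "Double"
    semantic_tags: set = field(default_factory=set)  # extra meaning, e.g. {"numeric"}


# A column stored as float64, interpreted as a Double, and tagged as numeric.
total = ColumnTyping("float64", "Double", {"numeric"})

# Semantic tags can accumulate: the same column may also represent a currency amount.
total.semantic_tags.add("currency")
print(total)
```

A column has exactly one physical type and one logical type, but it can carry any number of semantic tags, which is why the tags are modeled as a set here.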

Let’s demonstrate how to use Woodwork, starting off by creating a dataframe containing retail sales data.

[1]:
import woodwork as ww

data = ww.demo.load_retail(nrows=100, return_dataframe=True)
data.head(5)
[1]:
order_product_id order_id product_id description quantity order_date unit_price customer_name country total cancelled
0 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 4.2075 Andrea Brown United Kingdom 25.245 False
1 1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 5.5935 Andrea Brown United Kingdom 33.561 False
2 2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 4.5375 Andrea Brown United Kingdom 36.300 False
3 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 5.5935 Andrea Brown United Kingdom 33.561 False
4 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 5.5935 Andrea Brown United Kingdom 33.561 False

As we can see, this is a dataframe containing several different data types, including dates, categorical values, numeric values and natural language descriptions. Let’s use Woodwork to create a DataTable from this data.

Creating a DataTable

Creating a Woodwork DataTable is as simple as passing in a dataframe with the data of interest during initialization. An optional name parameter can be specified to label the DataTable.

[2]:
dt = ww.DataTable(data, name="retail")
dt
[2]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id Int64 Integer ['numeric']
order_id Int64 Integer ['numeric']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity Int64 Integer ['numeric']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name string NaturalLanguage []
country string NaturalLanguage []
total float64 Double ['numeric']
cancelled boolean Boolean []

Using just this simple call, Woodwork was able to infer the logical types present in our data by analyzing the dataframe dtypes as well as the information contained in the columns. Woodwork also added semantic tags to some of the columns based on the logical types that were inferred.
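
The tagging step can be pictured as a lookup from logical type to a set of standard tags. The sketch below is a toy model, not Woodwork's implementation: the `STANDARD_TAGS` mapping is copied from the `standard_tags` column of the `list_logical_types()` output at the end of this guide, and `standard_tags_for` is a hypothetical helper.

```python
# Standard tags per logical type, copied from the list_logical_types() output in this guide.
STANDARD_TAGS = {
    "Integer": {"numeric"},
    "Double": {"numeric"},
    "Categorical": {"category"},
    "NaturalLanguage": set(),
    "Datetime": set(),
    "Boolean": set(),
}


def standard_tags_for(logical_type):
    """Return a fresh set of the standard tags for a logical type (hypothetical helper)."""
    return set(STANDARD_TAGS[logical_type])


# Logical types as inferred for a few of the retail columns above.
inferred = {"order_id": "Integer", "description": "NaturalLanguage", "unit_price": "Double"}
tags = {column: standard_tags_for(lt) for column, lt in inferred.items()}
print(tags)
```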

We can also view the typing information along with the first few rows of data with the following:

[3]:
dt.head()
[3]:
Data Column order_product_id order_id product_id description quantity order_date unit_price customer_name country total cancelled
Physical Type Int64 Int64 category string Int64 datetime64[ns] float64 string string float64 boolean
Logical Type Integer Integer Categorical NaturalLanguage Integer Datetime Double NaturalLanguage NaturalLanguage Double Boolean
Semantic Tag(s) ['numeric'] ['numeric'] ['category'] [] ['numeric'] [] ['numeric'] [] [] ['numeric'] []
0 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 4.2075 Andrea Brown United Kingdom 25.245 False
1 1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 5.5935 Andrea Brown United Kingdom 33.561 False
2 2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 4.5375 Andrea Brown United Kingdom 36.300 False
3 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 5.5935 Andrea Brown United Kingdom 33.561 False
4 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 5.5935 Andrea Brown United Kingdom 33.561 False

Updating Logical Types

If the initial inference was not to our liking, the logical type can be changed to a more appropriate value. Let’s change some of the columns to a different logical type to illustrate this process. Below we will set the logical type for the quantity, customer_name and country columns to be Categorical.

[4]:
dt = dt.set_types(logical_types={
    'quantity': 'Categorical',
    'customer_name': 'Categorical',
    'country': 'Categorical'
})
dt
[4]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id Int64 Integer ['numeric']
order_id Int64 Integer ['numeric']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity category Categorical ['category']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name category Categorical ['category']
country category Categorical ['category']
total float64 Double ['numeric']
cancelled boolean Boolean []

If we now inspect the information in the types output, we can see that the logical type for the three columns has been updated to the Categorical logical type we specified. Their physical type is now category, and they carry the category semantic tag.
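
The guide reassigns the result of `set_types` to `dt`, and the physical type changes together with the logical type. The following toy sketch mimics both behaviors with plain dictionaries. The `PHYSICAL_TYPE` mapping is taken from the outputs shown in this guide, and `set_logical_types` is a hypothetical function, not Woodwork's code.

```python
# Physical type used for each logical type, taken from the outputs in this guide.
PHYSICAL_TYPE = {
    "Integer": "Int64",
    "Double": "float64",
    "Categorical": "category",
    "NaturalLanguage": "string",
}


def set_logical_types(schema, updates):
    """Return a new schema with updated types; the input schema is left unchanged."""
    new_schema = dict(schema)
    for column, logical_type in updates.items():
        # The physical type follows from the newly assigned logical type.
        new_schema[column] = (PHYSICAL_TYPE[logical_type], logical_type)
    return new_schema


schema = {
    "quantity": ("Int64", "Integer"),
    "country": ("string", "NaturalLanguage"),
    "total": ("float64", "Double"),
}
updated = set_logical_types(schema, {"quantity": "Categorical", "country": "Categorical"})
print(updated)
```

Returning a new dictionary instead of mutating the input keeps the original schema available for comparison, which is a design choice of this sketch only.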

Selecting Columns

Now that we have logical types we are happy with, we can select a subset of the columns based on their logical types. Let’s select only the columns that have a logical type of Integer or Double:

[5]:
numeric_dt = dt.select(['Integer', 'Double'])
numeric_dt
[5]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id Int64 Integer ['numeric']
order_id Int64 Integer ['numeric']
unit_price float64 Double ['numeric']
total float64 Double ['numeric']

This selection process has returned a new DataTable containing only the columns that match the logical types we specified. After we have selected the columns we want, we can also access a dataframe containing just those columns if we need it for additional analysis.

[6]:
numeric_dt.to_dataframe()
[6]:
order_product_id order_id unit_price total
0 0 536365 4.2075 25.245
1 1 536365 5.5935 33.561
2 2 536365 4.5375 36.300
3 3 536365 5.5935 33.561
4 4 536365 5.5935 33.561
... ... ... ... ...
95 95 536378 4.2075 25.245
96 96 536378 0.6930 83.160
97 97 536378 0.9075 21.780
98 98 536378 0.9075 21.780
99 99 536378 0.9075 21.780

100 rows × 4 columns

Note

Accessing the dataframe associated with a DataTable by using dt.to_dataframe() will return a reference to the dataframe. Modifications to the returned dataframe can cause unexpected results. If you need to modify the dataframe, you should use dt.to_dataframe().copy() to return a copy of the stored dataframe that can be safely modified without impacting the DataTable behavior.
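
The difference between a reference and a copy can be illustrated without Woodwork at all. In this sketch, the `Table` class and its `to_rows` method are hypothetical; they only mimic a container that hands out a reference to its stored data.

```python
import copy


class Table:
    """Hypothetical container that returns a reference to its stored rows."""

    def __init__(self, rows):
        self._rows = rows

    def to_rows(self):
        return self._rows  # a reference, not a copy


table = Table([{"total": 25.245}, {"total": 33.561}])

# Modifying the returned reference also changes the data stored inside `table`.
alias = table.to_rows()
alias[0]["total"] = 0.0

# Modifying a deep copy leaves the data stored inside `table` untouched.
safe = copy.deepcopy(table.to_rows())
safe[1]["total"] = -1.0

print(table.to_rows())
```

After running this, the first stored row has been overwritten through `alias`, while the change made through `safe` is not visible in `table`.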

Adding Semantic Tags

Next, let’s add semantic tags to some of the columns. We will add the tag of product_details to the description column and tag the total column with currency.

[7]:
dt = dt.set_types(semantic_tags={'description':'product_details', 'total': 'currency'})
dt
[7]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id Int64 Integer ['numeric']
order_id Int64 Integer ['numeric']
product_id category Categorical ['category']
description string NaturalLanguage ['product_details']
quantity category Categorical ['category']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name category Categorical ['category']
country category Categorical ['category']
total float64 Double ['numeric', 'currency']
cancelled boolean Boolean []

We can also select columns based on a semantic tag. For example, we might want to select only the columns tagged with category:

[8]:
category_dt = dt.select('category')
category_dt
[8]:
Physical Type Logical Type Semantic Tag(s)
Data Column
product_id category Categorical ['category']
quantity category Categorical ['category']
customer_name category Categorical ['category']
country category Categorical ['category']

We can also select columns using multiple semantic tags, or even a mixture of semantic tags and logical types:

[9]:
category_numeric_dt = dt.select(['numeric', 'category'])
category_numeric_dt
[9]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id Int64 Integer ['numeric']
order_id Int64 Integer ['numeric']
product_id category Categorical ['category']
quantity category Categorical ['category']
unit_price float64 Double ['numeric']
customer_name category Categorical ['category']
country category Categorical ['category']
total float64 Double ['numeric', 'currency']
[10]:
mixed_dt = dt.select(['Boolean', 'product_details'])
mixed_dt
[10]:
Physical Type Logical Type Semantic Tag(s)
Data Column
description string NaturalLanguage ['product_details']
cancelled boolean Boolean []
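
The selection rule used in the last few cells can be sketched as follows: a column is kept if its logical type, or any of its semantic tags, appears in the requested list. This is a toy model over a plain dictionary, and the `select` function here is a hypothetical illustration rather than Woodwork's implementation.

```python
# Column name -> (logical type, semantic tags), a subset of the table shown above.
columns = {
    "order_id": ("Integer", {"numeric"}),
    "description": ("NaturalLanguage", {"product_details"}),
    "country": ("Categorical", {"category"}),
    "total": ("Double", {"numeric", "currency"}),
    "cancelled": ("Boolean", set()),
}


def select(columns, include):
    """Return names of columns whose logical type or any semantic tag is in `include`."""
    if isinstance(include, str):
        include = [include]  # accept a single name as well as a list
    wanted = set(include)
    return [
        name
        for name, (logical_type, tags) in columns.items()
        if logical_type in wanted or tags & wanted
    ]


print(select(columns, "category"))                      # ['country']
print(select(columns, ["numeric", "category"]))         # ['order_id', 'country', 'total']
print(select(columns, ["Boolean", "product_details"]))  # ['description', 'cancelled']
```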

To select an individual column, we just need to specify the column name. We can then access the data in the DataColumn using the to_series method:

[11]:
dc = dt['total']
dc
[11]:
<DataColumn: total (Physical Type = float64) (Logical Type = Double) (Semantic Tags = {'numeric', 'currency'})>
[12]:
dc.to_series()
[12]:
0     25.245
1     33.561
2     36.300
3     33.561
4     33.561
       ...
95    25.245
96    83.160
97    21.780
98    21.780
99    21.780
Name: total, Length: 100, dtype: float64

You can also access multiple columns by supplying a list of column names:

[13]:
multiple_cols_dt = dt[['product_id', 'total', 'unit_price']]
multiple_cols_dt
[13]:
Physical Type Logical Type Semantic Tag(s)
Data Column
product_id category Categorical ['category']
total float64 Double ['numeric', 'currency']
unit_price float64 Double ['numeric']

Removing Semantic Tags

We can also remove specific semantic tags from a column if they are no longer needed. Let’s remove the product_details tag from the description column:

[14]:
dt = dt.remove_semantic_tags({'description':'product_details'})
dt
[14]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id Int64 Integer ['numeric']
order_id Int64 Integer ['numeric']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity category Categorical ['category']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name category Categorical ['category']
country category Categorical ['category']
total float64 Double ['numeric', 'currency']
cancelled boolean Boolean []

Notice how the product_details tag has now been removed from the description column. If we want to remove all user-added semantic tags from all columns, we can do that as well:

[15]:
dt = dt.reset_semantic_tags()
dt
[15]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id Int64 Integer ['numeric']
order_id Int64 Integer ['numeric']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity category Categorical ['category']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name category Categorical ['category']
country category Categorical ['category']
total float64 Double ['numeric']
cancelled boolean Boolean []
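
The two operations differ in scope: removing a tag targets one named tag on one column, while resetting drops every user-added tag and keeps only the standard tags for each logical type. A minimal sketch of that distinction, using hypothetical helper functions over a plain dictionary (not Woodwork's code):

```python
# Standard tags per logical type, as listed in the list_logical_types() output in this guide.
STANDARD_TAGS = {"Double": {"numeric"}, "NaturalLanguage": set()}

columns = {
    "description": {"logical_type": "NaturalLanguage", "tags": {"product_details"}},
    "total": {"logical_type": "Double", "tags": {"numeric", "currency"}},
}


def remove_tags(columns, to_remove):
    """Remove one named tag from each listed column."""
    for name, tag in to_remove.items():
        columns[name]["tags"].discard(tag)


def reset_tags(columns):
    """Drop all user-added tags, keeping only the standard tags of each logical type."""
    for info in columns.values():
        info["tags"] = set(STANDARD_TAGS[info["logical_type"]])


remove_tags(columns, {"description": "product_details"})
after_remove = {name: set(info["tags"]) for name, info in columns.items()}

reset_tags(columns)
after_reset = {name: set(info["tags"]) for name, info in columns.items()}

print(after_remove)
print(after_reset)
```

After the removal, `total` still carries its user-added `currency` tag; only the reset takes it away.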

Set Index and Time Index

At any point, we can designate certain columns as the DataTable’s index and time index with the methods set_index and set_time_index. These methods can be used to assign these columns for the first time or to change the column being used as the index or time index.

Index and time index columns contain index and time_index semantic tags, respectively.

[16]:
dt = dt.set_index('order_product_id')
dt.index
[16]:
'order_product_id'
[17]:
dt = dt.set_time_index('order_date')
dt.time_index
[17]:
'order_date'
[18]:
dt
[18]:
Physical Type Logical Type Semantic Tag(s)
Data Column
order_product_id Int64 Integer ['index']
order_id Int64 Integer ['numeric']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity category Categorical ['category']
order_date datetime64[ns] Datetime ['time_index']
unit_price float64 Double ['numeric']
customer_name category Categorical ['category']
country category Categorical ['category']
total float64 Double ['numeric']
cancelled boolean Boolean []

List Logical Types

We can also list all the logical types available in Woodwork. This listing is useful for understanding the logical types and how they will be interpreted.

[19]:
from woodwork.type_sys.utils import list_logical_types

list_logical_types()
[19]:
name type_string description physical_type standard_tags is_default_type is_registered parent_type
0 Boolean boolean Represents Logical Types that contain binary v... boolean {} True True None
1 Categorical categorical Represents Logical Types that contain unordere... category {category} True True None
2 CountryCode country_code Represents Logical Types that contain categori... category {category} True True Categorical
3 Datetime datetime Represents Logical Types that contain date and... datetime64[ns] {} True True None
4 Double double Represents Logical Types that contain positive... float64 {numeric} True True None
5 EmailAddress email_address Represents Logical Types that contain email ad... string {} True True NaturalLanguage
6 Filepath filepath Represents Logical Types that specify location... string {} True True NaturalLanguage
7 FullName full_name Represents Logical Types that may contain firs... string {} True True NaturalLanguage
8 IPAddress ip_address Represents Logical Types that contain IP addre... string {} True True NaturalLanguage
9 Integer integer Represents Logical Types that contain positive... Int64 {numeric} True True None
10 LatLong lat_long Represents Logical Types that contain latitude... object {} True True None
11 NaturalLanguage natural_language Represents Logical Types that contain text or ... string {} True True None
12 Ordinal ordinal Represents Logical Types that contain ordered ... category {category} True True Categorical
13 PhoneNumber phone_number Represents Logical Types that contain numeric ... string {} True True NaturalLanguage
14 SubRegionCode sub_region_code Represents Logical Types that contain codes re... category {category} True True Categorical
15 Timedelta timedelta Represents Logical Types that contain values s... timedelta64[ns] {} True True None
16 URL url Represents Logical Types that contain URLs, wh... string {} True True NaturalLanguage
17 ZIPCode zip_code Represents Logical Types that contain a series... category {category} True True Categorical
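
The parent_type column describes a small hierarchy: several logical types are specializations of Categorical or NaturalLanguage. The sketch below copies those relationships from the table above into a dictionary and walks the chain of parents. The `ancestors` function is a hypothetical helper for this illustration.

```python
# Child -> parent relationships, copied from the parent_type column above.
# Types that are absent from this mapping have no parent.
PARENT_TYPE = {
    "CountryCode": "Categorical",
    "Ordinal": "Categorical",
    "SubRegionCode": "Categorical",
    "ZIPCode": "Categorical",
    "EmailAddress": "NaturalLanguage",
    "Filepath": "NaturalLanguage",
    "FullName": "NaturalLanguage",
    "IPAddress": "NaturalLanguage",
    "PhoneNumber": "NaturalLanguage",
    "URL": "NaturalLanguage",
}


def ancestors(logical_type):
    """Return the chain of parent types, nearest parent first."""
    chain = []
    parent = PARENT_TYPE.get(logical_type)
    while parent is not None:
        chain.append(parent)
        parent = PARENT_TYPE.get(parent)
    return chain


print(ancestors("ZIPCode"))  # ['Categorical']
print(ancestors("URL"))      # ['NaturalLanguage']
print(ancestors("Boolean"))  # []
```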