Understanding Logical Types and Semantic Tags

In a Woodwork DataFrame, each column has three pieces of typing information associated with it: a physical type, a logical type, and semantic tags.

This guide walks through every logical type and semantic tag that Woodwork defines so that users can choose the ones that most closely describe their data. As a reminder, here are quick definitions of Woodwork’s types:

  • Physical Type: defines how the data is stored on disk or in memory.

  • Logical Type: defines how the data should be parsed or interpreted.

  • Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.

Woodwork will attempt to infer a column’s LogicalType if none is supplied at initialization. A column’s logical type will then inform which physical type and standard semantic tags are applied to it. However, setting the types manually will allow for more accurate typing of a DataFrame.

Having accurate typing information on a Woodwork DataFrame impacts how the data is parsed, transformed, and later interpreted downstream of Woodwork initialization. Therefore, understanding Woodwork’s logical types and semantic tags is essential to downstream usage of Woodwork.

For an in-depth guide on how to set and manipulate these types, see the Working with Types and Tags guide.

For information on how to customize Woodwork’s type system, see the Custom Types and Inference guide.

It’s important to remember that Woodwork columns will always have a logical type and that any semantic tags that are added by Woodwork are meant to add additional meaning onto that logical type. We’ll start out by looking in-depth at semantic tags so that when we get to logical types, we can better understand how a semantic tag might add additional information onto it.

Semantic Tags

Here is the full set of Woodwork-defined semantic tags:

[1]:
import woodwork as ww
ww.list_semantic_tags()
[1]:
name is_standard_tag valid_logical_types
0 numeric True [Age, AgeFractional, AgeNullable, Double, Inte...
1 category True [Categorical, CountryCode, Ordinal, PostalCode...
2 index False Any LogicalType
3 time_index False [Datetime, Age, AgeFractional, AgeNullable, Do...
4 date_of_birth False [Datetime]
5 ignore False Any LogicalType
6 passthrough False Any LogicalType

Standard Tags

Standard tags are associated with specific logical types. They are useful for indicating predefined categories that logical types might fall into.

  • 'numeric' - Is applied to any numeric logical type

    • Uses: Can select for just numeric columns when performing operations that require numeric columns

    • Related Properties: series.ww.is_numeric

  • 'category' - Is applied to any logical type that is categorical in nature

    • Uses: Can select for just categorical columns when performing operations that require categorical columns

    • Related Properties: series.ww.is_categorical

Index Tags

Index tags are added by Woodwork to a DataFrame when an index or time_index column is identified by the user. These tags have some special properties that are only confirmed to be true in the context of a DataFrame (so any Series with these tags may not have these properties).

  • 'index' - Indicates that a column is the DataFrame’s index, or primary key

    • There will only be one index column

    • The contents of an index column will be unique

    • An index column will have any standard semantic tags associated with its logical type removed

    • In pandas DataFrames, the data in an index column will be reflected in the DataFrame’s underlying index

  • 'time_index' - Indicates that a column is the DataFrame’s time index

    • There will only be one time index column

    • A time index column will contain either datetime or numeric data
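
The index rules above can be illustrated with a small pure-Python sketch. It does not call Woodwork: the `set_index` helper and the dictionary layout are hypothetical and only mirror the documented behavior (a single index column, unique contents, standard tags removed).

```python
# Hypothetical stand-in for a typed table: column name -> (values, semantic tags).
# This mimics the documented index rules; it is not Woodwork's implementation.
STANDARD_TAGS = {"numeric", "category"}


def set_index(columns, name):
    """Return a copy of `columns` with `name` tagged as the index."""
    values, tags = columns[name]
    # Rule: the contents of an index column must be unique.
    if len(set(values)) != len(values):
        raise ValueError(f"Index column {name!r} must be unique")
    updated = {}
    for col, (vals, col_tags) in columns.items():
        # Rule: there is only one index column, so drop the tag elsewhere.
        updated[col] = (vals, set(col_tags) - {"index"})
    # Rule: standard tags are removed from the index column.
    updated[name] = (values, (set(tags) - STANDARD_TAGS) | {"index"})
    return updated


table = {
    "id": ([1, 2, 3], {"numeric"}),
    "quantity": ([5, 5, 7], {"numeric"}),
}
indexed = set_index(table, "id")
print(indexed["id"][1])        # {'index'}
print(indexed["quantity"][1])  # {'numeric'}
```

Calling `set_index(table, "quantity")` in this sketch raises a `ValueError`, because the value 5 appears twice.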

Other Tags

The tags listed below may be added directly to columns during or after Woodwork initialization. They are tags that have suggested meanings and that can be added to columns that will be used in the manner described below. Woodwork will neither add them automatically to a DataFrame nor take direct action upon a column if they are present.

  • 'date_of_birth' - Indicates that a datetime column should be parsed as a date of birth

  • 'ignore'/'passthrough' - Indicates that a column should be ignored during feature engineering or model building but should still be passed through these operations so that the column is not lost.

Additional tags beyond the ones Woodwork adds at initialization may be useful for a DataFrame’s interpretability, so users are encouraged to add any tags that will allow them to use their data more efficiently.

Logical Types

Below are all of the Logical Types that Woodwork defines.

[2]:
import woodwork as ww
ww.list_logical_types()
[2]:
name type_string description physical_type standard_tags is_default_type is_registered parent_type
0 Address address Represents Logical Types that contain address ... string {} True True None
1 Age age Represents Logical Types that contain whole nu... int64 {numeric} True True Integer
2 AgeFractional age_fractional Represents Logical Types that contain non-nega... float64 {numeric} True True Double
3 AgeNullable age_nullable Represents Logical Types that contain whole nu... Int64 {numeric} True True IntegerNullable
4 Boolean boolean Represents Logical Types that contain binary v... bool {} True True BooleanNullable
5 BooleanNullable boolean_nullable Represents Logical Types that contain binary v... boolean {} True True None
6 Categorical categorical Represents Logical Types that contain unordere... category {category} True True None
7 CountryCode country_code Represents Logical Types that use the ISO-3166... category {category} True True Categorical
8 Datetime datetime Represents Logical Types that contain date and... datetime64[ns] {} True True None
9 Double double Represents Logical Types that contain positive... float64 {numeric} True True None
10 EmailAddress email_address Represents Logical Types that contain email ad... string {} True True None
11 Filepath filepath Represents Logical Types that specify location... string {} True True None
12 IPAddress ip_address Represents Logical Types that contain IP addre... string {} True True None
13 Integer integer Represents Logical Types that contain positive... int64 {numeric} True True IntegerNullable
14 IntegerNullable integer_nullable Represents Logical Types that contain positive... Int64 {numeric} True True None
15 LatLong lat_long Represents Logical Types that contain latitude... object {} True True None
16 NaturalLanguage natural_language Represents Logical Types that contain text or ... string {} True True None
17 Ordinal ordinal Represents Logical Types that contain ordered ... category {category} True True Categorical
18 PersonFullName person_full_name Represents Logical Types that may contain firs... string {} True True None
19 PhoneNumber phone_number Represents Logical Types that contain numeric ... string {} True True None
20 PostalCode postal_code Represents Logical Types that contain a series... category {category} True True Categorical
21 SubRegionCode sub_region_code Represents Logical Types that use the ISO-3166... category {category} True True Categorical
22 Timedelta timedelta Represents Logical Types that contain values s... timedelta64[ns] {} True True None
23 URL url Represents Logical Types that contain URLs, wh... string {} True True None
24 Unknown unknown Represents Logical Types that cannot be inferr... string {} True True None

In the DataFrame above, we can see a parent_type column. The parent_type of a LogicalType refers to a logical type that is a more general version of the current LogicalType. See the Custom Types and Type Inference guide for more details on how parent-child relationships between logical types impact Woodwork’s type inference.

Base LogicalType Class

All logical types used by Woodwork are subclasses of the base LogicalType class. Because the following behaviors are defined on LogicalType, every logical type shares them:

  • All logical types define a dtype that will get used for any column with that logical type - this is how the physical type for a column gets determined

  • All logical types perform some basic transformation into the expected physical type (dtype) - this is how Woodwork LogicalTypes act as a form of data-transformers. Depending on the requirements of a LogicalType, a LogicalType can transform input data into an expected format.

    class LogicalType(object, metaclass=LogicalTypeMetaClass):
        """Base class for all other Logical Types"""
        type_string = ClassNameDescriptor()
        primary_dtype = 'string'
        backup_dtype = None
        standard_tags = set()
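
As a rough illustration of how class attributes like these could drive a column's physical type and standard tags, here is a simplified stand-in. It omits the metaclass and descriptor shown above, and the two subclasses are re-created for the example rather than imported from Woodwork, so treat it as a sketch only.

```python
# Simplified stand-in for the base class shown above (no metaclass or
# descriptor); the real Woodwork classes carry more behavior than this.
class LogicalType:
    primary_dtype = "string"
    backup_dtype = None
    standard_tags = set()


class Integer(LogicalType):
    # Overriding the dtype changes the physical type of the column.
    primary_dtype = "int64"
    # Standard tags are attached to columns that use this logical type.
    standard_tags = {"numeric"}


class EmailAddress(LogicalType):
    # Inherits primary_dtype = "string" and an empty set of standard tags.
    pass


for ltype in (Integer, EmailAddress):
    print(ltype.__name__, ltype.primary_dtype, ltype.standard_tags)
```

The attribute values used here match the physical_type and standard_tags columns in the table of logical types above.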
    

Default Logical Type

Unknown

When Woodwork’s type inference does not return any LogicalTypes for a column, Woodwork will set the column’s logical type as the default LogicalType, Unknown. A logical type being inferred as Unknown may be a good indicator that a more specific logical type can be chosen and set by the user.

  • physical type: string

Below is an example of a column for which no logical type is inferred, resulting in a Series with Unknown logical type. Looking at the contents of the Series, though, we can see that it contains country codes, so we set the logical type to CountryCode.

[3]:
import pandas as pd

series = pd.Series(["AU", "US", "UA"])
unknown_series = ww.init_series(series)
unknown_series.ww
[3]:
<Series: None (Physical Type = string) (Logical Type = Unknown) (Semantic Tags = set())>
[4]:
countrycode_series = ww.init_series(unknown_series, 'CountryCode')
countrycode_series.ww
[4]:
<Series: None (Physical Type = category) (Logical Type = CountryCode) (Semantic Tags = {'category'})>

Numeric Logical Types

Age

Represents Logical Types that contain whole numbers indicating a person’s age.

  • physical type: int64

  • standard tags: {'numeric'}

AgeFractional

Represents Logical Types that contain non-negative floating point numbers indicating a person’s age. May contain null values.

  • physical type: float64

  • standard tags: {'numeric'}

AgeNullable

Represents Logical Types that contain whole numbers indicating a person’s age. May contain null values.

  • physical type: Int64

  • standard tags: {'numeric'}

Double

Represents Logical Types that contain positive and negative numbers, some of which include a fractional component.

  • physical type: float64

  • standard tags: {'numeric'}

Integer

Represents Logical Types that contain positive and negative numbers without a fractional component, including zero (0).

  • physical type: int64

  • standard tags: {'numeric'}

IntegerNullable

Represents Logical Types that contain positive and negative numbers without a fractional component, including zero (0). May contain null values.

  • physical type: Int64

  • standard tags: {'numeric'}

Below is a DataFrame with examples of each of the numeric LogicalTypes.

[5]:
numerics_df = pd.DataFrame({
    'ints' : [1, 2, 3, 4],
    'ints_nullable': pd.Series([1, 2, None, 4], dtype='Int64'),
    'floats' : [0.0, 1.1, 2.2, 3.3],
    'ages': [18, 22, 24, 34],
    'ages_nullable' : [None, 2, 22, 33]
})

numerics_df.ww.init(logical_types={'ages':'Age', 'ages_nullable':'AgeNullable'})
numerics_df.ww
[5]:
Physical Type Logical Type Semantic Tag(s)
Column
ints int64 Integer ['numeric']
ints_nullable Int64 IntegerNullable ['numeric']
floats float64 Double ['numeric']
ages int64 Age ['numeric']
ages_nullable Int64 AgeNullable ['numeric']

Categorical Logical Types

Categorical

Represents a Logical Type with few unique values relative to the size of the data.

  • physical type: category

  • standard tags: {'category'}

  • inference: Woodwork defines a threshold for the percentage of unique values relative to the size of the series, below which a series will be considered categorical. See the setting config options guide for more information on how to control this threshold.

  • koalas note: Koalas does not support the category dtype, so for Koalas DataFrames and Series, the string dtype will be used.
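
The threshold-based inference described above can be sketched in plain Python. The `looks_categorical` function and its default `threshold` of 0.2 are assumptions made for this illustration. They are not Woodwork's inference function or its configured value.

```python
def looks_categorical(values, threshold=0.2):
    """Guess whether `values` is categorical from its share of unique values.

    `threshold` is a made-up default for illustration, not Woodwork's setting.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    unique_ratio = len(set(non_null)) / len(non_null)
    # Below the threshold, the column is treated as categorical.
    return unique_ratio < threshold


eye_colors = ["brown", "blue", "brown", "green", "brown"] * 20  # 3 unique in 100
user_ids = [f"user-{i}" for i in range(100)]                    # 100 unique in 100

print(looks_categorical(eye_colors))  # True
print(looks_categorical(user_ids))    # False
```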

Some examples of data for which the Categorical logical type would apply:

  • Gender

  • Eye Color

  • Nationality

  • Hair Color

  • Spoken Language

CountryCode

Represents Logical Types that use the ISO-3166 standard country code to represent countries. ISO 3166-1 (countries) codes are supported. These codes should be in the Alpha-2 format.

  • physical type: category

  • standard tags: {'category'}

  • koalas note: Koalas does not support the category dtype, so for Koalas DataFrames and Series, the string dtype will be used.

For example: 'AU' for Australia, 'CN' for China, and 'CA' for Canada.

Ordinal

An Ordinal variable type can take ordered discrete values. Similar to Categorical, it usually has a limited and fixed number of possible values. However, these discrete values have a certain order, and the ordering is important to understanding the values. Ordinal variable types can be represented as strings or integers.

  • physical type: category

  • standard tags: {'category'}

  • parameters:

    • order - the order of the ordinal values in the column from low to high

  • validation - an order must be defined for an Ordinal column on a DataFrame or Series, and all elements of the order must be present.

  • koalas note: Koalas does not support the category dtype, so for Koalas DataFrames and Series, the string dtype will be used.
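
To make the order parameter and its validation more concrete, here is a minimal sketch in plain Python. It assumes the validation means that every value in the column must appear in the supplied order. The helper names `validate_ordinal` and `rank` are hypothetical and are not part of Woodwork.

```python
def validate_ordinal(values, order):
    """Raise if `values` contains entries that are missing from `order`."""
    if order is None:
        raise TypeError("An order must be defined for an Ordinal column")
    missing = {v for v in values if v is not None} - set(order)
    if missing:
        raise ValueError(f"Values not present in the order: {sorted(missing)}")


def rank(values, order):
    """Map each value to its position in `order` (low to high)."""
    position = {level: i for i, level in enumerate(order)}
    return [position[v] for v in values]


sizes = ["small", "large", "large", "medium"]
order = ["small", "medium", "large"]
validate_ordinal(sizes, order)
print(rank(sizes, order))  # [0, 2, 2, 1]
```

Because the order is explicit, comparisons such as "medium is lower than large" become meaningful, which is what separates Ordinal from Categorical.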

Some examples of data for which the Ordinal logical type would apply:

  • Educational Background (Elementary, High School, Undergraduate, Graduate)

  • Satisfaction Rating (Not Satisfied, Satisfied, Very Satisfied)

  • Spicy Level (Hot, Hotter, Hottest)

  • Student Grade (A, B, C, D, F)

  • Size (small, medium, large)

PostalCode

Represents Logical Types that contain a series of postal codes for representing a group of addresses.

  • physical type: category

  • standard tags: {'category'}

  • koalas note: Koalas does not support the category dtype, so for Koalas DataFrames and Series, the string dtype will be used.

SubRegionCode

Represents Logical Types that use the ISO-3166 standard sub-region code to represent a portion of a larger geographic region. ISO 3166-2 (sub-regions) codes are supported. These codes should be in the Alpha-2 format.

  • physical type: category

  • standard tags: {'category'}

  • koalas note: Koalas does not support the category dtype, so for Koalas DataFrames and Series, the string dtype will be used.

For example: 'US-IL' to represent Illinois in the United States or 'AU-TAS' to represent Tasmania in Australia.

[6]:
categoricals_df = pd.DataFrame({
    'categorical': pd.Series(['a', 'b', 'a', 'a'], dtype='category'),
    'ordinal' : ['small', 'large', 'large', 'medium'],
    'country_code' : ["AU", "US", "UA", "AU"],
    'postal_code': ["90210", "60035", "SW1A", "90210" ],
    'sub_region_code' : ["AU-NSW", "AU-TAS", "AU-QLD", "AU-QLD"]
})
categoricals_df.ww.init(logical_types={'ordinal':ww.logical_types.Ordinal(order=['small', 'medium', 'large']),
                                       'country_code':'CountryCode',
                                       'postal_code':'PostalCode',
                                       'sub_region_code':'SubRegionCode'})

categoricals_df.ww
[6]:
Physical Type Logical Type Semantic Tag(s)
Column
categorical category Categorical ['category']
ordinal category Ordinal ['category']
country_code category CountryCode ['category']
postal_code category PostalCode ['category']
sub_region_code category SubRegionCode ['category']

Miscellaneous Logical Types with Specific Formats

Boolean

Represents Logical Types that contain binary values indicating true/false.

  • physical type: bool

BooleanNullable

Represents Logical Types that contain binary values indicating true/false. May also contain null values.

  • physical type: boolean

Datetime

A Datetime is a representation of a date and/or time. Datetime variable types can be represented as strings or integers.

  • physical type: datetime64[ns]

  • transformation: Will convert valid strings or numbers to pandas datetimes, and will parse more datetime formats with the use of the datetime_format parameter.

  • parameters:

    • datetime_format - the format of the datetimes in the column, ex: '%Y-%m-%d' vs '%m-%d-%Y'
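
The reason an explicit format matters can be seen with the standard library alone. The sketch below uses `datetime.strptime` rather than Woodwork, and the sample string is made up. It shows that the same text parses to different dates under different format strings.

```python
from datetime import datetime

raw = "03-04-2019"

# The same string yields different dates depending on the declared format.
month_first = datetime.strptime(raw, "%m-%d-%Y")
day_first = datetime.strptime(raw, "%d-%m-%Y")

print(month_first.date())  # 2019-03-04
print(day_first.date())    # 2019-04-03
```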

Some examples of Datetime include:

  • Transaction Time

  • Flight Departure Time

  • Pickup Time

EmailAddress

Represents Logical Types that contain email address values.

  • physical type: string

  • inference: Uses an email address regex that, if the data matches, means that the column contains email addresses. To learn more about controlling the regex used, see the setting config options guide.
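
Regex-based inference of this kind might look roughly like the sketch below. The pattern is deliberately simple and is an assumption made for illustration. It is not the regex that Woodwork ships, and `looks_like_emails` is a hypothetical helper.

```python
import re

# Deliberately simple pattern for illustration; not the regex Woodwork uses.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_emails(values):
    """Return True when every non-null value matches EMAIL_PATTERN."""
    non_null = [v for v in values if v is not None]
    return bool(non_null) and all(EMAIL_PATTERN.match(v) for v in non_null)


good = ["a.person" + "@" + "example.com", None, "team" + "@" + "example.org"]
bad = ["a.person" + "@" + "example.com", "not an email"]

print(looks_like_emails(good))  # True
print(looks_like_emails(bad))   # False
```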

LatLong

A LatLong represents an ordered pair (Latitude, Longitude) that gives a location on Earth. The order of the tuple is important. LatLongs can be represented as a tuple of floating point numbers.

  • physical type: object

  • transformation: Will convert inputs into a tuple of floats. Any null values will be stored as np.nan

  • koalas note: Koalas does not support tuples, so latlongs will be stored as a list of floats
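
The transformation described above can be approximated in plain Python. The `to_latlong` helper below is hypothetical, and the input forms it accepts are an assumption. Woodwork's own conversion may handle more cases or behave differently.

```python
import math


def to_latlong(value):
    """Convert one input into a (latitude, longitude) tuple of floats.

    A missing input becomes NaN, and a missing coordinate becomes NaN.
    Illustrative only; Woodwork's own transformation may differ.
    """
    if value is None:
        return math.nan
    if isinstance(value, str):
        # Accept forms such as "[33.6, -117.8]" or "40.4, -86.9".
        parts = value.strip("[]() ").split(",")
    else:
        parts = list(value)
    if len(parts) != 2:
        raise ValueError(f"Expected two coordinates, got {value!r}")

    def to_float(part):
        return math.nan if part is None else float(part)

    return tuple(to_float(p) for p in parts)


print(to_latlong("[33.670914, -117.841501]"))  # (33.670914, -117.841501)
print(to_latlong("40.423599, -86.921162"))     # (40.423599, -86.921162)
print(to_latlong((-45.031705, None)))          # (-45.031705, nan)
```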

Timedelta

Represents Logical Types that contain values specifying a duration of time.

  • physical type: timedelta64[ns]

Examples could include:

  • Days/months/years since some event

  • How long a flight’s arrival was delayed/early

  • Days until birthday

Below, we’ll see a DataFrame that contains data for each of these logical types. Some columns like dates and latlongs will have their data transformed to a format that Woodwork expects.

[7]:
df = pd.DataFrame({
    'dates': ["2019/01/01", "2019/01/02", "2019/01/03", "2019/01/03"],
    'latlongs': ['[33.670914, -117.841501]', '40.423599, -86.921162', (-45.031705, None), None],
    'booleans': [True, True, False, True],
    'bools_nullable': pd.Series([True, False, True, None], dtype='boolean'),
    'timedelta': [pd.Timedelta('1 days 00:00:00'), pd.Timedelta('-1 days +23:40:00'),
             pd.Timedelta('4 days 12:00:00'), pd.Timedelta('-1 days +23:40:00')],
    'emails':["[email protected]", "[email protected]", "[email protected]", "[email protected]"]
})
df
[7]:
dates latlongs booleans bools_nullable timedelta emails
0 2019/01/01 [33.670914, -117.841501] True True 1 days 00:00:00 [email protected]
1 2019/01/02 40.423599, -86.921162 True False -1 days +23:40:00 [email protected]
2 2019/01/03 (-45.031705, None) False True 4 days 12:00:00 [email protected]
3 2019/01/03 None True <NA> -1 days +23:40:00 [email protected]
[8]:
df.ww.init(logical_types={'latlongs':'LatLong',
                          'dates':ww.logical_types.Datetime(datetime_format='%Y/%m/%d')})
df.ww
[8]:
Physical Type Logical Type Semantic Tag(s)
Column
dates datetime64[ns] Datetime []
latlongs object LatLong []
booleans bool Boolean []
bools_nullable boolean BooleanNullable []
timedelta timedelta64[ns] Timedelta []
emails string EmailAddress []
[9]:
df
[9]:
dates latlongs booleans bools_nullable timedelta emails
0 2019-01-01 (33.670914, -117.841501) True True 1 days 00:00:00 [email protected]
1 2019-01-02 (40.423599, -86.921162) True False -1 days +23:40:00 [email protected]
2 2019-01-03 (-45.031705, nan) False True 4 days 12:00:00 [email protected]
3 2019-01-03 NaN True <NA> -1 days +23:40:00 [email protected]

String Logical Types

NaturalLanguage

Represents Logical Types that contain long-form text or characters representing natural human language.

  • physical type: string

Examples of natural language data:

  • “Any additional comments” in a feedback form

  • Customer Review

  • Patient Notes

Address

Represents Logical Types that contain address values.

  • physical type: string

Filepath

Represents Logical Types that specify locations of directories and files in a file system.

  • physical type: string

PersonFullName

Represents Logical Types that may contain first, middle and last names, including honorifics and suffixes.

  • physical type: string

PhoneNumber

Represents Logical Types that contain numeric digits and characters representing a phone number.

  • physical type: string

URL

Represents Logical Types that contain URLs, which may include protocol, hostname and file name.

  • physical type: string

IPAddress

Represents Logical Types that contain IP addresses, including both IPv4 and IPv6 addresses.

  • physical type: string

[10]:
strings_df = pd.DataFrame({
    'natural_language':["This is a short sentence.",
                         "I like to eat pizza!",
                         "When will humans go to mars?",
                         "This entry contains two sentences. Second sentence."],
    'addresses':['1 Miller Drive, New York, NY 12345', '1 Berkeley Street, Boston, MA 67891',
                '26387 Russell Hill, Dallas, TX 34521', '54305 Oxford Street, Seattle, WA 95132'],
    'filepaths':["/usr/local/bin",
                 "/Users/john.smith/dev/index.html",
                 "/tmp",
                 "../woodwork"],
    'full_names':["Mr. John Doe, Jr.",
                 "Doe, Mrs. Jane",
                 "James Brown",
                 "John Smith"],
    'phone_numbers':["1-(555)-123-5495",
                     "+1-555-123-5495",
                     "5551235495",
                     "111-222-3333"],
    'urls': ["http://google.com",
             "https://example.com/index.html",
             "example.com",
             "https://woodwork.alteryx.com/"],
    'ip_addresses': ["172.16.254.1",
                     "192.0.0.0",
                     "2001:0db8:0000:0000:0000:ff00:0042:8329",
                     "192.0.0.0"],
})
strings_df.ww.init(logical_types={
    'natural_language':'NaturalLanguage',
    'addresses':'Address',
    'filepaths':'Filepath',
    'full_names':'PersonFullName',
    'phone_numbers':'PhoneNumber',
    'urls':'URL',
    'ip_addresses':'IPAddress',
})
strings_df.ww
[10]:
Physical Type Logical Type Semantic Tag(s)
Column
natural_language string NaturalLanguage []
addresses string Address []
filepaths string Filepath []
full_names string PersonFullName []
phone_numbers string PhoneNumber []
urls string URL []
ip_addresses string IPAddress []

ColumnSchema objects

Now that we’ve gone in-depth on semantic tags and logical types, we can start to understand how they’re used together to build Woodwork tables and define type spaces.

A ColumnSchema is the typing information for a single column. We can obtain a ColumnSchema from a Woodwork-initialized DataFrame as follows:

[11]:
# Woodwork typing info for a DataFrame
retail_df = ww.demo.load_retail()
retail_df.ww
[11]:
Physical Type Logical Type Semantic Tag(s)
Column
order_product_id category Categorical ['index']
order_id category Categorical ['category']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime ['time_index']
unit_price float64 Double ['numeric']
customer_name category Categorical ['category']
country category Categorical ['category']
total float64 Double ['numeric']
cancelled bool Boolean []

Above is the typing information for a Woodwork DataFrame. If we want, we can access just the schema of typing information outside of the context of the actual data in the DataFrame.

[12]:
# A Woodwork TableSchema
retail_df.ww.schema
[12]:
Logical Type Semantic Tag(s)
Column
order_product_id Categorical ['index']
order_id Categorical ['category']
product_id Categorical ['category']
description NaturalLanguage []
quantity Integer ['numeric']
order_date Datetime ['time_index']
unit_price Double ['numeric']
customer_name Categorical ['category']
country Categorical ['category']
total Double ['numeric']
cancelled Boolean []

The representation of the woodwork.table_schema.TableSchema is only different in that it does not have a column for the physical types.

This lack of a physical type is due to the fact that a TableSchema has no data, and therefore no physical representation of the data. We often rely on physical typing information to know the exact pandas or Dask or Koalas operations that are valid for a DataFrame, but for a schema of typing information that is not tied to data, those operations are not relevant.

Now, let’s look at a single column of typing information, or a woodwork.column_schema.ColumnSchema, which we can acquire in much the same way as we select a Series from the DataFrame:

[13]:
# Woodwork typing info for a Series
quantity = retail_df.ww['quantity']
quantity.ww
[13]:
<Series: quantity (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'numeric'})>
[14]:
# A Woodwork ColumnSchema
quantity_schema = quantity.ww.schema
quantity_schema
[14]:
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>

The quantity_schema object above can be understood as typing information for a single column that is not tied to any data. In this case, we happen to know where the column schema came from: it was the quantity column from the retail_df DataFrame. But we can also create a ColumnSchema that exists without being associated with any individual column of data.

If we look again at the retail_df table as a whole, we can see the similarities and differences between the columns, and we can describe those subsets of the DataFrame with ColumnSchema objects, or type spaces.

[15]:
retail_df.ww.schema
[15]:
Logical Type Semantic Tag(s)
Column
order_product_id Categorical ['index']
order_id Categorical ['category']
product_id Categorical ['category']
description NaturalLanguage []
quantity Integer ['numeric']
order_date Datetime ['time_index']
unit_price Double ['numeric']
customer_name Categorical ['category']
country Categorical ['category']
total Double ['numeric']
cancelled Boolean []

Below are several ColumnSchemas that would all include our quantity column, but each of them describes a different type space. These ColumnSchemas get more restrictive as we go down:

  • <ColumnSchema > - No restrictions have been placed; any column falls into this definition.

  • <ColumnSchema (Semantic Tags = ['numeric'])> - Only columns with the numeric tag apply. This can include Double, Integer, and Age logical type columns as well.

  • <ColumnSchema (Logical Type = Integer)> - Only columns with a logical type of Integer are included in this definition. Does not require the numeric tag, so an index column (which has its standard tags removed) would still apply.

  • <ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])> - The column must have logical type Integer and have the numeric semantic tag, excluding index columns.

In this way, a ColumnSchema can define a type space under which columns in a Woodwork DataFrame can fall.
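
The idea of a type space can be sketched without Woodwork. The `TypeSpace` dataclass below is a hypothetical stand-in for a ColumnSchema used as a filter, with logical types written as plain strings. It mirrors the four definitions listed above and checks them against the quantity column and an Integer index column.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class TypeSpace:
    """Hypothetical stand-in for a ColumnSchema used as a filter."""
    logical_type: Optional[str] = None
    semantic_tags: frozenset = field(default_factory=frozenset)

    def includes(self, logical_type, semantic_tags):
        # An unset logical type places no restriction on the column.
        if self.logical_type is not None and self.logical_type != logical_type:
            return False
        # Every required tag must be present on the column.
        return self.semantic_tags <= set(semantic_tags)


quantity = ("Integer", {"numeric"})
integer_index = ("Integer", {"index"})  # standard tags removed from an index

spaces = [
    TypeSpace(),
    TypeSpace(semantic_tags=frozenset({"numeric"})),
    TypeSpace(logical_type="Integer"),
    TypeSpace(logical_type="Integer", semantic_tags=frozenset({"numeric"})),
]

print([s.includes(*quantity) for s in spaces])       # [True, True, True, True]
print([s.includes(*integer_index) for s in spaces])  # [True, False, True, False]
```

The second result shows why the distinction matters: an Integer index column falls inside the spaces that do not require the numeric tag and outside the ones that do.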

Checking for nullable logical types

Some logical types support having null values in the underlying data while others do not. This is entirely based on whether a logical type’s underlying primary dtype or backup dtype supports null values. For example, the EmailAddress logical type has an underlying primary dtype of string. Pandas allows series with the dtype string to contain null values marked by the pandas.NA sentinel. Therefore, EmailAddress supports null values. On the other hand, the Integer logical type does not support null values since its underlying primary pandas dtype is int64. Pandas does not allow null values in series with the dtype int64. However, pandas does allow null values in series with the dtype Int64. Therefore, the IntegerNullable logical type supports null values since its primary dtype is Int64.
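
The difference between the two integer dtypes can be checked directly in pandas, independent of Woodwork. This is a small sketch with made-up data. The exception type raised by the failed conversion may vary across pandas versions, so both `TypeError` and `ValueError` are caught.

```python
import pandas as pd

# The nullable extension dtype stores the missing entry as pandas.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.isna().tolist())  # [False, False, True]

# The NumPy-backed int64 dtype has no representation for a missing value,
# so converting data that contains one fails.
try:
    nullable.astype("int64")
    converted = True
except (TypeError, ValueError):
    converted = False
print(converted)  # False
```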

You can check if a column contains a nullable logical type by using nullable on the column accessor. The sections above that describe each type’s characteristics include information about whether or not a logical type is nullable.

[16]:
df.ww['bools_nullable'].ww.nullable
[16]:
True