Understanding Logical Types and Semantic Tags#

In a Woodwork DataFrame, each column has three pieces of typing information associated with it: a physical type, a logical type, and semantic tags.

This guide walks through every logical type and semantic tag that Woodwork defines so that users can choose the ones that most closely describe their data. As a reminder, here are quick definitions of Woodwork’s types:

Physical Type: defines how the data is stored on disk or in memory.
Logical Type: defines how the data should be parsed or interpreted.
Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.

Woodwork will attempt to infer a column’s LogicalType if none is supplied at initialization. A column’s logical type will then inform which physical type and standard semantic tags are applied to it. However, setting the types manually will allow for more accurate typing of a DataFrame.

Having accurate typing information on a Woodwork DataFrame impacts how the data is parsed, transformed, and later interpreted downstream of Woodwork initialization. Therefore, understanding Woodwork’s logical types and semantic tags is essential to downstream usage of Woodwork.

For an in-depth guide on how to set and manipulate these types, see the Working with Types and Tags guide.

For information on how to customize Woodwork’s type system, see the Custom Types and Inference guide.

It’s important to remember that Woodwork columns will always have a logical type and that any semantic tags that are added by Woodwork are meant to add additional meaning onto that logical type. We’ll start out by looking in-depth at semantic tags so that when we get to logical types, we can better understand how a semantic tag might add additional information onto it.

Semantic Tags#

Here is the full set of Woodwork-defined semantic tags:

[1]:

import woodwork as ww

ww.list_semantic_tags()

[1]:

	name	is_standard_tag	valid_logical_types
0	numeric	True	[Age, AgeFractional, AgeNullable, Double, Inte...
1	category	True	[Categorical, CountryCode, CurrencyCode, Ordin...
2	index	False	Any LogicalType
3	time_index	False	[Datetime, Age, AgeFractional, AgeNullable, Do...
4	date_of_birth	False	[Datetime]
5	ignore	False	Any LogicalType
6	passthrough	False	Any LogicalType

Standard Tags#

Standard tags are associated with specific logical types. They are useful for indicating predefined categories that logical types might fall into.

'numeric' - Is applied to any numeric logical type
- Uses: Can select for just numeric columns when performing operations that require numeric columns
- Related Properties: series.ww.is_numeric
'category' - Is applied to any logical type that is categorical in nature
- Uses: Can select for just categorical columns when performing operations that require categorical columns
- Related Properties: series.ww.is_categorical

Index Tags#

Index tags are added by Woodwork to a DataFrame when an index or time_index column is identified by the user. These tags have some special properties that are only confirmed to be true in the context of a DataFrame (so any Series with these tags may not have these properties).

'index' - Indicates that a column is the DataFrame’s index, or primary key
- There will only be one index column
- The contents of an index column will be unique
- An index column will have any standard semantic tags associated with its logical type removed
- In pandas DataFrames, the data in an index column will be reflected in the DataFrame’s underlying index
'time_index' - Indicates that a column is the DataFrame’s time index
- There will only be one time index column
- A time index column will contain either datetime or numeric data

Other Tags#

The tags listed below may be added directly to columns during or after Woodwork initialization. They are tags that have suggested meanings and that can be added to columns that will be used in the manner described below. Woodwork will neither add them automatically to a DataFrame nor take direct action upon a column if they are present.

'date_of_birth' - Indicates that a datetime column should be parsed as a date of birth
'ignore'/'passthrough' - Indicates that a column should be ignored during feature engineering or model building but should still be passed through these operations so that the column is not lost.

Additional tags beyond the ones Woodwork adds at initialization may be useful for a DataFrame’s interpretability, so users are encouraged to add any tags that will allow them to use their data more efficiently.

Logical Types#

Below are all of the Logical Types that Woodwork defines.

[2]:

import woodwork as ww

ww.list_logical_types()

[2]:

	name	type_string	description	physical_type	standard_tags	is_default_type	is_registered	parent_type
0	Address	address	Represents Logical Types that contain address ...	string	{}	True	True	None
1	Age	age	Represents Logical Types that contain whole nu...	int64	{numeric}	True	True	Integer
2	AgeFractional	age_fractional	Represents Logical Types that contain non-nega...	float64	{numeric}	True	True	Double
3	AgeNullable	age_nullable	Represents Logical Types that contain whole nu...	Int64	{numeric}	True	True	IntegerNullable
4	Boolean	boolean	Represents Logical Types that contain binary v...	bool	{}	True	True	BooleanNullable
5	BooleanNullable	boolean_nullable	Represents Logical Types that contain binary v...	boolean	{}	True	True	None
6	Categorical	categorical	Represents Logical Types that contain unordere...	category	{category}	True	True	None
7	CountryCode	country_code	Represents Logical Types that use the ISO-3166...	category	{category}	True	True	Categorical
8	CurrencyCode	currency_code	Represents Logical Types that use the ISO-4217...	category	{category}	True	True	Categorical
9	Datetime	datetime	Represents Logical Types that contain date and...	datetime64[ns]	{}	True	True	None
10	Double	double	Represents Logical Types that contain positive...	float64	{numeric}	True	True	None
11	EmailAddress	email_address	Represents Logical Types that contain email ad...	string	{}	True	True	Unknown
12	Filepath	filepath	Represents Logical Types that specify location...	string	{}	True	True	None
13	IPAddress	ip_address	Represents Logical Types that contain IP addre...	string	{}	True	True	Unknown
14	Integer	integer	Represents Logical Types that contain positive...	int64	{numeric}	True	True	IntegerNullable
15	IntegerNullable	integer_nullable	Represents Logical Types that contain positive...	Int64	{numeric}	True	True	None
16	LatLong	lat_long	Represents Logical Types that contain latitude...	object	{}	True	True	None
17	NaturalLanguage	natural_language	Represents Logical Types that contain text or ...	string	{}	True	True	None
18	Ordinal	ordinal	Represents Logical Types that contain ordered ...	category	{category}	True	True	Categorical
19	PersonFullName	person_full_name	Represents Logical Types that may contain firs...	string	{}	True	True	None
20	PhoneNumber	phone_number	Represents Logical Types that contain numeric ...	string	{}	True	True	Unknown
21	PostalCode	postal_code	Represents Logical Types that contain a series...	category	{category}	True	True	Categorical
22	SubRegionCode	sub_region_code	Represents Logical Types that use the ISO-3166...	category	{category}	True	True	Categorical
23	Timedelta	timedelta	Represents Logical Types that contain values s...	timedelta64[ns]	{}	True	True	Unknown
24	URL	url	Represents Logical Types that contain URLs, wh...	string	{}	True	True	Unknown
25	Unknown	unknown	Represents Logical Types that cannot be inferr...	string	{}	True	True	None

In the DataFrame above, we can see a parent_type column. The parent_type of a LogicalType refers to a logical type that is a more general version of the current LogicalType. See the Custom Types and Type Inference guide for more details on how parent-child relationships between logical types impact Woodwork’s type inference.

Base LogicalType Class#

All logical types used by Woodwork are subclassed off of the base LogicalType class, and since the following behaviors all exist on the LogicalType class, all logical types have the following behavior:

All logical types define a dtype that will get used for any column with that logical type - this is how the physical type for a column gets determined
All logical types perform some basic transformation into the expected physical type (dtype) - this is how Woodwork LogicalTypes act as a form of data-transformers. Depending on the requirements of a LogicalType, a LogicalType can transform input data into an expected format.
```
class LogicalType(object, metaclass=LogicalTypeMetaClass):
    """Base class for all other Logical Types"""
    type_string = ClassNameDescriptor()
    primary_dtype = 'string'
    standard_tags = set()
```

Default Logical Type#

Unknown#

When Woodwork’s type inference does not return any LogicalTypes for a column, Woodwork will set the column’s logical type as the default LogicalType, Unknown. A logical type being inferred as Unknown may be a good indicator that a more specific logical type can be chosen and set by the user.

physical type: string

Below is an example of a column for which no logical type is inferred, resulting in a Series with Unknown logical type. Looking at the contents of the Series, though, we can see that it contains country codes, so we set the logical type to CountryCode.

[3]:

import pandas as pd

series = pd.Series(["AU", "US", "UA"])
unknown_series = ww.init_series(series)
unknown_series.ww

[3]:

<Series: None (Physical Type = string) (Logical Type = Unknown) (Semantic Tags = set())>

[4]:

countrycode_series = ww.init_series(unknown_series, "CountryCode")
countrycode_series.ww

[4]:

<Series: None (Physical Type = category) (Logical Type = CountryCode) (Semantic Tags = {'category'})>

Numeric Logical Types#

Age#

Represents Logical Types that contain whole numbers indicating a person’s age.

physical type: int64
standard tags: {'numeric'}

AgeFractional#

Represents Logical Types that contain non-negative floating point numbers indicating a person’s age. May contain null values.

physical type: float64
standard tags: {'numeric'}

AgeNullable#

Represents Logical Types that contain whole numbers indicating a person’s age. May contain null values.

physical type: Int64
standard tags: {'numeric'}

Double#

Represents Logical Types that contain positive and negative numbers, some of which include a fractional component.

physical type: float64
standard tags: {'numeric'}

Integer#

Represents Logical Types that contain positive and negative numbers without a fractional component, including zero (0).

physical type: int64
standard tags: {'numeric'}

IntegerNullable#

Represents Logical Types that contain positive and negative numbers without a fractional component, including zero (0). May contain null values.

physical type: Int64
standard tags: {'numeric'}

Below is a DataFrame with example columns for several of the numeric logical types:

[5]:

numerics_df = pd.DataFrame(
    {
        "ints": [1, 2, 3, 4],
        "ints_nullable": pd.Series([1, 2, None, 4], dtype="Int64"),
        "floats": [0.0, 1.1, 2.2, 3.3],
        "ages": [18, 22, 24, 34],
        "ages_nullable": [None, 2, 22, 33],
    }
)

numerics_df.ww.init(logical_types={"ages": "Age", "ages_nullable": "AgeNullable"})
numerics_df.ww

[5]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
ints	int64	Integer	['numeric']
ints_nullable	Int64	IntegerNullable	['numeric']
floats	float64	Double	['numeric']
ages	int64	Age	['numeric']
ages_nullable	Int64	AgeNullable	['numeric']

Categorical Logical Types#

Categorical#

Represents a Logical Type with few unique values relative to the size of the data.

physical type: category
standard tags: {'category'}
inference: Woodwork defines a threshold for percentage unique values relative to the size of the series below which a series will be considered categorical. See setting config options guide for more information on how to control this threshold.

Some examples of data for which the Categorical logical type would apply:

Gender
Eye Color
Nationality
Hair Color
Spoken Language
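The idea behind the inference threshold can be sketched in plain Python. This is not Woodwork's implementation: the `looks_categorical` helper and the `0.2` threshold are assumptions made for illustration, and the real threshold is controlled through Woodwork's config options.

```python
def looks_categorical(values, threshold=0.2):
    """Return True when the share of unique non-null values is below `threshold`.

    `threshold` is a hypothetical value chosen for this example; it is not
    Woodwork's configured default.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    unique_ratio = len(set(non_null)) / len(non_null)
    return unique_ratio < threshold


eye_colors = ["brown", "blue", "green", "brown", "blue"] * 4  # 3 unique of 20
customer_ids = [f"id-{i}" for i in range(20)]  # 20 unique of 20

print(looks_categorical(eye_colors))
print(looks_categorical(customer_ids))
```

Here the eye colors have a unique ratio of 0.15 and fall under the threshold, while the identifiers are all distinct and do not.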

CountryCode#

Represents Logical Types that use the ISO-3166 standard country code to represent countries. ISO 3166-1 (country) codes are supported. These codes should be in the Alpha-2 format.

physical type: category
standard tags: {'category'}

For example: 'AU' for Australia, 'CN' for China, and 'CA' for Canada.

Ordinal#

An Ordinal variable type can take ordered discrete values. Similar to Categorical, it usually has a limited and fixed number of possible values. However, these discrete values have a certain order, and the ordering is important to understanding the values. Ordinal variable types can be represented as strings or integers.

physical type: category
standard tags: {'category'}
parameters:
- order - the order of the ordinal values in the column from low to high
validation - an order must be defined for an Ordinal column on a DataFrame or Series, and all elements of the order must be present.

Some examples of data for which the Ordinal logical type would apply:

Educational Background (Elementary, High School, Undergraduate, Graduate)
Satisfaction Rating (Not Satisfied, Satisfied, Very Satisfied)
Spicy Level (Hot, Hotter, Hottest)
Student Grade (A, B, C, D, F)
Size (small, medium, large)
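To see why the order matters, the sketch below uses a pandas ordered categorical rather than Woodwork's Ordinal type itself. It reuses the size values from the list above; the variable names are made up for the example.

```python
import pandas as pd

# Order of the values from low to high
order = ["small", "medium", "large"]

sizes = pd.Series(
    pd.Categorical(
        ["small", "large", "large", "medium"],
        categories=order,
        ordered=True,
    )
)

# With an order defined, comparisons and sorting follow that order
# instead of alphabetical order
smallest = sizes.min()
below_large = (sizes < "large").tolist()
sorted_sizes = sizes.sort_values().tolist()

print(smallest, below_large, sorted_sizes)
```

Without the explicit order, sorting these strings alphabetically would place "large" before "medium" and "small".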

PostalCode#

Represents Logical Types that contain a series of postal codes for representing a group of addresses.

physical type: category
standard tags: {'category'}

SubRegionCode#

Represents Logical Types that use the ISO-3166 standard sub-region code to represent a portion of a larger geographic region. ISO 3166-2 (sub-region) codes are supported. Each code begins with the country's Alpha-2 code, followed by a hyphen and a subdivision code.

physical type: category
standard tags: {'category'}

For example: 'US-IL' to represent Illinois in the United States or 'AU-TAS' to represent Tasmania in Australia.

[6]:

categoricals_df = pd.DataFrame(
    {
        "categorical": pd.Series(["a", "b", "a", "a"], dtype="category"),
        "ordinal": ["small", "large", "large", "medium"],
        "country_code": ["AU", "US", "UA", "AU"],
        "postal_code": ["90210", "60035", "SW1A", "90210"],
        "sub_region_code": ["AU-NSW", "AU-TAS", "AU-QLD", "AU-QLD"],
    }
)
categoricals_df.ww.init(
    logical_types={
        "ordinal": ww.logical_types.Ordinal(order=["small", "medium", "large"]),
        "country_code": "CountryCode",
        "postal_code": "PostalCode",
        "sub_region_code": "SubRegionCode",
    }
)

categoricals_df.ww

[6]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
categorical	category	Categorical	['category']
ordinal	category	Ordinal: ['small', 'medium', 'large']	['category']
country_code	category	CountryCode	['category']
postal_code	category	PostalCode	['category']
sub_region_code	category	SubRegionCode	['category']

Miscellaneous Logical Types with Specific Formats#

Boolean#

Represents Logical Types that contain binary values indicating true/false.

physical type: bool

BooleanNullable#

Represents Logical Types that contain binary values indicating true/false. May also contain null values.

physical type: boolean

Datetime#

A Datetime is a representation of a date and/or time. Datetime variable types can be represented as strings or integers.

physical type: datetime64[ns]
transformation: Will convert valid strings or numbers to pandas datetimes, and will parse more datetime formats with the use of the datetime_format parameter.
parameters:
- datetime_format - the format of the datetimes in the column, ex: '%Y-%m-%d' vs '%m-%d-%Y'

Some examples of Datetime include:

Transaction Time
Flight Departure Time
Pickup Time
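The reason a format can matter is that the same string may be a valid date under more than one format. The sketch below uses Python's standard `datetime` module rather than Woodwork, with the same style of format directives as the `datetime_format` examples above.

```python
from datetime import datetime

raw = "01-02-2019"

# The same string yields different dates depending on the format supplied
month_first = datetime.strptime(raw, "%m-%d-%Y")
day_first = datetime.strptime(raw, "%d-%m-%Y")

print(month_first.date())
print(day_first.date())
```

Supplying the format explicitly removes this ambiguity for the whole column.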

EmailAddress#

Represents Logical Types that contain email address values.

physical type: string
inference: Uses an email address regex that, if the data matches, means that the column contains email addresses. To learn more about controlling the regex used, see the setting config options guide.
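Regex-based inference of this kind can be sketched as follows. The pattern here is a simplified, hypothetical one and not the regex Woodwork ships with, and requiring every non-null value to match is also an assumption of this sketch.

```python
import re

# Simplified, hypothetical pattern for illustration only
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def all_match_email(values):
    """Return True when every non-null value matches the email pattern."""
    non_null = [v for v in values if v is not None]
    return bool(non_null) and all(EMAIL_PATTERN.match(v) for v in non_null)


contacts = ["jane.doe@example.com", "support@example.org", None]
mixed = ["jane.doe@example.com", "not an email"]

print(all_match_email(contacts))
print(all_match_email(mixed))
```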

LatLong#

A LatLong represents an ordered pair (Latitude, Longitude) that identifies a location on Earth. The order of the tuple is important. LatLongs can be represented as a tuple of floating point numbers.

physical type: object
transformation: Will convert inputs into a tuple of floats. Any null values will be stored as np.nan
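A transformation like this one can be sketched in plain Python. The `to_latlong` helper below is hypothetical and is not Woodwork's implementation; it uses `math.nan` in place of `np.nan` to stay dependency-free, and the input formats it accepts are assumptions.

```python
import math


def _to_float(part):
    """Convert one coordinate to a float, mapping None to NaN."""
    if part is None:
        return math.nan
    return float(part)


def to_latlong(value):
    """Hypothetical helper: coerce one input into a (lat, long) tuple of floats."""
    if value is None:
        return math.nan
    if isinstance(value, str):
        # Accept "[lat, long]", "(lat, long)", or "lat, long"
        parts = value.strip("[]() ").split(",")
    else:
        parts = list(value)
    if len(parts) != 2:
        raise ValueError(f"Expected two coordinates, got {len(parts)}")
    return tuple(_to_float(p) for p in parts)


print(to_latlong("[33.670914, -117.841501]"))
print(to_latlong("40.423599, -86.921162"))
print(to_latlong((-45.031705, None)))
print(to_latlong(None))
```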

Timedelta#

Represents Logical Types that contain values specifying a duration of time.

physical type: timedelta64[ns]

Examples could include:

Days/months/years since some event
How long a flight’s arrival was delayed/early
Days until birthday

Below, we’ll see a DataFrame that contains data for each of these logical types. Some columns like dates and latlongs will have their data transformed to a format that Woodwork expects.

[7]:

df = pd.DataFrame(
    {
        "dates": ["2019/01/01", "2019/01/02", "2019/01/03", "2019/01/03"],
        "latlongs": [
            "[33.670914, -117.841501]",
            "40.423599, -86.921162",
            (-45.031705, None),
            None,
        ],
        "booleans": [True, True, False, True],
        "bools_nullable": pd.Series([True, False, True, None], dtype="boolean"),
        "timedelta": [
            pd.Timedelta("1 days 00:00:00"),
            pd.Timedelta("-1 days +23:40:00"),
            pd.Timedelta("4 days 12:00:00"),
            pd.Timedelta("-1 days +23:40:00"),
        ],
        "emails": [
            "[email protected]",
            "[email protected]",
            "[email protected]",
            "[email protected]",
        ],
    }
)
df

[7]:

	dates	latlongs	booleans	bools_nullable	timedelta	emails
0	2019/01/01	[33.670914, -117.841501]	True	True	1 days 00:00:00	[email protected]
1	2019/01/02	40.423599, -86.921162	True	False	-1 days +23:40:00	[email protected]
2	2019/01/03	(-45.031705, None)	False	True	4 days 12:00:00	[email protected]
3	2019/01/03	None	True	<NA>	-1 days +23:40:00	[email protected]

[8]:

df.ww.init(
    logical_types={
        "latlongs": "LatLong",
        "dates": ww.logical_types.Datetime(datetime_format="%Y/%m/%d"),
    }
)
df.ww

[8]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
dates	datetime64[ns]	Datetime	[]
latlongs	object	LatLong	[]
booleans	bool	Boolean	[]
bools_nullable	boolean	BooleanNullable	[]
timedelta	timedelta64[ns]	Timedelta	[]
emails	string	EmailAddress	[]

[9]:

df

[9]:

	dates	latlongs	booleans	bools_nullable	timedelta	emails
0	2019-01-01	(33.670914, -117.841501)	True	True	1 days 00:00:00	[email protected]
1	2019-01-02	(40.423599, -86.921162)	True	False	-1 days +23:40:00	[email protected]
2	2019-01-03	(-45.031705, nan)	False	True	4 days 12:00:00	[email protected]
3	2019-01-03	NaN	True	<NA>	-1 days +23:40:00	[email protected]

String Logical Types#

NaturalLanguage#

Represents Logical Types that contain long-form text or characters representing natural human language.

physical type: string

Examples of natural language data:

“Any additional comments” in a feedback form
Customer Review
Patient Notes

Address#

Represents Logical Types that contain address values.

physical type: string

Filepath#

Represents Logical Types that specify locations of directories and files in a file system.

physical type: string

PersonFullName#

Represents Logical Types that may contain first, middle and last names, including honorifics and suffixes.

physical type: string

PhoneNumber#

Represents Logical Types that contain numeric digits and characters representing a phone number.

physical type: string

URL#

Represents Logical Types that contain URLs, which may include protocol, hostname and file name.

physical type: string

IPAddress#

Represents Logical Types that contain IP addresses, including both IPv4 and IPv6 addresses.

physical type: string

[10]:

strings_df = pd.DataFrame(
    {
        "natural_language": [
            "This is a short sentence.",
            "I like to eat pizza!",
            "When will humans go to mars?",
            "This entry contains two sentences. Second sentence.",
        ],
        "addresses": [
            "1 Miller Drive, New York, NY 12345",
            "1 Berkeley Street, Boston, MA 67891",
            "26387 Russell Hill, Dallas, TX 34521",
            "54305 Oxford Street, Seattle, WA 95132",
        ],
        "filepaths": [
            "/usr/local/bin",
            "/Users/john.smith/dev/index.html",
            "/tmp",
            "../woodwork",
        ],
        "full_names": [
            "Mr. John Doe, Jr.",
            "Doe, Mrs. Jane",
            "James Brown",
            "John Smith",
        ],
        "phone_numbers": [
            "1-(555)-123-5495",
            "+1-555-123-5495",
            "5551235495",
            "111-222-3333",
        ],
        "urls": [
            "http://google.com",
            "https://example.com/index.html",
            "example.com",
            "https://woodwork.alteryx.com/",
        ],
        "ip_addresses": [
            "172.16.254.1",
            "192.0.0.0",
            "2001:0db8:0000:0000:0000:ff00:0042:8329",
            "192.0.0.0",
        ],
    }
)
strings_df.ww.init(
    logical_types={
        "natural_language": "NaturalLanguage",
        "addresses": "Address",
        "filepaths": "Filepath",
        "full_names": "PersonFullName",
        "phone_numbers": "PhoneNumber",
        "urls": "URL",
        "ip_addresses": "IPAddress",
    }
)
strings_df.ww

[10]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
natural_language	string	NaturalLanguage	[]
addresses	string	Address	[]
filepaths	string	Filepath	[]
full_names	string	PersonFullName	[]
phone_numbers	string	PhoneNumber	[]
urls	string	URL	[]
ip_addresses	string	IPAddress	[]

ColumnSchema objects#

Now that we’ve gone in-depth on semantic tags and logical types, we can start to understand how they’re used together to build Woodwork tables and define type spaces.

A ColumnSchema is the typing information for a single column. We can obtain a ColumnSchema from a Woodwork-initialized DataFrame as follows:

[11]:

# Woodwork typing info for a DataFrame
retail_df = ww.demo.load_retail()
retail_df.ww

[11]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_product_id	category	Categorical	['index']
order_id	category	Categorical	['category']
product_id	category	Categorical	['category']
description	string	NaturalLanguage	[]
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	['time_index']
unit_price	float64	Double	['numeric']
customer_name	category	Categorical	['category']
country	category	Categorical	['category']
total	float64	Double	['numeric']
cancelled	bool	Boolean	[]

Above is the typing information for a Woodwork DataFrame. If we want, we can access just the schema of typing information outside of the context of the actual data in the DataFrame.

[12]:

# A Woodwork TableSchema
retail_df.ww.schema

[12]:

	Logical Type	Semantic Tag(s)
Column
order_product_id	Categorical	['index']
order_id	Categorical	['category']
product_id	Categorical	['category']
description	NaturalLanguage	[]
quantity	Integer	['numeric']
order_date	Datetime	['time_index']
unit_price	Double	['numeric']
customer_name	Categorical	['category']
country	Categorical	['category']
total	Double	['numeric']
cancelled	Boolean	[]

The representation of the woodwork.table_schema.TableSchema is only different in that it does not have a column for the physical types.

This lack of a physical type is due to the fact that a TableSchema has no data, and therefore no physical representation of the data. We often rely on physical typing information to know the exact pandas operations that are valid for a DataFrame, but for a schema of typing information that is not tied to data, those operations are not relevant.

Now, let’s look at a single column of typing information, or a woodwork.column_schema.ColumnSchema, which we can acquire in much the same way as we can select a Series from the DataFrame:

[13]:

# Woodwork typing info for a Series
quantity = retail_df.ww["quantity"]
quantity.ww

[13]:

<Series: quantity (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'numeric'})>

[14]:

# A Woodwork ColumnSchema
quantity_schema = quantity.ww.schema
quantity_schema

[14]:

<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>

The quantity_schema object above can be understood as typing information for a single column that is not tied to any data. In this case, we happen to know where the column schema came from - it was the quantity column from the retail_df DataFrame. But we can also create a ColumnSchema that exists without being associated with any individual column of data.

If we look again at the retail_df table as a whole, we can see the similarities and differences between the columns, and we can describe those subsets of the DataFrame with ColumnSchema objects, or type spaces.

[15]:

retail_df.ww.schema

[15]:

	Logical Type	Semantic Tag(s)
Column
order_product_id	Categorical	['index']
order_id	Categorical	['category']
product_id	Categorical	['category']
description	NaturalLanguage	[]
quantity	Integer	['numeric']
order_date	Datetime	['time_index']
unit_price	Double	['numeric']
customer_name	Categorical	['category']
country	Categorical	['category']
total	Double	['numeric']
cancelled	Boolean	[]

Below are several ColumnSchemas that all would include our quantity column, but each of them describes a different type space. These ColumnSchemas get more restrictive as we go down:

<ColumnSchema > - No restrictions have been placed; any column falls into this definition.
<ColumnSchema (Semantic Tags = ['numeric'])> - Only columns with the numeric tag apply. This can include Double, Integer, and Age logical type columns as well.
<ColumnSchema (Logical Type = Integer)> - Only columns with logical type of Integer are included in this definition. Does not require the numeric tag, so an index column (which has its standard tags removed) would still apply
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])> - The column must have logical type Integer and have the numeric semantic tag, excluding index columns.

In this way, a ColumnSchema can define a type space under which columns in a Woodwork DataFrame can fall.
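The four type spaces listed above can be modeled with a small stand-in class. This is a sketch of the matching logic only: `TypeSpace` and its `includes` method are hypothetical and are not part of Woodwork's API, and the integer index column is invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypeSpace:
    """Hypothetical stand-in for a ColumnSchema used as a type space."""

    logical_type: Optional[str] = None
    semantic_tags: frozenset = frozenset()

    def includes(self, logical_type, semantic_tags):
        """A column falls in the space if it meets every stated restriction."""
        type_ok = self.logical_type is None or self.logical_type == logical_type
        tags_ok = self.semantic_tags <= set(semantic_tags)
        return type_ok and tags_ok


# (logical type, semantic tags) pairs; integer_index is a made-up column
quantity = ("Integer", {"numeric"})
unit_price = ("Double", {"numeric"})
integer_index = ("Integer", {"index"})

anything = TypeSpace()
numeric_only = TypeSpace(semantic_tags=frozenset({"numeric"}))
integer_only = TypeSpace(logical_type="Integer")
numeric_integer = TypeSpace("Integer", frozenset({"numeric"}))

for space in (anything, numeric_only, integer_only, numeric_integer):
    matches = [space.includes(*col) for col in (quantity, unit_price, integer_index)]
    print(space, matches)
```

The quantity column falls into all four spaces, while the index column drops out as soon as the numeric tag is required.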

Checking for nullable logical types#

Some logical types support having null values in the underlying data while others do not. This is entirely based on whether a logical type’s underlying primary_dtype supports null values. For example, the EmailAddress logical type has an underlying primary dtype of string. Pandas allows series with the dtype string to contain null values marked by the pandas.NA sentinel. Therefore, EmailAddress supports null values. On the other hand, the Integer logical type does not support null values since its underlying primary pandas dtype is int64. Pandas does not allow null values in series with the dtype int64. However, pandas does allow null values in series with the dtype Int64. Therefore, the IntegerNullable logical type supports null values since its primary dtype is Int64.

You can check if a column contains a nullable logical type by using nullable on the column accessor. The sections above that describe each type’s characteristics include information about whether or not a logical type is nullable.

[16]:

df.ww["bools_nullable"].ww.nullable

[16]:

True
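The int64 versus Int64 distinction described in this section comes from pandas itself and can be checked without Woodwork. The sketch below uses only pandas; the sample values are made up.

```python
import pandas as pd

values = [1, 2, None, 4]

# The nullable Int64 dtype stores the missing value
nullable = pd.Series(values, dtype="Int64")

# The int64 dtype cannot hold a missing value, so construction fails
try:
    pd.Series(values, dtype="int64")
    int64_accepts_null = True
except (TypeError, ValueError):
    int64_accepts_null = False

# The string dtype also supports missing values
emails = pd.Series(["a@example.com", None], dtype="string")

print(nullable.isna().tolist())
print(int64_accepts_null)
print(emails.isna().tolist())
```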