Understanding Logical Types and Semantic Tags#
In a Woodwork DataFrame, each column has three pieces of typing information associated with it: a physical type, a logical type, and semantic tags.
This guide offers an in-depth walk-through of all of the logical types and semantic tags that Woodwork defines in order to allow users to choose the logical types and semantic tags that most closely describe their data. As a reminder, here are quick definitions of Woodwork’s types:
Physical Type: defines how the data is stored on disk or in memory.
Logical Type: defines how the data should be parsed or interpreted.
Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.
Woodwork will attempt to infer a column’s LogicalType
if none is supplied at initialization. A column’s logical type will then inform which physical type and standard semantic tags are applied to it. However, setting the types manually will allow for more accurate typing of a DataFrame.
Having accurate typing information on a Woodwork DataFrame impacts how the data is parsed, transformed, and later interpreted downstream of Woodwork initialization. Therefore, understanding Woodwork’s logical types and semantic tags is essential to downstream usage of Woodwork.
For an in-depth guide on how to set and manipulate these types, see the Working with Types and Tags guide.
For information on how to customize Woodwork’s type system, see the Custom Types and Inference guide.
It’s important to remember that Woodwork columns will always have a logical type and that any semantic tags that are added by Woodwork are meant to add additional meaning onto that logical type. We’ll start out by looking in-depth at semantic tags so that when we get to logical types, we can better understand how a semantic tag might add additional information onto it.
Semantic Tags#
Here is the full set of Woodwork-defined semantic tags:
[1]:
import woodwork as ww
ww.list_semantic_tags()
[1]:
name | is_standard_tag | valid_logical_types | |
---|---|---|---|
0 | numeric | True | [Age, AgeFractional, AgeNullable, Double, Inte... |
1 | category | True | [Categorical, CountryCode, CurrencyCode, Ordin... |
2 | index | False | Any LogicalType |
3 | time_index | False | [Datetime, Age, AgeFractional, AgeNullable, Do... |
4 | date_of_birth | False | [Datetime] |
5 | ignore | False | Any LogicalType |
6 | passthrough | False | Any LogicalType |
Standard Tags#
Standard tags are associated with specific logical types. They are useful for indicating predefined categories that logical types might fall into.
'numeric'
- Is applied to any numeric logical type
Uses: Can select for just numeric columns when performing operations that require numeric columns
Related Properties:
series.ww.is_numeric
'category'
- Is applied to any logical type that is categorical in nature
Uses: Can select for just categorical columns when performing operations that require categorical columns
Related Properties:
series.ww.is_categorical
Index Tags#
Index tags are added by Woodwork to a DataFrame when an index
or time_index
column is identified by the user. These tags have some special properties that are only confirmed to be true in the context of a DataFrame (so any Series with these tags may not have these properties).
'index'
- Indicates that a column is the DataFrame’s index, or primary key
There will only be one index column
The contents of an index column will be unique
An index column will have any standard semantic tags associated with its logical type removed
In pandas DataFrames, the data in an index column will be reflected in the DataFrame’s underlying index
'time_index'
There will only be one time index column
A time index column will contain either datetime or numeric data
Other Tags#
The tags listed below may be added directly to columns during or after Woodwork initialization. They are tags that have suggested meanings and that can be added to columns that will be used in the manner described below. Woodwork will neither add them automatically to a DataFrame nor take direct action upon a column if they are present.
'date_of_birth'
- Indicates that a datetime column should be parsed as a date of birth
'ignore'/'passthrough'
- Indicates that a column should be ignored during feature engineering or model building but should still be passed through these operations so that the column is not lost.
Additional tags beyond the ones Woodwork adds at initialization may be useful for a DataFrame’s interpretability, so users are encouraged to add any tags that will allow them to use their data more efficiently.
Logical Types#
Below are all of the Logical Types that Woodwork defines.
[2]:
import woodwork as ww
ww.list_logical_types()
[2]:
name | type_string | description | physical_type | standard_tags | is_default_type | is_registered | parent_type | |
---|---|---|---|---|---|---|---|---|
0 | Address | address | Represents Logical Types that contain address ... | string | {} | True | True | None |
1 | Age | age | Represents Logical Types that contain whole nu... | int64 | {numeric} | True | True | Integer |
2 | AgeFractional | age_fractional | Represents Logical Types that contain non-nega... | float64 | {numeric} | True | True | Double |
3 | AgeNullable | age_nullable | Represents Logical Types that contain whole nu... | Int64 | {numeric} | True | True | IntegerNullable |
4 | Boolean | boolean | Represents Logical Types that contain binary v... | bool | {} | True | True | BooleanNullable |
5 | BooleanNullable | boolean_nullable | Represents Logical Types that contain binary v... | boolean | {} | True | True | None |
6 | Categorical | categorical | Represents Logical Types that contain unordere... | category | {category} | True | True | None |
7 | CountryCode | country_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
8 | CurrencyCode | currency_code | Represents Logical Types that use the ISO-4217... | category | {category} | True | True | Categorical |
9 | Datetime | datetime | Represents Logical Types that contain date and... | datetime64[ns] | {} | True | True | None |
10 | Double | double | Represents Logical Types that contain positive... | float64 | {numeric} | True | True | None |
11 | EmailAddress | email_address | Represents Logical Types that contain email ad... | string | {} | True | True | Unknown |
12 | Filepath | filepath | Represents Logical Types that specify location... | string | {} | True | True | None |
13 | IPAddress | ip_address | Represents Logical Types that contain IP addre... | string | {} | True | True | Unknown |
14 | Integer | integer | Represents Logical Types that contain positive... | int64 | {numeric} | True | True | IntegerNullable |
15 | IntegerNullable | integer_nullable | Represents Logical Types that contain positive... | Int64 | {numeric} | True | True | None |
16 | LatLong | lat_long | Represents Logical Types that contain latitude... | object | {} | True | True | None |
17 | NaturalLanguage | natural_language | Represents Logical Types that contain text or ... | string | {} | True | True | None |
18 | Ordinal | ordinal | Represents Logical Types that contain ordered ... | category | {category} | True | True | Categorical |
19 | PersonFullName | person_full_name | Represents Logical Types that may contain firs... | string | {} | True | True | None |
20 | PhoneNumber | phone_number | Represents Logical Types that contain numeric ... | string | {} | True | True | Unknown |
21 | PostalCode | postal_code | Represents Logical Types that contain a series... | category | {category} | True | True | Categorical |
22 | SubRegionCode | sub_region_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
23 | Timedelta | timedelta | Represents Logical Types that contain values s... | timedelta64[ns] | {} | True | True | Unknown |
24 | URL | url | Represents Logical Types that contain URLs, wh... | string | {} | True | True | Unknown |
25 | Unknown | unknown | Represents Logical Types that cannot be inferr... | string | {} | True | True | None |
In the DataFrame above, we can see a parent_type
column. The parent_type
of a LogicalType
refers to a logical type that is a more general version of the current LogicalType
. See the Custom Types and Type Inference guide for more details on how parent-child relationships between logical types impact Woodwork’s type inference.
Base LogicalType Class#
All logical types used by Woodwork are subclassed off of the base LogicalType
class, and since the following behaviors all exist on the LogicalType
class, all logical types have the following behavior:
All logical types define a dtype that will get used for any column with that logical type - this is how the physical type for a column gets determined.
All logical types perform some basic transformation into the expected physical type (dtype) - this is how Woodwork LogicalTypes act as a form of data-transformers. Depending on the requirements of a LogicalType, a LogicalType can transform input data into an expected format.
class LogicalType(object, metaclass=LogicalTypeMetaClass):
    """Base class for all other Logical Types"""

    type_string = ClassNameDescriptor()
    primary_dtype = 'string'
    standard_tags = set()
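To make the two behaviors above concrete, here is a minimal stand-in written with plain pandas. The class names `SketchLogicalType`, `SketchInteger`, and `SketchCategorical` and the `transform` method shown here are hypothetical and are not Woodwork's implementation; the sketch only shows how a class-level dtype and set of standard tags could drive a cast to the expected physical type.

```python
import pandas as pd


class SketchLogicalType:
    """Stand-in for a logical type base class; not Woodwork's implementation."""

    primary_dtype = "string"
    standard_tags = set()

    def transform(self, series):
        # Cast to the expected physical type only when it differs.
        if str(series.dtype) != self.primary_dtype:
            series = series.astype(self.primary_dtype)
        return series


class SketchInteger(SketchLogicalType):
    primary_dtype = "int64"
    standard_tags = {"numeric"}


class SketchCategorical(SketchLogicalType):
    primary_dtype = "category"
    standard_tags = {"category"}


as_int = SketchInteger().transform(pd.Series([1.0, 2.0, 3.0]))
as_cat = SketchCategorical().transform(pd.Series(["a", "b", "a"]))
print(as_int.dtype, SketchInteger.standard_tags)
print(as_cat.dtype, SketchCategorical.standard_tags)
```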
Default Logical Type#
Unknown#
When Woodwork’s type inference does not return any LogicalTypes for a column, Woodwork will set the column’s logical type as the default LogicalType, Unknown
. A logical type being inferred as Unknown
may be a good indicator that a more specific logical type can be chosen and set by the user.
physical type:
string
Below is an example of a column for which no logical type is inferred, resulting in a Series with Unknown
logical type. Looking at the contents of the Series, though, we can see that it contains country codes, so we set the logical type to CountryCode
.
[3]:
import pandas as pd
series = pd.Series(["AU", "US", "UA"])
unknown_series = ww.init_series(series)
unknown_series.ww
[3]:
<Series: None (Physical Type = string) (Logical Type = Unknown) (Semantic Tags = set())>
[4]:
countrycode_series = ww.init_series(unknown_series, "CountryCode")
countrycode_series.ww
[4]:
<Series: None (Physical Type = category) (Logical Type = CountryCode) (Semantic Tags = {'category'})>
Numeric Logical Types#
Age#
Represents Logical Types that contain whole numbers indicating a person’s age.
physical type:
int64
standard tags:
{'numeric'}
AgeFractional#
Represents Logical Types that contain non-negative floating point numbers indicating a person’s age. May contain null values.
physical type:
float64
standard tags:
{'numeric'}
AgeNullable#
Represents Logical Types that contain whole numbers indicating a person’s age. May contain null values.
physical type:
Int64
standard tags:
{'numeric'}
Double#
Represents Logical Types that contain positive and negative numbers, some of which include a fractional component.
physical type:
float64
standard tags:
{'numeric'}
Integer#
Represents Logical Types that contain positive and negative numbers without a fractional component, including zero (0).
physical type:
int64
standard tags:
{'numeric'}
IntegerNullable#
Represents Logical Types that contain positive and negative numbers without a fractional component, including zero (0). May contain null values.
physical type:
Int64
standard tags:
{'numeric'}
Below is a DataFrame with examples of several of the numeric LogicalTypes.
[5]:
numerics_df = pd.DataFrame(
{
"ints": [1, 2, 3, 4],
"ints_nullable": pd.Series([1, 2, None, 4], dtype="Int64"),
"floats": [0.0, 1.1, 2.2, 3.3],
"ages": [18, 22, 24, 34],
"ages_nullable": [None, 2, 22, 33],
}
)
numerics_df.ww.init(logical_types={"ages": "Age", "ages_nullable": "AgeNullable"})
numerics_df.ww
[5]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
ints | int64 | Integer | ['numeric'] |
ints_nullable | Int64 | IntegerNullable | ['numeric'] |
floats | float64 | Double | ['numeric'] |
ages | int64 | Age | ['numeric'] |
ages_nullable | Int64 | AgeNullable | ['numeric'] |
Categorical Logical Types#
Categorical#
Represents a Logical Type with few unique values relative to the size of the data.
physical type:
category
inference: Woodwork defines a threshold for percentage unique values relative to the size of the series below which a series will be considered categorical. See setting config options guide for more information on how to control this threshold.
Some examples of data for which the Categorical logical type would apply:
Gender
Eye Color
Nationality
Hair Color
Spoken Language
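The threshold-based inference described for Categorical can be sketched in a few lines of plain Python. The function name `looks_categorical` and the threshold value of 0.2 are assumptions chosen for illustration; they are not Woodwork's inference function or its configured default.

```python
def looks_categorical(values, threshold=0.2):
    """Return True when the share of unique values is below `threshold`.

    `threshold` is an illustrative value, not Woodwork's configured default.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    unique_ratio = len(set(non_null)) / len(non_null)
    return unique_ratio < threshold


# 3 unique values across 30 rows gives a ratio of 0.1.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"] * 5
# 30 unique values across 30 rows gives a ratio of 1.0.
user_ids = [f"user-{i}" for i in range(30)]

print(looks_categorical(eye_colors))
print(looks_categorical(user_ids))
```

A column of repeated eye colors falls under the threshold, while a column of unique identifiers does not, which is why identifier-like columns are usually better served by another logical type.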
CountryCode#
Represents Logical Types that use the ISO-3166 standard country code to represent countries. ISO 3166-1 (countries) codes are supported. These codes should be in the Alpha-2 format.
physical type:
category
standard tags:
{'category'}
For example: 'AU'
for Australia, 'CN'
for China, and 'CA'
for Canada.
Ordinal#
An Ordinal variable type can take ordered discrete values. Similar to Categorical, it usually has a limited and fixed number of possible values. However, these discrete values have a certain order, and the ordering is important to understanding the values. Ordinal variable types can be represented as strings or integers.
physical type:
category
standard tags:
{'category'}
parameters:
order
- the order of the ordinal values in the column from low to high
validation - an order must be defined for an Ordinal column on a DataFrame or Series, and all elements of the order must be present.
Some examples of data for which the Ordinal logical type would apply:
Educational Background (Elementary, High School, Undergraduate, Graduate)
Satisfaction Rating (Not Satisfied, Satisfied, Very Satisfied)
Spicy Level (Hot, Hotter, Hottest)
Student Grade (A, B, C, D, F)
Size (small, medium, large)
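Because Ordinal columns are stored with the `category` physical type, the effect of supplying an order can be illustrated with a plain pandas ordered categorical. This is a sketch of the idea rather than Woodwork's own code, and it assumes the validation rule means that every value in the data must appear in the supplied order.

```python
import pandas as pd

order = ["small", "medium", "large"]  # low to high
sizes = pd.Series(["small", "large", "large", "medium"])

# Assumed validation rule: every value in the data must appear in the order.
missing = set(sizes.dropna()) - set(order)
if missing:
    raise ValueError(f"values not present in order: {sorted(missing)}")

ordered = sizes.astype(pd.CategoricalDtype(categories=order, ordered=True))

# With an order in place, comparisons and min/max become meaningful.
print(ordered.min(), ordered.max())
print((ordered > "small").tolist())
```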
PostalCode#
Represents Logical Types that contain a series of postal codes for representing a group of addresses.
physical type:
category
standard tags:
{'category'}
SubRegionCode#
Represents Logical Types that use the ISO-3166 standard sub-region code to represent a portion of a larger geographic region. ISO 3166-2 (sub-regions) codes are supported. These codes should be in the Alpha-2 format.
physical type:
category
standard tags:
{'category'}
For example: 'US-IL'
to represent Illinois in the United States or 'AU-TAS'
to represent Tasmania in Australia.
[6]:
categoricals_df = pd.DataFrame(
{
"categorical": pd.Series(["a", "b", "a", "a"], dtype="category"),
"ordinal": ["small", "large", "large", "medium"],
"country_code": ["AU", "US", "UA", "AU"],
"postal_code": ["90210", "60035", "SW1A", "90210"],
"sub_region_code": ["AU-NSW", "AU-TAS", "AU-QLD", "AU-QLD"],
}
)
categoricals_df.ww.init(
logical_types={
"ordinal": ww.logical_types.Ordinal(order=["small", "medium", "large"]),
"country_code": "CountryCode",
"postal_code": "PostalCode",
"sub_region_code": "SubRegionCode",
}
)
categoricals_df.ww
[6]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
categorical | category | Categorical | ['category'] |
ordinal | category | Ordinal: ['small', 'medium', 'large'] | ['category'] |
country_code | category | CountryCode | ['category'] |
postal_code | category | PostalCode | ['category'] |
sub_region_code | category | SubRegionCode | ['category'] |
Miscellaneous Logical Types with Specific Formats#
Boolean#
Represents Logical Types that contain binary values indicating true/false.
physical type:
bool
BooleanNullable#
Represents Logical Types that contain binary values indicating true/false. May also contain null values.
physical type:
boolean
Datetime#
A Datetime is a representation of a date and/or time. Datetime variable types can be represented as strings, or integers.
physical type:
datetime64[ns]
transformation: Will convert valid strings or numbers to pandas datetimes, and will parse more datetime formats with the use of the datetime_format parameter.
parameters:
datetime_format - the format of the datetimes in the column, ex: '%Y-%m-%d' vs '%m-%d-%Y'
Some examples of Datetime include:
Transaction Time
Flight Departure Time
Pickup Time
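To see why an explicit datetime format can matter, the sketch below uses Python's standard `datetime.strptime` rather than Woodwork's own parsing. It shows that the same string can map to two different dates depending on the declared format, and that a format that does not match the data is rejected. The sample strings are made up for illustration.

```python
from datetime import datetime

raw = "01-02-2019"

# The same string yields different dates depending on the declared format.
month_first = datetime.strptime(raw, "%m-%d-%Y")
day_first = datetime.strptime(raw, "%d-%m-%Y")
print(month_first.date())  # 2019-01-02
print(day_first.date())  # 2019-02-01

# A format that does not match the data raises a ValueError.
try:
    datetime.strptime("2019/01/01", "%Y-%m-%d")
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(f"format mismatch: {err}")
```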
EmailAddress#
Represents Logical Types that contain email address values.
physical type:
string
inference: Uses an email address regex that, if the data matches, means that the column contains email addresses. To learn more about controlling the regex used, see the setting config options guide.
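The regex-based inference described above can be sketched with the standard `re` module. The pattern below is a deliberately simplified, hypothetical one and is not the regex Woodwork ships with; the sketch also assumes that every non-null value has to match for the column to count as email addresses. The sample addresses are made up.

```python
import re

# Hypothetical, simplified pattern; not the regex Woodwork ships with.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def all_match_email(values):
    """Return True when every non-null value looks like an email address."""
    non_null = [v for v in values if v is not None]
    return bool(non_null) and all(EMAIL_PATTERN.match(v) for v in non_null)


emails = ["alice@example.com", "bob@example.org", None]
mixed = ["alice@example.com", "not an email"]

print(all_match_email(emails))
print(all_match_email(mixed))
```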
LatLong#
A LatLong represents an ordered pair (Latitude, Longitude) that identifies a location on Earth. The order of the tuple is important. LatLongs can be represented as a tuple of floating point numbers.
physical type:
object
transformation: Will convert inputs into a tuple of floats. Any null values will be stored as
np.nan
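The LatLong transformation can be approximated in plain Python as follows. The helper names `to_latlong` and `_to_float` are hypothetical and the accepted string formats are assumptions; Woodwork's actual conversion may accept more or fewer input shapes. `math.nan` stands in for `np.nan` so that the sketch needs no third-party imports.

```python
import math


def _to_float(part):
    """Convert one coordinate to a float, mapping None to NaN."""
    if part is None:
        return math.nan
    return float(part)


def to_latlong(value):
    """Convert one raw value into a (lat, long) tuple of floats, or NaN."""
    if value is None:
        return math.nan
    if isinstance(value, str):
        # Accept "[lat, long]", "(lat, long)", or "lat, long".
        parts = value.strip("[]() ").split(",")
    else:
        parts = list(value)
    if len(parts) != 2:
        raise ValueError(f"expected two coordinates, got {value!r}")
    return tuple(_to_float(p) for p in parts)


raw = [
    "[33.670914, -117.841501]",
    "40.423599, -86.921162",
    (-45.031705, None),
    None,
]
converted = [to_latlong(v) for v in raw]
print(converted)
```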
Timedelta#
Represents Logical Types that contain values specifying a duration of time.
physical type:
timedelta64[ns]
Examples could include:
Days/months/years since some event
How long a flight’s arrival was delayed/early
Days until birthday
Below, we’ll see a DataFrame that contains data for each of these logical types. Some columns like dates
and latlongs
will have their data transformed to a format that Woodwork expects.
[7]:
df = pd.DataFrame(
{
"dates": ["2019/01/01", "2019/01/02", "2019/01/03", "2019/01/03"],
"latlongs": [
"[33.670914, -117.841501]",
"40.423599, -86.921162",
(-45.031705, None),
None,
],
"booleans": [True, True, False, True],
"bools_nullable": pd.Series([True, False, True, None], dtype="boolean"),
"timedelta": [
pd.Timedelta("1 days 00:00:00"),
pd.Timedelta("-1 days +23:40:00"),
pd.Timedelta("4 days 12:00:00"),
pd.Timedelta("-1 days +23:40:00"),
],
"emails": [
"[email protected]",
"[email protected]",
"[email protected]",
"[email protected]",
],
}
)
df
[7]:
dates | latlongs | booleans | bools_nullable | timedelta | emails | |
---|---|---|---|---|---|---|
0 | 2019/01/01 | [33.670914, -117.841501] | True | True | 1 days 00:00:00 | [email protected] |
1 | 2019/01/02 | 40.423599, -86.921162 | True | False | -1 days +23:40:00 | [email protected] |
2 | 2019/01/03 | (-45.031705, None) | False | True | 4 days 12:00:00 | [email protected] |
3 | 2019/01/03 | None | True | <NA> | -1 days +23:40:00 | [email protected] |
[8]:
df.ww.init(
logical_types={
"latlongs": "LatLong",
"dates": ww.logical_types.Datetime(datetime_format="%Y/%m/%d"),
}
)
df.ww
[8]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
dates | datetime64[ns] | Datetime | [] |
latlongs | object | LatLong | [] |
booleans | bool | Boolean | [] |
bools_nullable | boolean | BooleanNullable | [] |
timedelta | timedelta64[ns] | Timedelta | [] |
emails | string | EmailAddress | [] |
[9]:
df
[9]:
dates | latlongs | booleans | bools_nullable | timedelta | emails | |
---|---|---|---|---|---|---|
0 | 2019-01-01 | (33.670914, -117.841501) | True | True | 1 days 00:00:00 | [email protected] |
1 | 2019-01-02 | (40.423599, -86.921162) | True | False | -1 days +23:40:00 | [email protected] |
2 | 2019-01-03 | (-45.031705, nan) | False | True | 4 days 12:00:00 | [email protected] |
3 | 2019-01-03 | NaN | True | <NA> | -1 days +23:40:00 | [email protected] |
String Logical Types#
NaturalLanguage#
Represents Logical Types that contain long-form text or characters representing natural human language.
physical type:
string
Examples of natural language data:
“Any additional comments” in a feedback form
Customer Review
Patient Notes
Address#
Represents Logical Types that contain address values.
physical type:
string
Filepath#
Represents Logical Types that specify locations of directories and files in a file system.
physical type:
string
PersonFullName#
Represents Logical Types that may contain first, middle and last names, including honorifics and suffixes.
physical type:
string
PhoneNumber#
Represents Logical Types that contain numeric digits and characters representing a phone number.
physical type:
string
URL#
Represents Logical Types that contain URLs, which may include protocol, hostname and file name.
physical type:
string
IPAddress#
Represents Logical Types that contain IP addresses, including both IPv4 and IPv6 addresses.
physical type:
string
[10]:
strings_df = pd.DataFrame(
{
"natural_language": [
"This is a short sentence.",
"I like to eat pizza!",
"When will humans go to mars?",
"This entry contains two sentences. Second sentence.",
],
"addresses": [
"1 Miller Drive, New York, NY 12345",
"1 Berkeley Street, Boston, MA 67891",
"26387 Russell Hill, Dallas, TX 34521",
"54305 Oxford Street, Seattle, WA 95132",
],
"filepaths": [
"/usr/local/bin",
"/Users/john.smith/dev/index.html",
"/tmp",
"../woodwork",
],
"full_names": [
"Mr. John Doe, Jr.",
"Doe, Mrs. Jane",
"James Brown",
"John Smith",
],
"phone_numbers": [
"1-(555)-123-5495",
"+1-555-123-5495",
"5551235495",
"111-222-3333",
],
"urls": [
"http://google.com",
"https://example.com/index.html",
"example.com",
"https://woodwork.alteryx.com/",
],
"ip_addresses": [
"172.16.254.1",
"192.0.0.0",
"2001:0db8:0000:0000:0000:ff00:0042:8329",
"192.0.0.0",
],
}
)
strings_df.ww.init(
logical_types={
"natural_language": "NaturalLanguage",
"addresses": "Address",
"filepaths": "FilePath",
"full_names": "PersonFullName",
"phone_numbers": "PhoneNumber",
"urls": "URL",
"ip_addresses": "IPAddress",
}
)
strings_df.ww
[10]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
natural_language | string | NaturalLanguage | [] |
addresses | string | Address | [] |
filepaths | string | Filepath | [] |
full_names | string | PersonFullName | [] |
phone_numbers | string | PhoneNumber | [] |
urls | string | URL | [] |
ip_addresses | string | IPAddress | [] |
ColumnSchema objects#
Now that we’ve gone in-depth on semantic tags and logical types, we can start to understand how they’re used together to build Woodwork tables and define type spaces.
A ColumnSchema
is the typing information for a single column. We can obtain a ColumnSchema
from a Woodwork-initialized DataFrame as follows:
[11]:
# Woodwork typing info for a DataFrame
retail_df = ww.demo.load_retail()
retail_df.ww
[11]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
order_product_id | category | Categorical | ['index'] |
order_id | category | Categorical | ['category'] |
product_id | category | Categorical | ['category'] |
description | string | NaturalLanguage | [] |
quantity | int64 | Integer | ['numeric'] |
order_date | datetime64[ns] | Datetime | ['time_index'] |
unit_price | float64 | Double | ['numeric'] |
customer_name | category | Categorical | ['category'] |
country | category | Categorical | ['category'] |
total | float64 | Double | ['numeric'] |
cancelled | bool | Boolean | [] |
Above is the typing information for a Woodwork DataFrame. If we want, we can access just the schema of typing information outside of the context of the actual data in the DataFrame.
[12]:
# A Woodwork TableSchema
retail_df.ww.schema
[12]:
Logical Type | Semantic Tag(s) | |
---|---|---|
Column | ||
order_product_id | Categorical | ['index'] |
order_id | Categorical | ['category'] |
product_id | Categorical | ['category'] |
description | NaturalLanguage | [] |
quantity | Integer | ['numeric'] |
order_date | Datetime | ['time_index'] |
unit_price | Double | ['numeric'] |
customer_name | Categorical | ['category'] |
country | Categorical | ['category'] |
total | Double | ['numeric'] |
cancelled | Boolean | [] |
The representation of the woodwork.table_schema.TableSchema
is only different in that it does not have a column for the physical types.
This lack of a physical type is due to the fact that a TableSchema
has no data, and therefore no physical representation of the data. We often rely on physical typing information to know the exact pandas operations that are valid for a DataFrame, but for a schema of typing information that is not tied to data, those operations are not relevant.
Now, let’s look at a single column of typing information, or a woodwork.column_schema.ColumnSchema
that we can acquire in much the same way as we can select a Series from the DataFrame:
[13]:
# Woodwork typing info for a Series
quantity = retail_df.ww["quantity"]
quantity.ww
[13]:
<Series: quantity (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'numeric'})>
[14]:
# A Woodwork ColumnSchema
quantity_schema = quantity.ww.schema
quantity_schema
[14]:
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
The quantity_schema
object above can be understood as typing information for a single column that is not tied to any data. In this case, we happen to know where the column schema came from - it was the quantity
column from the retail_df
DataFrame. But we can also create a ColumnSchema
that exists without being associated with any individual column of data.
If we look again at the retail_df
table as a whole, we can see the similarities and differences between the columns, and we can describe those subsets of the DataFrame with ColumnSchema
objects, or type spaces.
[15]:
retail_df.ww.schema
[15]:
Logical Type | Semantic Tag(s) | |
---|---|---|
Column | ||
order_product_id | Categorical | ['index'] |
order_id | Categorical | ['category'] |
product_id | Categorical | ['category'] |
description | NaturalLanguage | [] |
quantity | Integer | ['numeric'] |
order_date | Datetime | ['time_index'] |
unit_price | Double | ['numeric'] |
customer_name | Categorical | ['category'] |
country | Categorical | ['category'] |
total | Double | ['numeric'] |
cancelled | Boolean | [] |
Below are several ColumnSchema objects that would all include our quantity column, but each of them describes a different type space. These ColumnSchema objects get more restrictive as we go down:
<ColumnSchema >
- No restrictions have been placed; any column falls into this definition.
<ColumnSchema (Semantic Tags = ['numeric'])>
- Only columns with the numeric tag apply. This can include Double, Integer, and Age logical type columns as well.
<ColumnSchema (Logical Type = Integer)>
- Only columns with logical type of Integer are included in this definition. Does not require the numeric tag, so an index column (which has its standard tags removed) would still apply.
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
- The column must have logical type Integer and have the numeric semantic tag, excluding index columns.
In this way, a ColumnSchema
can define a type space under which columns in a Woodwork DataFrame can fall.
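The idea of a type space can be sketched without Woodwork at all. The `TypeSpace` class and its `includes` method below are hypothetical stand-ins for a ColumnSchema used as a filter, and the matching rule (exact logical type name, plus all required tags present) is an assumption made for illustration; in particular, it ignores parent-child relationships between logical types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class TypeSpace:
    """Hypothetical stand-in for a ColumnSchema used as a filter."""

    logical_type: Optional[str] = None
    semantic_tags: frozenset = field(default_factory=frozenset)

    def includes(self, column_logical_type, column_tags):
        # A logical type restriction must match exactly when it is set.
        if self.logical_type is not None and self.logical_type != column_logical_type:
            return False
        # Every required tag must be present on the column.
        return self.semantic_tags <= set(column_tags)


# The four type spaces from the list above, least to most restrictive.
spaces = [
    TypeSpace(),
    TypeSpace(semantic_tags=frozenset({"numeric"})),
    TypeSpace(logical_type="Integer"),
    TypeSpace(logical_type="Integer", semantic_tags=frozenset({"numeric"})),
]

columns = {
    "quantity": ("Integer", {"numeric"}),
    "integer_index": ("Integer", {"index"}),  # standard tags removed
    "unit_price": ("Double", {"numeric"}),
}

results = {
    name: [space.includes(ltype, tags) for space in spaces]
    for name, (ltype, tags) in columns.items()
}
for name, matches in results.items():
    print(name, matches)
```

Here `quantity` falls into all four type spaces, an Integer index column falls only into the unrestricted and Integer-only spaces, and `unit_price` falls only into the unrestricted and numeric-tag spaces.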
Checking for nullable logical types#
Some logical types support having null values in the underlying data while others do not. This is entirely based on whether a logical type’s underlying primary_dtype
supports null values. For example, the EmailAddress
logical type has an underlying primary dtype of string
. Pandas allows series with the dtype string
to contain null values marked by the pandas.NA
sentinel. Therefore, EmailAddress
supports null values. On the other hand, the Integer
logical type does not
support null values since its underlying primary pandas dtype is int64
. Pandas does not allow null values in series with the dtype int64
. However, pandas does allow null values in series with the dtype Int64
. Therefore, the IntegerNullable
logical type supports null values since its primary dtype is Int64
.
You can check if a column contains a nullable logical type by using nullable
on the column accessor. The sections above that describe each type’s characteristics include information about whether or not a logical type is nullable.
[16]:
df.ww["bools_nullable"].ww.nullable
[16]:
True
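The difference between the `int64` and `Int64` dtypes discussed above can be seen directly in pandas, independent of Woodwork. This short sketch builds the same made-up values twice and compares the resulting dtypes.

```python
import pandas as pd

values = [1, 2, None]

# Without a nullable dtype, pandas falls back to float64 to hold the gap.
default_series = pd.Series(values)
# The nullable extension dtype keeps integers and stores the gap as pd.NA.
nullable_series = pd.Series(values, dtype="Int64")

print(default_series.dtype)  # float64
print(nullable_series.dtype)  # Int64
print(nullable_series.isna().tolist())  # [False, False, True]
```

This is why a column with missing whole numbers needs a nullable logical type such as IntegerNullable or AgeNullable if it is to keep an integer physical type.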