Customizing Logical Types and Type Inference¶
The default type system in Woodwork contains many built-in LogicalTypes that work for a wide variety of datasets. For situations in which the built-in LogicalTypes are not sufficient, Woodwork allows you to create custom LogicalTypes.
Woodwork also has a set of standard type inference functions that help automatically identify the correct LogicalTypes in the data. You can override these existing functions, or add new functions for inferring any custom LogicalTypes that you add.
This guide provides an overview of how to create custom LogicalTypes as well as how to override and add new type inference functions. If you need to learn more about the existing types and tags in Woodwork, refer to the Understanding Logical Types and Semantic Tags guide for more detail. If you need to learn more about how to set and update these types and tags on a DataFrame, refer to the Working with Types and Tags guide for more detail.
Viewing Built-In Logical Types¶
To view all of the default LogicalTypes in Woodwork, use the list_logical_types
function. If the existing types are not sufficient for your needs, you can create and register new LogicalTypes for use with Woodwork initialized DataFrames and Series.
[1]:
import woodwork as ww
ww.list_logical_types()
[1]:
 | name | type_string | description | physical_type | standard_tags | is_default_type | is_registered | parent_type |
---|---|---|---|---|---|---|---|---|
0 | Address | address | Represents Logical Types that contain address ... | string | {} | True | True | None |
1 | Age | age | Represents Logical Types that contain whole nu... | int64 | {numeric} | True | True | Integer |
2 | AgeFractional | age_fractional | Represents Logical Types that contain non-nega... | float64 | {numeric} | True | True | Double |
3 | AgeNullable | age_nullable | Represents Logical Types that contain whole nu... | Int64 | {numeric} | True | True | IntegerNullable |
4 | Boolean | boolean | Represents Logical Types that contain binary v... | bool | {} | True | True | BooleanNullable |
5 | BooleanNullable | boolean_nullable | Represents Logical Types that contain binary v... | boolean | {} | True | True | None |
6 | Categorical | categorical | Represents Logical Types that contain unordere... | category | {category} | True | True | None |
7 | CountryCode | country_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
8 | Datetime | datetime | Represents Logical Types that contain date and... | datetime64[ns] | {} | True | True | None |
9 | Double | double | Represents Logical Types that contain positive... | float64 | {numeric} | True | True | None |
10 | EmailAddress | email_address | Represents Logical Types that contain email ad... | string | {} | True | True | None |
11 | Filepath | filepath | Represents Logical Types that specify location... | string | {} | True | True | None |
12 | IPAddress | ip_address | Represents Logical Types that contain IP addre... | string | {} | True | True | None |
13 | Integer | integer | Represents Logical Types that contain positive... | int64 | {numeric} | True | True | IntegerNullable |
14 | IntegerNullable | integer_nullable | Represents Logical Types that contain positive... | Int64 | {numeric} | True | True | None |
15 | LatLong | lat_long | Represents Logical Types that contain latitude... | object | {} | True | True | None |
16 | NaturalLanguage | natural_language | Represents Logical Types that contain text or ... | string | {} | True | True | None |
17 | Ordinal | ordinal | Represents Logical Types that contain ordered ... | category | {category} | True | True | Categorical |
18 | PersonFullName | person_full_name | Represents Logical Types that may contain firs... | string | {} | True | True | None |
19 | PhoneNumber | phone_number | Represents Logical Types that contain numeric ... | string | {} | True | True | None |
20 | PostalCode | postal_code | Represents Logical Types that contain a series... | category | {category} | True | True | Categorical |
21 | SubRegionCode | sub_region_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
22 | Timedelta | timedelta | Represents Logical Types that contain values s... | timedelta64[ns] | {} | True | True | None |
23 | URL | url | Represents Logical Types that contain URLs, wh... | string | {} | True | True | None |
24 | Unknown | unknown | Represents Logical Types that cannot be inferr... | string | {} | True | True | None |
Registering a New LogicalType¶
The first step in registering a new LogicalType is to define the class for the new type. This is done by sub-classing the built-in LogicalType
class. There are a few class attributes that should be set when defining this new class. Each is reviewed in more detail below.
This guide works through an example for a dataset that contains UPC Codes. First, create a new UPCCode
LogicalType. For this example, consider the UPC Code to be a type of categorical variable.
[2]:
from woodwork.logical_types import LogicalType


class UPCCode(LogicalType):
    """Represents Logical Types that contain 12-digit UPC Codes."""

    primary_dtype = 'category'
    backup_dtype = 'string'
    standard_tags = {'category', 'upc_code'}
When defining the UPCCode
LogicalType class, three class attributes were set. All three of these attributes are optional, and will default to the values defined on the LogicalType
class if they are not set when defining the new type.
primary_dtype: This value specifies how the data will be stored. If the column of the DataFrame is not already of this type, Woodwork will convert the data to this dtype. This should be specified as a string that represents a valid pandas dtype. If not specified, this will default to 'string'.
backup_dtype: This is primarily useful when working with Koalas DataFrames. backup_dtype specifies the dtype to use if Woodwork is unable to convert to the dtype specified by primary_dtype. In our example, we set this to 'string' since Koalas does not currently support the 'category' dtype.
standard_tags: This is a set of semantic tags to apply to any column that is set with the specified LogicalType. If not specified, standard_tags will default to an empty set.
docstring: Adding a docstring for the class is optional, but if specified, this text will be used for adding a description of the type in the list of available types returned by ww.list_logical_types().
Note
Behind the scenes, Woodwork uses the category
and numeric
semantic tags to determine whether a column is a categorical or numeric column, respectively. If the new LogicalType you define represents a categorical or numeric type, you should include the appropriate tag in the set of tags specified for standard_tags
.
Now that you have created the new LogicalType, you can register it with the Woodwork type system so you can use it. All modifications to the type system are performed by calling the appropriate method on the ww.type_system
object.
[3]:
ww.type_system.add_type(UPCCode, parent='Categorical')
If you once again list the available LogicalTypes, you will see the new type you created was added to the list, including the values for description, physical_type and standard_tags specified when defining the UPCCode
LogicalType.
[4]:
ww.list_logical_types()
[4]:
 | name | type_string | description | physical_type | standard_tags | is_default_type | is_registered | parent_type |
---|---|---|---|---|---|---|---|---|
0 | Address | address | Represents Logical Types that contain address ... | string | {} | True | True | None |
1 | Age | age | Represents Logical Types that contain whole nu... | int64 | {numeric} | True | True | Integer |
2 | AgeFractional | age_fractional | Represents Logical Types that contain non-nega... | float64 | {numeric} | True | True | Double |
3 | AgeNullable | age_nullable | Represents Logical Types that contain whole nu... | Int64 | {numeric} | True | True | IntegerNullable |
4 | Boolean | boolean | Represents Logical Types that contain binary v... | bool | {} | True | True | BooleanNullable |
5 | BooleanNullable | boolean_nullable | Represents Logical Types that contain binary v... | boolean | {} | True | True | None |
6 | Categorical | categorical | Represents Logical Types that contain unordere... | category | {category} | True | True | None |
7 | CountryCode | country_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
8 | Datetime | datetime | Represents Logical Types that contain date and... | datetime64[ns] | {} | True | True | None |
9 | Double | double | Represents Logical Types that contain positive... | float64 | {numeric} | True | True | None |
10 | EmailAddress | email_address | Represents Logical Types that contain email ad... | string | {} | True | True | None |
11 | Filepath | filepath | Represents Logical Types that specify location... | string | {} | True | True | None |
12 | IPAddress | ip_address | Represents Logical Types that contain IP addre... | string | {} | True | True | None |
13 | Integer | integer | Represents Logical Types that contain positive... | int64 | {numeric} | True | True | IntegerNullable |
14 | IntegerNullable | integer_nullable | Represents Logical Types that contain positive... | Int64 | {numeric} | True | True | None |
15 | LatLong | lat_long | Represents Logical Types that contain latitude... | object | {} | True | True | None |
16 | NaturalLanguage | natural_language | Represents Logical Types that contain text or ... | string | {} | True | True | None |
17 | Ordinal | ordinal | Represents Logical Types that contain ordered ... | category | {category} | True | True | Categorical |
18 | PersonFullName | person_full_name | Represents Logical Types that may contain firs... | string | {} | True | True | None |
19 | PhoneNumber | phone_number | Represents Logical Types that contain numeric ... | string | {} | True | True | None |
20 | PostalCode | postal_code | Represents Logical Types that contain a series... | category | {category} | True | True | Categorical |
21 | SubRegionCode | sub_region_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
22 | Timedelta | timedelta | Represents Logical Types that contain values s... | timedelta64[ns] | {} | True | True | None |
23 | UPCCode | upc_code | Represents Logical Types that contain 12-digit... | category | {category, upc_code} | False | True | Categorical |
24 | URL | url | Represents Logical Types that contain URLs, wh... | string | {} | True | True | None |
25 | Unknown | unknown | Represents Logical Types that cannot be inferr... | string | {} | True | True | None |
Logical Type Relationships¶
When adding a new type to the type system, you can specify an optional parent LogicalType, as done above. When performing type inference, a given set of data might match multiple different LogicalTypes. Woodwork uses the parent-child relationship defined when registering a type to determine which type to infer in this case.
When multiple matches are found, Woodwork will return the most specific type match found. By setting the parent type to Categorical
when registering the UPCCode
LogicalType, you are telling Woodwork that if a data column matches both Categorical
and UPCCode
during inference, the column should be considered as UPCCode
as this is more specific than Categorical
. Woodwork always assumes that a child type is a more specific version of the parent type.
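The selection rule described above can be illustrated with a toy model. The sketch below is not Woodwork's implementation; the PARENTS mapping and the ancestors and most_specific helpers are hypothetical names used only to show how a parent-child relationship lets the more specific match win.

```python
# Toy model of "most specific match wins"; not Woodwork's actual code.
# PARENTS maps each type name to its parent (None marks a root type).
PARENTS = {
    "Categorical": None,
    "UPCCode": "Categorical",
    "Unknown": None,
}


def ancestors(type_name):
    """Return every ancestor of type_name, nearest parent first."""
    found = []
    parent = PARENTS[type_name]
    while parent is not None:
        found.append(parent)
        parent = PARENTS[parent]
    return found


def most_specific(matches):
    """Drop any match that is an ancestor of another match."""
    shadowed = {anc for name in matches for anc in ancestors(name)}
    return [name for name in matches if name not in shadowed]


# A column that matches both types resolves to the child type.
print(most_specific(["Categorical", "UPCCode"]))
```

In this model, a column matching both Categorical and UPCCode resolves to UPCCode alone, because Categorical is an ancestor of UPCCode and is therefore dropped.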
Working with Custom LogicalTypes¶
Next, you will create a small sample DataFrame to demonstrate use of the new custom type. This sample DataFrame includes an id column, a column with valid UPC Codes, and a column that should not be considered UPC Codes because it contains non-numeric values.
[5]:
import pandas as pd
df = pd.DataFrame({
'id': [0, 1, 2, 3],
'code': ['012345412359', '122345712358', '012345412359', '012345412359'],
'not_upc': ['abcdefghijkl', '122345712358', '012345412359', '022323413459']
})
Use a with block to temporarily override Woodwork’s default threshold for differentiating between an Unknown
and Categorical
column so that Woodwork will correctly recognize the code
column as a Categorical
column. After setting the threshold, initialize Woodwork and verify that Woodwork has identified our column as Categorical
.
[6]:
with ww.config.with_options(categorical_threshold=0.5):
    df.ww.init()
df.ww
[6]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
id | int64 | Integer | ['numeric'] |
code | category | Categorical | ['category'] |
not_upc | string | Unknown | [] |
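To see why a threshold of 0.5 is enough for the code column, it helps to compare the share of distinct values in each column. The sketch below assumes that the threshold is compared against the ratio of unique values to total values; unique_ratio is a hypothetical helper written in plain Python, not a Woodwork function.

```python
def unique_ratio(values):
    """Fraction of entries in values that are distinct."""
    return len(set(values)) / len(values)


code = ['012345412359', '122345712358', '012345412359', '012345412359']
not_upc = ['abcdefghijkl', '122345712358', '012345412359', '022323413459']

print(unique_ratio(code))     # 0.5
print(unique_ratio(not_upc))  # 1.0
```

Under that assumption, the code column sits exactly at the 0.5 threshold, while every value in not_upc is distinct, which would be consistent with not_upc remaining Unknown in the output above.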
Woodwork did not identify the code
column as having a UPCCode
LogicalType because you have not yet defined an inference function to use with this type. The inference function is what tells Woodwork how to match columns to specific LogicalTypes.
Even without the inference function, you can manually tell Woodwork that the code
column should be of type UPCCode
. This will set the physical type properly and apply the standard semantic tags you have defined.
[7]:
df.ww.init(logical_types={'code': 'UPCCode'})
df.ww
[7]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
id | int64 | Integer | ['numeric'] |
code | category | UPCCode | ['category', 'upc_code'] |
not_upc | string | Unknown | [] |
Next, add a new inference function and allow Woodwork to automatically set the correct type for the code
column.
Defining Custom Inference Functions¶
The first step in adding an inference function for the UPCCode
LogicalType is to define an appropriate function. Inference functions always accept a single parameter, a pandas.Series
. The function should return True
if the series is a match for the LogicalType with which the function is associated, or False
if the series is not a match.
For the UPCCode
LogicalType, define a function to check that all of the values in a column are 12-character strings that contain only numbers. Note that this function is for demonstration purposes only and may not catch all cases that need to be considered for properly identifying a UPC Code.
[8]:
def infer_upc_code(series):
    # Make sure the series contains only strings
    if not series.apply(type).eq(str).all():
        return False
    # Check that all items are 12 characters long
    if all(series.str.len() == 12):
        # Try to convert to a number; non-numeric strings raise an error
        try:
            series.astype('int')
            return True
        except (TypeError, ValueError):
            return False
    return False
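As noted above, the length-and-digits test is only a demonstration. One possible stricter check, sketched here in plain Python, also validates the UPC-A check digit. The helper name has_valid_upc_check_digit is hypothetical and is not part of Woodwork; it works on a single string rather than a Series.

```python
def has_valid_upc_check_digit(code):
    """Return True if code is 12 ASCII digits with a correct UPC-A check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    # Digits in odd positions (1st, 3rd, ..., 11th) are weighted by 3,
    # digits in even positions (2nd, 4th, ..., 10th) are weighted by 1.
    total = 3 * sum(digits[0:11:2]) + sum(digits[1:11:2])
    expected = (10 - total % 10) % 10
    # The 12th digit is the check digit.
    return digits[11] == expected


print(has_valid_upc_check_digit('036000291452'))
print(has_valid_upc_check_digit('012345412359'))
```

The first call prints True and the second prints False: the sample values used in this guide are illustrative and do not carry valid check digits, so a check like this would reject them. The simpler infer_upc_code function is therefore the one used in the rest of the walkthrough.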
After defining the new UPC Code inference function, add it to the Woodwork type system so it can be used when inferring column types.
[9]:
ww.type_system.update_inference_function('UPCCode', inference_function=infer_upc_code)
After updating the inference function, you can reinitialize Woodwork on the DataFrame. Notice that Woodwork has correctly identified the code
column to have a LogicalType of UPCCode
and has correctly set the physical type and added the standard tags to the semantic tags for that column.
Also note that the not_upc
column was not identified as UPCCode
. Even though this column contains 12-character strings, one of the values contains letters, and our inference function correctly told Woodwork this was not valid for the UPCCode
LogicalType.
[10]:
df.ww.init()
df.ww
[10]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
id | int64 | Integer | ['numeric'] |
code | category | UPCCode | ['category', 'upc_code'] |
not_upc | string | Unknown | [] |
Overriding Default Inference Functions¶
Overriding the default inference functions is done with the update_inference_function
TypeSystem method. Simply pass in the LogicalType for which you want to override the function, along with the new function to use.
For example, you can tell Woodwork to use the new infer_upc_code
function for the built-in Categorical
LogicalType.
[11]:
ww.type_system.update_inference_function('Categorical', inference_function=infer_upc_code)
If you initialize Woodwork on a DataFrame after updating the Categorical
function, you can see that the not_upc
column is not identified as a Categorical
column, but is instead set to the default Unknown
LogicalType. This is because the letters in the first row of the not_upc
column cause our inference function to return False
for this column, while the default Categorical
function will allow non-numeric values to be present. After updating the inference
function, this column is no longer considered a match for the Categorical
type, nor does the column match any other logical types. As a result, the LogicalType is set to Unknown
, the default type used when no type matches are found.
[12]:
df.ww.init()
df.ww
[12]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
id | int64 | Integer | ['numeric'] |
code | category | UPCCode | ['category', 'upc_code'] |
not_upc | string | Unknown | [] |
Updating LogicalType Relationships¶
If you need to change the parent for a registered LogicalType, you can do this with the update_relationship
method. Update the new UPCCode
LogicalType to be a child of NaturalLanguage
instead.
[13]:
ww.type_system.update_relationship('UPCCode', parent='NaturalLanguage')
The parent for a logical type can also be set to None
to indicate this is a root-level LogicalType that is not a child of any other existing LogicalType.
[14]:
ww.type_system.update_relationship('UPCCode', parent=None)
Setting the proper parent-child relationships between logical types is important. Because Woodwork will return the most specific LogicalType match found during inference, improper inference can occur if the relationships are not set correctly.
As an example, if you initialize Woodwork after setting the UPCCode
LogicalType to have a parent of None
, you will now see that the UPC Code column is inferred as Categorical
instead of UPCCode
. After setting the parent to None
, UPCCode
and Categorical
are now siblings in the relationship graph instead of having a parent-child relationship as they did previously. When Woodwork finds multiple matches on the same level in the relationship graph, the first match is
returned, which in this case is Categorical
. Without proper parent-child relationships set, Woodwork is unable to determine which LogicalType is most specific.
[15]:
df.ww.init()
df.ww
[15]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
id | int64 | Integer | ['numeric'] |
code | category | Categorical | ['category'] |
not_upc | string | Unknown | [] |
Removing a LogicalType¶
If a LogicalType is no longer needed, or is unwanted, it can be removed from the type system with the remove_type
method. When a LogicalType has been removed, a value of False
will be present in the is_registered
column for the type. If a LogicalType that has children is removed, all of the children types will have their parent set to the parent of the LogicalType that is being removed, assuming a parent was defined.
Remove the custom UPCCode
type and confirm it has been removed from the type system by listing the available LogicalTypes. You can confirm that the UPCCode
type will no longer be used because it will have a value of False
listed in the is_registered
column.
[16]:
ww.type_system.remove_type('UPCCode')
ww.list_logical_types()
[16]:
 | name | type_string | description | physical_type | standard_tags | is_default_type | is_registered | parent_type |
---|---|---|---|---|---|---|---|---|
0 | Address | address | Represents Logical Types that contain address ... | string | {} | True | True | None |
1 | Age | age | Represents Logical Types that contain whole nu... | int64 | {numeric} | True | True | Integer |
2 | AgeFractional | age_fractional | Represents Logical Types that contain non-nega... | float64 | {numeric} | True | True | Double |
3 | AgeNullable | age_nullable | Represents Logical Types that contain whole nu... | Int64 | {numeric} | True | True | IntegerNullable |
4 | Boolean | boolean | Represents Logical Types that contain binary v... | bool | {} | True | True | BooleanNullable |
5 | BooleanNullable | boolean_nullable | Represents Logical Types that contain binary v... | boolean | {} | True | True | None |
6 | Categorical | categorical | Represents Logical Types that contain unordere... | category | {category} | True | True | None |
7 | CountryCode | country_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
8 | Datetime | datetime | Represents Logical Types that contain date and... | datetime64[ns] | {} | True | True | None |
9 | Double | double | Represents Logical Types that contain positive... | float64 | {numeric} | True | True | None |
10 | EmailAddress | email_address | Represents Logical Types that contain email ad... | string | {} | True | True | None |
11 | Filepath | filepath | Represents Logical Types that specify location... | string | {} | True | True | None |
12 | IPAddress | ip_address | Represents Logical Types that contain IP addre... | string | {} | True | True | None |
13 | Integer | integer | Represents Logical Types that contain positive... | int64 | {numeric} | True | True | IntegerNullable |
14 | IntegerNullable | integer_nullable | Represents Logical Types that contain positive... | Int64 | {numeric} | True | True | None |
15 | LatLong | lat_long | Represents Logical Types that contain latitude... | object | {} | True | True | None |
16 | NaturalLanguage | natural_language | Represents Logical Types that contain text or ... | string | {} | True | True | None |
17 | Ordinal | ordinal | Represents Logical Types that contain ordered ... | category | {category} | True | True | Categorical |
18 | PersonFullName | person_full_name | Represents Logical Types that may contain firs... | string | {} | True | True | None |
19 | PhoneNumber | phone_number | Represents Logical Types that contain numeric ... | string | {} | True | True | None |
20 | PostalCode | postal_code | Represents Logical Types that contain a series... | category | {category} | True | True | Categorical |
21 | SubRegionCode | sub_region_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
22 | Timedelta | timedelta | Represents Logical Types that contain values s... | timedelta64[ns] | {} | True | True | None |
23 | UPCCode | upc_code | Represents Logical Types that contain 12-digit... | category | {category, upc_code} | False | False | None |
24 | URL | url | Represents Logical Types that contain URLs, wh... | string | {} | True | True | None |
25 | Unknown | unknown | Represents Logical Types that cannot be inferr... | string | {} | True | True | None |
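The re-parenting rule described at the start of this section can be pictured with a small toy model. The remove_from_graph function below is a hypothetical helper that operates on a plain dictionary of parent relationships; it is not the Woodwork remove_type method. The example relationships are taken from the parent_type column shown in the tables in this guide. In this toy model, children of a removed root type simply become root types themselves, which is an assumption rather than documented behavior.

```python
def remove_from_graph(parents, type_name):
    """Return a copy of parents with type_name removed.

    Children of the removed type are attached to its former parent.
    """
    removed_parent = parents[type_name]
    updated = {}
    for name, parent in parents.items():
        if name == type_name:
            continue
        updated[name] = removed_parent if parent == type_name else parent
    return updated


graph = {
    "IntegerNullable": None,
    "Integer": "IntegerNullable",
    "Age": "Integer",
}
print(remove_from_graph(graph, "Integer"))
```

Removing Integer in this model leaves Age attached to IntegerNullable, the former parent of Integer, and the original dictionary is left unchanged.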
Resetting Type System to Defaults¶
Finally, if you made multiple changes to the default Woodwork type system and would like to reset everything back to the default state, you can use the reset_defaults
method as shown below. This unregisters any new types you have registered, resets all relationships to their default values and sets all inference functions back to their default functions.
[17]:
ww.type_system.reset_defaults()
Overriding Default LogicalTypes¶
There may be times when you would like to override Woodwork’s default LogicalTypes. An example might be if you wanted to use the nullable Int64
dtype for the Integer
LogicalType instead of the default dtype of int64
. In this case, you want to stop Woodwork from inferring the default Integer
LogicalType and have a compatible Logical Type inferred instead. You may solve this issue in one of two ways.
First, you can create an entirely new LogicalType with its own name, MyInteger
, and register it in the TypeSystem. If you want to infer it in place of the normal Integer
LogicalType, you would remove Integer
from the type system, and use Integer
’s default inference function for MyInteger
. Doing this will make it such that MyInteger
will get inferred any place that Integer
would have previously. Note that because Integer
has a parent LogicalType of
IntegerNullable
, you also need to set the parent of MyInteger
to be IntegerNullable
when registering with the type system.
[18]:
from woodwork.logical_types import LogicalType


class MyInteger(LogicalType):
    primary_dtype = 'Int64'
    standard_tags = {'numeric'}


int_inference_fn = ww.type_system.inference_functions[ww.logical_types.Integer]
ww.type_system.remove_type(ww.logical_types.Integer)
ww.type_system.add_type(MyInteger, int_inference_fn, parent='IntegerNullable')
df.ww.init()
df.ww
[18]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
id | Int64 | MyInteger | ['numeric'] |
code | category | Categorical | ['category'] |
not_upc | string | Unknown | [] |
Above, you can see that the id
column, which was previously inferred as Integer
, is now inferred as MyInteger
with the Int64
physical type. In the full list of Logical Types at ww.list_logical_types()
, Integer
and MyInteger
will now both be present, but Integer
’s is_registered
will be False while the value for is_registered
for MyInteger
will be set to True.
The second option for overriding the default Logical Types allows you to create a new LogicalType with the same name as an existing one. This might be desirable because it will allow Woodwork to interpret the string 'Integer'
as your new LogicalType, allowing previous code that might have selected 'Integer'
to be used without updating references to a new LogicalType like MyInteger
.
Before adding a LogicalType whose name already exists into the TypeSystem, you must first unregister the default LogicalType.
To avoid a name clash in your local namespace between the two Integer LogicalTypes, reference Woodwork’s default LogicalType as ww.logical_types.Integer
.
[19]:
ww.type_system.reset_defaults()


class Integer(LogicalType):
    primary_dtype = 'Int64'
    standard_tags = {'numeric'}


int_inference_fn = ww.type_system.inference_functions[ww.logical_types.Integer]
ww.type_system.remove_type(ww.logical_types.Integer)
ww.type_system.add_type(Integer, int_inference_fn, parent='IntegerNullable')
df.ww.init()
display(df.ww)
ww.type_system.reset_defaults()
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
id | Int64 | Integer | ['numeric'] |
code | category | Categorical | ['category'] |
not_upc | string | Unknown | [] |
Notice how id
now gets inferred as an Integer
Logical Type that has Int64
as its Physical Type!