Customizing Logical Types and Type Inference

The default type system in Woodwork contains many built-in LogicalTypes that work for a wide variety of datasets. For situations in which the built-in LogicalTypes are not sufficient, Woodwork allows you to create custom LogicalTypes.

Woodwork also provides a set of standard type inference functions that help identify the correct LogicalTypes in the data automatically. You can override these existing functions, or add new functions for inferring any custom LogicalTypes that you have added.

This guide provides an overview of how to create custom LogicalTypes as well as how to override and add new type inference functions. To learn more about the existing types and tags in Woodwork, refer to the Understanding Logical Types and Semantic Tags guide. To learn more about how to set and update these types and tags on a DataFrame, refer to the Working with Types and Tags guide.

Viewing Built-In Logical Types

To view all of the default LogicalTypes in Woodwork, use the list_logical_types function. If the existing types are not sufficient for your needs, you can create and register new LogicalTypes for use with Woodwork-initialized DataFrames and Series.

[1]:
import woodwork as ww

ww.list_logical_types()
[1]:
name type_string description physical_type standard_tags is_default_type is_registered parent_type
0 Address address Represents Logical Types that contain address ... string {} True True None
1 Age age Represents Logical Types that contain whole nu... int64 {numeric} True True Integer
2 AgeFractional age_fractional Represents Logical Types that contain non-nega... float64 {numeric} True True Double
3 AgeNullable age_nullable Represents Logical Types that contain whole nu... Int64 {numeric} True True IntegerNullable
4 Boolean boolean Represents Logical Types that contain binary v... bool {} True True BooleanNullable
5 BooleanNullable boolean_nullable Represents Logical Types that contain binary v... boolean {} True True None
6 Categorical categorical Represents Logical Types that contain unordere... category {category} True True None
7 CountryCode country_code Represents Logical Types that use the ISO-3166... category {category} True True Categorical
8 CurrencyCode currency_code Represents Logical Types that use the ISO-4217... category {category} True True Categorical
9 Datetime datetime Represents Logical Types that contain date and... datetime64[ns] {} True True None
10 Double double Represents Logical Types that contain positive... float64 {numeric} True True None
11 EmailAddress email_address Represents Logical Types that contain email ad... string {} True True None
12 Filepath filepath Represents Logical Types that specify location... string {} True True None
13 IPAddress ip_address Represents Logical Types that contain IP addre... string {} True True None
14 Integer integer Represents Logical Types that contain positive... int64 {numeric} True True IntegerNullable
15 IntegerNullable integer_nullable Represents Logical Types that contain positive... Int64 {numeric} True True None
16 LatLong lat_long Represents Logical Types that contain latitude... object {} True True None
17 NaturalLanguage natural_language Represents Logical Types that contain text or ... string {} True True None
18 Ordinal ordinal Represents Logical Types that contain ordered ... category {category} True True Categorical
19 PersonFullName person_full_name Represents Logical Types that may contain firs... string {} True True None
20 PhoneNumber phone_number Represents Logical Types that contain numeric ... string {} True True None
21 PostalCode postal_code Represents Logical Types that contain a series... category {category} True True Categorical
22 SubRegionCode sub_region_code Represents Logical Types that use the ISO-3166... category {category} True True Categorical
23 Timedelta timedelta Represents Logical Types that contain values s... timedelta64[ns] {} True True None
24 URL url Represents Logical Types that contain URLs, wh... string {} True True None
25 Unknown unknown Represents Logical Types that cannot be inferr... string {} True True None

Registering a New LogicalType

The first step in registering a new LogicalType is to define the class for the new type. This is done by sub-classing the built-in LogicalType class. There are a few class attributes that should be set when defining this new class. Each is reviewed in more detail below.

This example works through a dataset that contains UPC Codes. First, create a new UPCCode LogicalType. Here, consider the UPC Code to be a type of categorical variable.

[2]:
from woodwork.logical_types import LogicalType


class UPCCode(LogicalType):
    """Represents Logical Types that contain 12-digit UPC Codes."""

    primary_dtype = "category"
    backup_dtype = "string"
    standard_tags = {"category", "upc_code"}

When defining the UPCCode LogicalType class, three class attributes were set and a docstring was added. All of these are optional; any attribute that is not set when defining the new type defaults to the value defined on the LogicalType class.

  • primary_dtype: This value specifies how the data will be stored. If the column of the dataframe is not already of this type, Woodwork will convert the data to this dtype. This should be specified as a string that represents a valid pandas dtype. If not specified, this will default to 'string'.

  • backup_dtype: This is primarily useful when working with Spark dataframes. backup_dtype specifies the dtype to use if Woodwork is unable to convert to the dtype specified by primary_dtype. In our example, we set this to 'string' since Spark does not currently support the 'category' dtype.

  • standard_tags: This is a set of semantic tags to apply to any column that is set with the specified LogicalType. If not specified, standard_tags will default to an empty set.

  • docstring: Adding a docstring for the class is optional, but if specified, this text will be used for adding a description of the type in the list of available types returned by ww.list_logical_types().

Note

Behind the scenes, Woodwork uses the category and numeric semantic tags to determine whether a column is categorical or numeric, respectively. If the new LogicalType you define represents a categorical or numeric type, you should include the appropriate tag in the set of tags specified for standard_tags.

Now that you have created the new LogicalType, you can register it with the Woodwork type system so you can use it. All modifications to the type system are performed by calling the appropriate method on the ww.type_system object.

[3]:
ww.type_system.add_type(UPCCode, parent="Categorical")

If you once again list the available LogicalTypes, you will see the new type you created was added to the list, including the values for description, physical_type and standard_tags specified when defining the UPCCode LogicalType.

[4]:
ww.list_logical_types()
[4]:
name type_string description physical_type standard_tags is_default_type is_registered parent_type
0 Address address Represents Logical Types that contain address ... string {} True True None
1 Age age Represents Logical Types that contain whole nu... int64 {numeric} True True Integer
2 AgeFractional age_fractional Represents Logical Types that contain non-nega... float64 {numeric} True True Double
3 AgeNullable age_nullable Represents Logical Types that contain whole nu... Int64 {numeric} True True IntegerNullable
4 Boolean boolean Represents Logical Types that contain binary v... bool {} True True BooleanNullable
5 BooleanNullable boolean_nullable Represents Logical Types that contain binary v... boolean {} True True None
6 Categorical categorical Represents Logical Types that contain unordere... category {category} True True None
7 CountryCode country_code Represents Logical Types that use the ISO-3166... category {category} True True Categorical
8 CurrencyCode currency_code Represents Logical Types that use the ISO-4217... category {category} True True Categorical
9 Datetime datetime Represents Logical Types that contain date and... datetime64[ns] {} True True None
10 Double double Represents Logical Types that contain positive... float64 {numeric} True True None
11 EmailAddress email_address Represents Logical Types that contain email ad... string {} True True None
12 Filepath filepath Represents Logical Types that specify location... string {} True True None
13 IPAddress ip_address Represents Logical Types that contain IP addre... string {} True True None
14 Integer integer Represents Logical Types that contain positive... int64 {numeric} True True IntegerNullable
15 IntegerNullable integer_nullable Represents Logical Types that contain positive... Int64 {numeric} True True None
16 LatLong lat_long Represents Logical Types that contain latitude... object {} True True None
17 NaturalLanguage natural_language Represents Logical Types that contain text or ... string {} True True None
18 Ordinal ordinal Represents Logical Types that contain ordered ... category {category} True True Categorical
19 PersonFullName person_full_name Represents Logical Types that may contain firs... string {} True True None
20 PhoneNumber phone_number Represents Logical Types that contain numeric ... string {} True True None
21 PostalCode postal_code Represents Logical Types that contain a series... category {category} True True Categorical
22 SubRegionCode sub_region_code Represents Logical Types that use the ISO-3166... category {category} True True Categorical
23 Timedelta timedelta Represents Logical Types that contain values s... timedelta64[ns] {} True True None
24 UPCCode upc_code Represents Logical Types that contain 12-digit... category {upc_code, category} False True Categorical
25 URL url Represents Logical Types that contain URLs, wh... string {} True True None
26 Unknown unknown Represents Logical Types that cannot be inferr... string {} True True None

Logical Type Relationships

When adding a new type to the type system, you can specify an optional parent LogicalType as done above. When performing type inference a given set of data might match multiple different LogicalTypes. Woodwork uses the parent-child relationship defined when registering a type to determine which type to infer in this case.

When multiple matches are found, Woodwork will return the most specific type match found. By setting the parent type to Categorical when registering the UPCCode LogicalType, you are telling Woodwork that if a data column matches both Categorical and UPCCode during inference, the column should be considered as UPCCode as this is more specific than Categorical. Woodwork always assumes that a child type is a more specific version of the parent type.
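The selection rule can be illustrated with a small, self-contained model. This is a simplified sketch written for this guide, not Woodwork's actual implementation: it assumes the set of matching types is already known, represents the relationships as a plain dictionary of child to parent, and treats a type's depth in that graph as its specificity. The function name most_specific_match is hypothetical.

```python
def most_specific_match(matches, parents):
    """Return the deepest matching type; ties go to the earliest match."""

    def depth(type_name):
        # Count how many ancestors sit above this type.
        levels = 0
        while parents.get(type_name) is not None:
            type_name = parents[type_name]
            levels += 1
        return levels

    best = matches[0]
    for candidate in matches[1:]:
        # Only a strictly deeper type replaces the current best match.
        if depth(candidate) > depth(best):
            best = candidate
    return best


# UPCCode is registered as a child of Categorical.
parents = {"Categorical": None, "UPCCode": "Categorical"}
print(most_specific_match(["Categorical", "UPCCode"], parents))
```

In this model, a child type wins over its parent because it sits one level deeper. When two matching types sit at the same level, the earlier match is kept.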

Working with Custom LogicalTypes

Next, you will create a small sample DataFrame to demonstrate use of the new custom type. This sample DataFrame includes an id column, a column with valid UPC Codes, and a column that should not be considered UPC Codes because it contains non-numeric values.

[5]:
import pandas as pd

df = pd.DataFrame(
    {
        "id": [0, 1, 2, 3],
        "code": ["012345412359", "122345712358", "012345412359", "012345412359"],
        "not_upc": ["abcdefghijkl", "122345712358", "012345412359", "022323413459"],
    }
)

Use a with block to temporarily override Woodwork’s default threshold for differentiating between an Unknown and a Categorical column, so that Woodwork will correctly recognize the code column as a Categorical column. With the threshold set, initialize Woodwork and verify that Woodwork has identified the code column as Categorical.

[6]:
with ww.config.with_options(categorical_threshold=0.5):
    df.ww.init()
df.ww
[6]:
Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['numeric']
code category Categorical ['category']
not_upc string Unknown []

Woodwork did not identify the code column as having a UPCCode LogicalType because you have not yet defined an inference function to use with this type. The inference function is what tells Woodwork how to match columns to specific LogicalTypes.

Even without the inference function, you can manually tell Woodwork that the code column should be of type UPCCode. This will set the physical type properly and apply the standard semantic tags you have defined.

[7]:
df.ww.init(logical_types={"code": "UPCCode"})
df.ww
[7]:
Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['numeric']
code category UPCCode ['upc_code', 'category']
not_upc string Unknown []

Next, add a new inference function and allow Woodwork to automatically set the correct type for the code column.

Defining Custom Inference Functions

The first step in adding an inference function for the UPCCode LogicalType is to define an appropriate function. Inference functions always accept a single parameter, a pandas.Series. The function should return True if the series is a match for the LogicalType with which the function is associated, or False if the series is not a match.

For the UPCCode LogicalType, define a function to check that all of the values in a column are 12 character strings that contain only numbers. Note, this function is for demonstration purposes only and may not catch all cases that need to be considered for properly identifying a UPC Code.

[8]:
def infer_upc_code(series):
    # Make sure series contains only strings:
    if not series.apply(type).eq(str).all():
        return False
    # Check that all items are 12 characters long
    if all(series.str.len() == 12):
        # Try to convert to a number
        try:
            series.astype("int")
            return True
        except (ValueError, TypeError, OverflowError):
            return False
    return False
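Because an inference function is an ordinary Python function, you can test it directly on a pandas.Series before registering it. The sketch below shows one possible stricter variant, offered as an assumption rather than as part of Woodwork: integer conversion can accept a 12-character string with a leading sign, such as "-12345678901", so this version matches each value against a 12-digit pattern instead. The name infer_upc_code_strict is hypothetical.

```python
import pandas as pd


def infer_upc_code_strict(series):
    """Return True only if every value is a string of exactly 12 ASCII digits."""
    if series.empty:
        return False
    # Reject the column if any value is not a string (this includes missing values).
    if not series.map(lambda value: isinstance(value, str)).all():
        return False
    # fullmatch requires the whole string to match the pattern.
    return bool(series.str.fullmatch(r"[0-9]{12}").all())


valid = pd.Series(["012345412359", "122345712358"])
with_letters = pd.Series(["abcdefghijkl", "122345712358"])
print(infer_upc_code_strict(valid), infer_upc_code_strict(with_letters))
```

Like the function above, this is for demonstration only. It checks the format of each value and does not validate UPC check digits.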

After defining the new UPC Code inference function, add it to the Woodwork type system so it can be used when inferring column types.

[9]:
ww.type_system.update_inference_function("UPCCode", inference_function=infer_upc_code)

After updating the inference function, you can reinitialize Woodwork on the DataFrame. Notice that Woodwork has correctly identified the code column to have a LogicalType of UPCCode and has correctly set the physical type and added the standard tags to the semantic tags for that column.

Also note that the not_upc column was not identified as UPCCode. Even though this column contains 12-character strings, one of the values contains letters, so the inference function correctly told Woodwork that the column is not valid for the UPCCode LogicalType.

[10]:
df.ww.init()
df.ww
[10]:
Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['numeric']
code category UPCCode ['upc_code', 'category']
not_upc string Unknown []

Overriding Default Inference Functions

Overriding the default inference functions is done with the update_inference_function TypeSystem method. Simply pass in the LogicalType for which you want to override the function, along with the new function to use.

For example, you can tell Woodwork to use the new infer_upc_code function for the built-in Categorical LogicalType.

[11]:
ww.type_system.update_inference_function(
    "Categorical", inference_function=infer_upc_code
)

If you initialize Woodwork on a DataFrame after updating the Categorical function, you can see that the not_upc column is not identified as a Categorical column and is instead set to the default Unknown LogicalType. The letters in the first row of the not_upc column cause the new inference function to return False for this column, whereas the default Categorical function allows non-numeric values to be present. With the updated function, this column is not a match for the Categorical type, nor does it match any other logical type. As a result, the LogicalType is set to Unknown, the default type used when no type matches are found.

[12]:
df.ww.init()
df.ww
[12]:
Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['numeric']
code category UPCCode ['upc_code', 'category']
not_upc string Unknown []

Updating LogicalType Relationships

If you need to change the parent for a registered LogicalType, you can do this with the update_relationship method. Update the new UPCCode LogicalType to be a child of NaturalLanguage instead.

[13]:
ww.type_system.update_relationship("UPCCode", parent="NaturalLanguage")

The parent for a logical type can also be set to None to indicate this is a root-level LogicalType that is not a child of any other existing LogicalType.

[14]:
ww.type_system.update_relationship("UPCCode", parent=None)

Setting the proper parent-child relationships between logical types is important. Because Woodwork will return the most specific LogicalType match found during inference, improper inference can occur if the relationships are not set correctly.

As an example, if you initialize Woodwork after setting the UPCCode LogicalType to have a parent of None, you will now see that the code column is inferred as Categorical instead of UPCCode. After setting the parent to None, UPCCode and Categorical are siblings in the relationship graph instead of having a parent-child relationship as they did previously. When Woodwork finds multiple matches on the same level in the relationship graph, the first match is returned, which in this case is Categorical. Without proper parent-child relationships set, Woodwork is unable to determine which LogicalType is most specific.

[15]:
df.ww.init()
df.ww
[15]:
Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['numeric']
code category Categorical ['category']
not_upc string Unknown []

Removing a LogicalType

If a LogicalType is no longer needed, or is unwanted, it can be removed from the type system with the remove_type method. When a LogicalType has been removed, a value of False will be present in the is_registered column for the type. If a LogicalType that has children is removed, all of the children types will have their parent set to the parent of the LogicalType that is being removed, assuming a parent was defined.

Remove the custom UPCCode type and confirm it has been removed from the type system by listing the available LogicalTypes. You can confirm that the UPCCode type will no longer be used because it will have a value of False listed in the is_registered column.

[16]:
ww.type_system.remove_type("UPCCode")
ww.list_logical_types()
[16]:
name type_string description physical_type standard_tags is_default_type is_registered parent_type
0 Address address Represents Logical Types that contain address ... string {} True True None
1 Age age Represents Logical Types that contain whole nu... int64 {numeric} True True Integer
2 AgeFractional age_fractional Represents Logical Types that contain non-nega... float64 {numeric} True True Double
3 AgeNullable age_nullable Represents Logical Types that contain whole nu... Int64 {numeric} True True IntegerNullable
4 Boolean boolean Represents Logical Types that contain binary v... bool {} True True BooleanNullable
5 BooleanNullable boolean_nullable Represents Logical Types that contain binary v... boolean {} True True None
6 Categorical categorical Represents Logical Types that contain unordere... category {category} True True None
7 CountryCode country_code Represents Logical Types that use the ISO-3166... category {category} True True Categorical
8 CurrencyCode currency_code Represents Logical Types that use the ISO-4217... category {category} True True Categorical
9 Datetime datetime Represents Logical Types that contain date and... datetime64[ns] {} True True None
10 Double double Represents Logical Types that contain positive... float64 {numeric} True True None
11 EmailAddress email_address Represents Logical Types that contain email ad... string {} True True None
12 Filepath filepath Represents Logical Types that specify location... string {} True True None
13 IPAddress ip_address Represents Logical Types that contain IP addre... string {} True True None
14 Integer integer Represents Logical Types that contain positive... int64 {numeric} True True IntegerNullable
15 IntegerNullable integer_nullable Represents Logical Types that contain positive... Int64 {numeric} True True None
16 LatLong lat_long Represents Logical Types that contain latitude... object {} True True None
17 NaturalLanguage natural_language Represents Logical Types that contain text or ... string {} True True None
18 Ordinal ordinal Represents Logical Types that contain ordered ... category {category} True True Categorical
19 PersonFullName person_full_name Represents Logical Types that may contain firs... string {} True True None
20 PhoneNumber phone_number Represents Logical Types that contain numeric ... string {} True True None
21 PostalCode postal_code Represents Logical Types that contain a series... category {category} True True Categorical
22 SubRegionCode sub_region_code Represents Logical Types that use the ISO-3166... category {category} True True Categorical
23 Timedelta timedelta Represents Logical Types that contain values s... timedelta64[ns] {} True True None
24 UPCCode upc_code Represents Logical Types that contain 12-digit... category {upc_code, category} False False None
25 URL url Represents Logical Types that contain URLs, wh... string {} True True None
26 Unknown unknown Represents Logical Types that cannot be inferr... string {} True True None

Resetting Type System to Defaults

Finally, if you made multiple changes to the default Woodwork type system and would like to reset everything back to the default state, you can use the reset_defaults method as shown below. This unregisters any new types you have registered, resets all relationships to their default values and sets all inference functions back to their default functions.

[17]:
ww.type_system.reset_defaults()

Overriding Default LogicalTypes

There may be times when you would like to override Woodwork’s default LogicalTypes. An example might be if you wanted to use the nullable Int64 dtype for the Integer LogicalType instead of the default dtype of int64. In this case, you want to stop Woodwork from inferring the default Integer LogicalType and have a compatible Logical Type inferred instead. You may solve this issue in one of two ways.

First, you can create an entirely new LogicalType with its own name, MyInteger, and register it in the TypeSystem. If you want to infer it in place of the normal Integer LogicalType, you would remove Integer from the type system, and use Integer’s default inference function for MyInteger. As a result, MyInteger will be inferred anywhere that Integer would have been inferred previously. Note that because Integer has a parent LogicalType of IntegerNullable, you also need to set the parent of MyInteger to be IntegerNullable when registering it with the type system.

[18]:
from woodwork.logical_types import LogicalType


class MyInteger(LogicalType):
    primary_dtype = "Int64"
    standard_tags = {"numeric"}


int_inference_fn = ww.type_system.inference_functions[ww.logical_types.Integer]

ww.type_system.remove_type(ww.logical_types.Integer)
ww.type_system.add_type(MyInteger, int_inference_fn, parent="IntegerNullable")

df.ww.init()
df.ww
[18]:
Physical Type Logical Type Semantic Tag(s)
Column
id Int64 MyInteger ['numeric']
code category Categorical ['category']
not_upc string Unknown []

Above, you can see that the id column, which was previously inferred as Integer, is now inferred as MyInteger with the Int64 physical type. In the full list of Logical Types returned by ww.list_logical_types(), Integer and MyInteger will now both be present, but is_registered will be False for Integer and True for MyInteger.

The second option for overriding the default Logical Types allows you to create a new LogicalType with the same name as an existing one. This might be desirable because it will allow Woodwork to interpret the string 'Integer' as your new LogicalType, allowing previous code that might have selected 'Integer' to be used without updating references to a new LogicalType like MyInteger.

Before adding a LogicalType whose name already exists into the TypeSystem, you must first unregister the default LogicalType.

To avoid a local name clash between the two Integer LogicalTypes, it is recommended to reference Woodwork’s default LogicalType as ww.logical_types.Integer.

[19]:
ww.type_system.reset_defaults()


class Integer(LogicalType):
    primary_dtype = "Int64"
    standard_tags = {"numeric"}


int_inference_fn = ww.type_system.inference_functions[ww.logical_types.Integer]

ww.type_system.remove_type(ww.logical_types.Integer)
ww.type_system.add_type(Integer, int_inference_fn, parent="IntegerNullable")

df.ww.init()
display(df.ww)
ww.type_system.reset_defaults()
Physical Type Logical Type Semantic Tag(s)
Column
id Int64 Integer ['numeric']
code category Categorical ['category']
not_upc string Unknown []

Notice how id now gets inferred as an Integer Logical Type that has Int64 as its Physical Type!