Customizing Logical Types and Type Inference

The default type system in Woodwork contains many built-in LogicalTypes that will work for a wide variety of datasets. For situations in which the built-in LogicalTypes are not sufficient, Woodwork allows users to create custom LogicalTypes.

Woodwork also has a set of standard type inference functions that help automatically identify the correct LogicalTypes in the data. Users can override these existing functions, or add new functions for inferring any custom LogicalTypes that are added.

This guide will provide an overview of how to create custom LogicalTypes as well as how to override and add new type inference functions.

Viewing Built-In Logical Types

To view all of the default LogicalTypes in Woodwork, users can use the list_logical_types function. If the existing types are not sufficient for your needs, you can create and register new LogicalTypes for use in creating DataTables and DataColumns.

[1]:
import woodwork as ww

ww.list_logical_types()
[1]:
    name             type_string       description                                        physical_type    standard_tags         is_default_type  is_registered  parent_type
0   Boolean          boolean           Represents Logical Types that contain binary v...  boolean          {}                    True             True           None
1   Categorical      categorical       Represents Logical Types that contain unordere...  category         {category}            True             True           None
2   CountryCode      country_code      Represents Logical Types that contain categori...  category         {category}            True             True           Categorical
3   Datetime         datetime          Represents Logical Types that contain date and...  datetime64[ns]   {}                    True             True           None
4   Double           double            Represents Logical Types that contain positive...  float64          {numeric}             True             True           None
5   EmailAddress     email_address     Represents Logical Types that contain email ad...  string           {}                    True             True           NaturalLanguage
6   Filepath         filepath          Represents Logical Types that specify location...  string           {}                    True             True           NaturalLanguage
7   FullName         full_name         Represents Logical Types that may contain firs...  string           {}                    True             True           NaturalLanguage
8   IPAddress        ip_address        Represents Logical Types that contain IP addre...  string           {}                    True             True           NaturalLanguage
9   Integer          integer           Represents Logical Types that contain positive...  Int64            {numeric}             True             True           None
10  LatLong          lat_long          Represents Logical Types that contain latitude...  object           {}                    True             True           None
11  NaturalLanguage  natural_language  Represents Logical Types that contain text or ...  string           {}                    True             True           None
12  Ordinal          ordinal           Represents Logical Types that contain ordered ...  category         {category}            True             True           Categorical
13  PhoneNumber      phone_number      Represents Logical Types that contain numeric ...  string           {}                    True             True           NaturalLanguage
14  SubRegionCode    sub_region_code   Represents Logical Types that contain codes re...  category         {category}            True             True           Categorical
15  Timedelta        timedelta         Represents Logical Types that contain values s...  timedelta64[ns]  {}                    True             True           None
16  URL              url               Represents Logical Types that contain URLs, wh...  string           {}                    True             True           NaturalLanguage
17  ZIPCode          zip_code          Represents Logical Types that contain a series...  category         {category}            True             True           Categorical

Registering a New LogicalType

The first step in registering a new LogicalType is to define the class for the new type. This can be done simply by sub-classing the built-in LogicalType class. There are a few class attributes that should be set when defining this new class. Each will be discussed in more detail below.

We will work through an example for a dataset that contains UPC Codes. First let’s create a new UPCCode LogicalType. For this example we will consider the UPC Code to be a type of categorical variable.

[2]:
from woodwork.logical_types import LogicalType

class UPCCode(LogicalType):
    """Represents Logical Types that contain 12-digit UPC Codes."""
    pandas_dtype = 'category'
    backup_dtype = 'str'
    standard_tags = {'category', 'upc_code'}

When defining the UPCCode LogicalType class, three class attributes were set. All three of these attributes are optional, and will default to the values defined on the LogicalType class if they are not set when defining the new type.

  • pandas_dtype: This value specifies how the data will be stored. If the column in the underlying dataframe is not already of this type, Woodwork will convert the data to this dtype. This should be specified as a string that represents a valid pandas dtype. If not specified, this will default to 'string'.

  • backup_dtype: This is primarily useful when working with Koalas dataframes. backup_dtype specifies the dtype to use if Woodwork is unable to convert to the primary dtype specified by pandas_dtype. In our example, we set this to 'str' since Koalas does not currently support the 'category' dtype.

  • standard_tags: This is a set of semantic tags to apply to any column that is set with the specified LogicalType. If not specified, standard_tags will default to an empty set.

  • docstring: Adding a docstring for the class is optional, but if specified, this text will be used for adding a description of the type in the list of available types returned by ww.list_logical_types().
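To illustrate how unset attributes fall back to the values on the base class, here is a minimal sketch that uses a stand-in base class rather than Woodwork itself. The names StandInLogicalType and SKUCode are hypothetical. The 'string' and empty-set defaults follow the descriptions above, while the None default for backup_dtype is an assumption made only for this example.

```python
class StandInLogicalType:
    """Stand-in for Woodwork's LogicalType (hypothetical, illustration only)."""
    pandas_dtype = 'string'   # default described in the text above
    backup_dtype = None       # assumed default; not confirmed by the text
    standard_tags = set()     # default described in the text above


class SKUCode(StandInLogicalType):
    """Represents a hypothetical stock-keeping-unit code."""
    # Only standard_tags is overridden; the other attributes are inherited.
    standard_tags = {'category'}


print(SKUCode.pandas_dtype)   # inherited from the base class
print(SKUCode.standard_tags)  # overridden on the subclass
print(SKUCode.__doc__)        # the docstring text that a description could be built from
```

The same pattern applies to UPCCode above: any attribute left off the class definition is simply looked up on the parent class.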

Note

Behind the scenes, Woodwork uses the category and numeric semantic tags to determine whether a column is a categorical or numeric column, respectively. If the new LogicalType you define represents a categorical or numeric type, you should include the appropriate tag in the set of tags specified for standard_tags.

Now that we have created our new LogicalType, we can register it with the Woodwork type system, so we can use it. All modifications to the type system can be performed by calling the appropriate method on the ww.type_system object.

[3]:
ww.type_system.add_type(UPCCode, parent='Categorical')

Now, if we once again list the available LogicalTypes, we will see the new type we have created has been added to the list, including the values for description, physical_type and standard_tags we specified when defining the UPCCode LogicalType.

[4]:
ww.list_logical_types()
[4]:
    name             type_string       description                                        physical_type    standard_tags         is_default_type  is_registered  parent_type
0   Boolean          boolean           Represents Logical Types that contain binary v...  boolean          {}                    True             True           None
1   Categorical      categorical       Represents Logical Types that contain unordere...  category         {category}            True             True           None
2   CountryCode      country_code      Represents Logical Types that contain categori...  category         {category}            True             True           Categorical
3   Datetime         datetime          Represents Logical Types that contain date and...  datetime64[ns]   {}                    True             True           None
4   Double           double            Represents Logical Types that contain positive...  float64          {numeric}             True             True           None
5   EmailAddress     email_address     Represents Logical Types that contain email ad...  string           {}                    True             True           NaturalLanguage
6   Filepath         filepath          Represents Logical Types that specify location...  string           {}                    True             True           NaturalLanguage
7   FullName         full_name         Represents Logical Types that may contain firs...  string           {}                    True             True           NaturalLanguage
8   IPAddress        ip_address        Represents Logical Types that contain IP addre...  string           {}                    True             True           NaturalLanguage
9   Integer          integer           Represents Logical Types that contain positive...  Int64            {numeric}             True             True           None
10  LatLong          lat_long          Represents Logical Types that contain latitude...  object           {}                    True             True           None
11  NaturalLanguage  natural_language  Represents Logical Types that contain text or ...  string           {}                    True             True           None
12  Ordinal          ordinal           Represents Logical Types that contain ordered ...  category         {category}            True             True           Categorical
13  PhoneNumber      phone_number      Represents Logical Types that contain numeric ...  string           {}                    True             True           NaturalLanguage
14  SubRegionCode    sub_region_code   Represents Logical Types that contain codes re...  category         {category}            True             True           Categorical
15  Timedelta        timedelta         Represents Logical Types that contain values s...  timedelta64[ns]  {}                    True             True           None
16  UPCCode          upc_code          Represents Logical Types that contain 12-digit...  category         {upc_code, category}  False            True           Categorical
17  URL              url               Represents Logical Types that contain URLs, wh...  string           {}                    True             True           NaturalLanguage
18  ZIPCode          zip_code          Represents Logical Types that contain a series...  category         {category}            True             True           Categorical

Logical Type Relationships

When adding a new type to the type system, users can specify an optional parent LogicalType as we have done above. When performing type inference it is possible that a given set of data will match multiple different LogicalTypes. Woodwork uses the parent-child relationship defined when registering a type to determine which type to infer in this case.

When multiple matches are found, Woodwork will return the most specific type match found. By setting the parent type to Categorical when registering our UPCCode LogicalType, we are telling Woodwork that if a data column matches both Categorical and UPCCode during inference, the column should be considered as UPCCode as this is more specific than Categorical. Woodwork always assumes that a child type is a more specific version than the parent type.
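The idea of returning the most specific match can be sketched in plain Python. This is a simplified illustration, not Woodwork's actual implementation: the function most_specific_match, the parents mapping, and the matcher lambdas are all hypothetical names introduced here. The sketch assumes that root types are checked in registration order and that a matching child always takes precedence over its parent.

```python
def most_specific_match(values, parents, matchers, default='NaturalLanguage'):
    """Return the deepest matching type name, or `default` if nothing matches.

    parents:  dict of type name -> parent type name (None for root types)
    matchers: dict of type name -> function(values) -> bool
    """
    def children_of(parent):
        return [name for name, p in parents.items() if p == parent]

    def search(candidates):
        for name in candidates:
            if matchers[name](values):
                # A matching child is treated as more specific than its parent.
                return search(children_of(name)) or name
        return None

    return search(children_of(None)) or default


matchers = {
    'Categorical': lambda vals: all(isinstance(v, str) for v in vals),
    'UPCCode': lambda vals: all(
        isinstance(v, str) and len(v) == 12 and v.isdigit() for v in vals
    ),
}
codes = ['012345412359', '122345712358']

# UPCCode registered as a child of Categorical: the child wins.
as_child = most_specific_match(
    codes, {'Categorical': None, 'UPCCode': 'Categorical'}, matchers
)

# UPCCode registered as a root type: both are siblings, so the first match wins.
as_sibling = most_specific_match(
    codes, {'Categorical': None, 'UPCCode': None}, matchers
)

print(as_child, as_sibling)
```

In this sketch the only difference between the two calls is the parent assigned to UPCCode, which is enough to change the result from UPCCode to Categorical.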

Working with Custom LogicalTypes

Next, we will create a small sample DataFrame to demonstrate how we can use our new custom type. This sample DataFrame includes an id column, a column with valid UPC Codes, and a column that should not be considered UPC Codes because it contains non-numeric values.

[5]:
import pandas as pd
dataframe = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'code': ['012345412359', '122345712358', '012345412359', '022323413459'],
    'not_upc': ['abcdefghijkl', '122345712358', '012345412359', '022323413459']
})

Before we use this dataframe, let’s update Woodwork’s default threshold for differentiating between a NaturalLanguage and Categorical column so that Woodwork will correctly recognize our code column as a Categorical column. After setting the threshold, we can create a new DataTable and verify that Woodwork has identified our column as Categorical.

[6]:
ww.config.set_option('natural_language_threshold', 12)
dt = ww.DataTable(dataframe)
dt
[6]:
             Physical Type  Logical Type     Semantic Tag(s)
Data Column
id           Int64          Integer          ['numeric']
code         category       Categorical      ['category']
not_upc      category       Categorical      ['category']

Woodwork did not identify the code column as having a UPCCode LogicalType because we have not yet defined an inference function to use with this type. The inference function is what tells Woodwork how to match columns to specific LogicalTypes.

Even without the inference function, we can manually tell Woodwork that the code column should be of type UPCCode. This will set the physical type properly and apply the standard semantic tags we have defined.

[7]:
dt = ww.DataTable(dataframe, logical_types={'code': 'UPCCode'})
dt
[7]:
             Physical Type  Logical Type     Semantic Tag(s)
Data Column
id           Int64          Integer          ['numeric']
code         category       UPCCode          ['upc_code', 'category']
not_upc      category       Categorical      ['category']

Now, let’s add a new inference function and allow Woodwork to automatically set the correct type for the code column.

Defining Custom Inference Functions

The first step in adding an inference function for the UPCCode LogicalType is to define an appropriate function. Inference functions should always accept a single parameter, a pandas.Series. The function should return True if the series is a match for the LogicalType for which the function is associated, or False if the series is not a match.

For the UPCCode LogicalType, let’s define a function to check that all of the values in a column are 12 character strings that contain only numbers. Note, this function is for demonstration purposes only and may not catch all cases that need to be considered for properly identifying a UPC Code.

[8]:
def infer_upc_code(series):
    # Make sure series contains only strings:
    if not series.apply(type).eq(str).all():
        return False
    # Check that all items are 12 characters long
    if all(series.str.len() == 12):
        # Try to convert to a number
        try:
            series.astype('int')
            return True
        except (ValueError, OverflowError):
            # Conversion fails if any value cannot be read as an integer
            return False
    return False
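As noted, the demonstration function only checks length and numeric content. One case it does not cover is the UPC-A check digit, where the twelfth digit is derived from the first eleven. The sketch below shows that check for a single string in plain Python. The function name has_valid_upc_check_digit is hypothetical and is not part of Woodwork. The sample values used in this guide were not chosen with check digits in mind, so they may fail this stricter test.

```python
def has_valid_upc_check_digit(code):
    """Return True if `code` is 12 ASCII digits with a valid UPC-A check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    # Digits in odd positions (1st, 3rd, ..., 11th) are weighted by 3,
    # digits in even positions (2nd, 4th, ..., 10th) are weighted by 1.
    total = 3 * sum(digits[0:11:2]) + sum(digits[1:11:2])
    # The check digit brings the weighted total up to a multiple of 10.
    return (10 - total % 10) % 10 == digits[11]


print(has_valid_upc_check_digit('036000291452'))
print(has_valid_upc_check_digit('036000291453'))
```

An inference function could apply a check like this to every value in the series, at the cost of rejecting columns that contain mistyped or synthetic codes.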

After defining our new UPC Code inference function, we can add it to the Woodwork type system, so it can be used when inferring column types.

[9]:
ww.type_system.update_inference_function('UPCCode', inference_function=infer_upc_code)

After updating the inference function, we can create a new DataTable from the same DataFrame. In doing so, we see that Woodwork has correctly identified the code column as having a LogicalType of UPCCode, set the physical type, and added the standard tags to the semantic tags for that column.

Also note that the not_upc column was identified as Categorical. Even though this column contains 12-character strings, one of the values contains letters, and our inference function correctly told Woodwork this was not valid for the UPCCode LogicalType.

[10]:
dt = ww.DataTable(dataframe)
dt
[10]:
             Physical Type  Logical Type     Semantic Tag(s)
Data Column
id           Int64          Integer          ['numeric']
code         category       UPCCode          ['upc_code', 'category']
not_upc      category       Categorical      ['category']
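Because an inference function is just a function of a pandas.Series that returns True or False, it can be exercised directly on individual columns before it is registered. The sketch below does this with a pattern-based variant that avoids the integer conversion. The function name matches_upc_pattern is hypothetical, and the series repeat the sample data used in this guide.

```python
import pandas as pd


def matches_upc_pattern(series):
    """Return True if every value is a string of exactly 12 ASCII digits."""
    # Reject the column if any value is not a Python string.
    if not all(isinstance(value, str) for value in series):
        return False
    # fullmatch requires the whole string to match the pattern.
    return bool(series.str.fullmatch(r'[0-9]{12}').all())


code = pd.Series(['012345412359', '122345712358', '012345412359', '022323413459'])
not_upc = pd.Series(['abcdefghijkl', '122345712358', '012345412359', '022323413459'])
ids = pd.Series([0, 1, 2, 3])

print(matches_upc_pattern(code))
print(matches_upc_pattern(not_upc))
print(matches_upc_pattern(ids))
```

Calling the function this way makes it easy to see which column causes a mismatch without creating a DataTable.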

Overriding Default Inference Functions

Overriding the default inference functions can also be done using the update_inference_function type system method. Simply pass in the LogicalType for which you want to override the function, along with the new function to use.

For example, we can tell Woodwork to use our new infer_upc_code function for the built-in Categorical LogicalType as well.

[11]:
ww.type_system.update_inference_function('Categorical', inference_function=infer_upc_code)

If we create a new DataTable after updating the Categorical function, we will see that the not_upc column is no longer identified as a Categorical column, but is instead set to the default NaturalLanguage LogicalType. This is because the letters in the first row of the not_upc column cause our inference function to return False for this column, while the default Categorical function will allow non-numeric values to be present.

[12]:
dt = ww.DataTable(dataframe)
dt
[12]:
             Physical Type  Logical Type     Semantic Tag(s)
Data Column
id           Int64          Integer          ['numeric']
code         category       UPCCode          ['upc_code', 'category']
not_upc      string         NaturalLanguage  []

Updating LogicalType Relationships

If you need to change the parent for a registered LogicalType, you can do this using the update_relationship method. Let’s update our new UPCCode LogicalType to be a child of NaturalLanguage instead.

[13]:
ww.type_system.update_relationship('UPCCode', parent='NaturalLanguage')

The parent for a logical type can also be set to None to indicate this is a root-level LogicalType that is not related to any other existing LogicalType.

[14]:
ww.type_system.update_relationship('UPCCode', parent=None)

Setting the proper parent-child relationships between logical types is important. Because Woodwork will return the most specific LogicalType match found during inference, improper inference can occur if the relationships are not set correctly.

As an example, if we create a new DataTable after setting our new UPCCode LogicalType to have a parent of None, we will now see that the UPC Code column is inferred as Categorical instead of UPCCode. After setting the parent to None, UPCCode and Categorical are now siblings in the relationship graph, instead of having a parent-child relationship as they did previously. When Woodwork finds multiple matches on the same level in the relationship graph, the first match is returned, which in this case is Categorical. Without proper parent-child relationships set, Woodwork is unable to determine which LogicalType is most specific.

[15]:
dt = ww.DataTable(dataframe)
dt
[15]:
             Physical Type  Logical Type     Semantic Tag(s)
Data Column
id           Int64          Integer          ['numeric']
code         category       Categorical      ['category']
not_upc      string         NaturalLanguage  []

Removing a LogicalType

If a LogicalType is no longer needed, or is unwanted, it can be removed from the type system with the remove_type method. If a LogicalType that has children is removed, all of its child types will have their parent set to the parent of the LogicalType being removed, assuming a parent was defined.
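The re-parenting rule can be sketched with a plain dictionary that maps each type name to its parent. This is an illustration of the rule as described here, not Woodwork's internal code. The standalone remove_type function below and the CouponCode child type are hypothetical and exist only for this example.

```python
def remove_type(parents, name):
    """Return a new {type: parent} mapping without `name`.

    Children of the removed type are attached to the removed type's own parent.
    """
    grandparent = parents[name]
    return {
        child: (grandparent if parent == name else parent)
        for child, parent in parents.items()
        if child != name
    }


# 'CouponCode' is a hypothetical child type used only for this example.
parents = {
    'Categorical': None,
    'UPCCode': 'Categorical',
    'CouponCode': 'UPCCode',
}

print(remove_type(parents, 'UPCCode'))
```

In the sketch, removing UPCCode leaves CouponCode attached to Categorical. Removing a root type would leave its children with a parent of None.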

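Now that we are done with our example, let’s remove our custom UPCCode type and list the available LogicalTypes to confirm the removal. Note that UPCCode still appears in the listing, but its is_registered value is now False and its parent_type is None.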

[16]:
ww.type_system.remove_type('UPCCode')
ww.list_logical_types()
[16]:
    name             type_string       description                                        physical_type    standard_tags         is_default_type  is_registered  parent_type
0   Boolean          boolean           Represents Logical Types that contain binary v...  boolean          {}                    True             True           None
1   Categorical      categorical       Represents Logical Types that contain unordere...  category         {category}            True             True           None
2   CountryCode      country_code      Represents Logical Types that contain categori...  category         {category}            True             True           Categorical
3   Datetime         datetime          Represents Logical Types that contain date and...  datetime64[ns]   {}                    True             True           None
4   Double           double            Represents Logical Types that contain positive...  float64          {numeric}             True             True           None
5   EmailAddress     email_address     Represents Logical Types that contain email ad...  string           {}                    True             True           NaturalLanguage
6   Filepath         filepath          Represents Logical Types that specify location...  string           {}                    True             True           NaturalLanguage
7   FullName         full_name         Represents Logical Types that may contain firs...  string           {}                    True             True           NaturalLanguage
8   IPAddress        ip_address        Represents Logical Types that contain IP addre...  string           {}                    True             True           NaturalLanguage
9   Integer          integer           Represents Logical Types that contain positive...  Int64            {numeric}             True             True           None
10  LatLong          lat_long          Represents Logical Types that contain latitude...  object           {}                    True             True           None
11  NaturalLanguage  natural_language  Represents Logical Types that contain text or ...  string           {}                    True             True           None
12  Ordinal          ordinal           Represents Logical Types that contain ordered ...  category         {category}            True             True           Categorical
13  PhoneNumber      phone_number      Represents Logical Types that contain numeric ...  string           {}                    True             True           NaturalLanguage
14  SubRegionCode    sub_region_code   Represents Logical Types that contain codes re...  category         {category}            True             True           Categorical
15  Timedelta        timedelta         Represents Logical Types that contain values s...  timedelta64[ns]  {}                    True             True           None
16  UPCCode          upc_code          Represents Logical Types that contain 12-digit...  category         {upc_code, category}  False            False          None
17  URL              url               Represents Logical Types that contain URLs, wh...  string           {}                    True             True           NaturalLanguage
18  ZIPCode          zip_code          Represents Logical Types that contain a series...  category         {category}            True             True           Categorical

Resetting Type System to Defaults

Finally, if you have made multiple changes to the default Woodwork type system and would like to reset everything back to the default state, you can use the reset_defaults method as shown below. This will unregister any new types you have registered, reset all relationships to their default values, and set all inference functions back to their default functions.

[17]:
ww.type_system.reset_defaults()