# Customizing Logical Types and Type Inference

The default type system in Woodwork contains many built-in LogicalTypes that work for a wide variety of datasets. For situations in which the built-in LogicalTypes are not sufficient, Woodwork allows you to create custom LogicalTypes.

Woodwork also has a set of standard type inference functions that help identify the correct LogicalTypes in the data automatically. You can override these functions, or add new functions for inferring any custom LogicalTypes that you add.

This guide provides an overview of how to create custom LogicalTypes as well as how to override and add new type inference functions.

## Viewing Built-In Logical Types

To view all of the default LogicalTypes in Woodwork, use the list_logical_types function. If the existing types are not sufficient for your needs, you can create and register new LogicalTypes for use in creating DataTables and DataColumns.

[1]:

import woodwork as ww

ww.list_logical_types()

[1]:

| | name | type_string | description | physical_type | standard_tags | is_default_type | is_registered | parent_type |
|---|---|---|---|---|---|---|---|---|
| 0 | Boolean | boolean | Represents Logical Types that contain binary v... | boolean | {} | True | True | None |
| 1 | Categorical | categorical | Represents Logical Types that contain unordere... | category | {category} | True | True | None |
| 2 | CountryCode | country_code | Represents Logical Types that contain categori... | category | {category} | True | True | Categorical |
| 3 | Datetime | datetime | Represents Logical Types that contain date and... | datetime64[ns] | {} | True | True | None |
| 4 | Double | double | Represents Logical Types that contain positive... | float64 | {numeric} | True | True | None |
| 6 | Filepath | filepath | Represents Logical Types that specify location... | string | {} | True | True | NaturalLanguage |
| 7 | FullName | full_name | Represents Logical Types that may contain firs... | string | {} | True | True | NaturalLanguage |
| 9 | Integer | integer | Represents Logical Types that contain positive... | Int64 | {numeric} | True | True | None |
| 10 | LatLong | lat_long | Represents Logical Types that contain latitude... | object | {} | True | True | None |
| 11 | NaturalLanguage | natural_language | Represents Logical Types that contain text or ... | string | {} | True | True | None |
| 12 | Ordinal | ordinal | Represents Logical Types that contain ordered ... | category | {category} | True | True | Categorical |
| 13 | PhoneNumber | phone_number | Represents Logical Types that contain numeric ... | string | {} | True | True | NaturalLanguage |
| 14 | SubRegionCode | sub_region_code | Represents Logical Types that contain codes re... | category | {category} | True | True | Categorical |
| 15 | Timedelta | timedelta | Represents Logical Types that contain values s... | timedelta64[ns] | {} | True | True | None |
| 16 | URL | url | Represents Logical Types that contain URLs, wh... | string | {} | True | True | NaturalLanguage |
| 17 | ZIPCode | zip_code | Represents Logical Types that contain a series... | category | {category} | True | True | Categorical |

## Registering a New LogicalType

The first step in registering a new LogicalType is to define the class for the new type. This is done by sub-classing the built-in LogicalType class. There are a few class attributes that should be set when defining this new class. Each is reviewed in more detail below.

This guide works through an example for a dataset that contains UPC Codes. First, create a new UPCCode LogicalType, treating the UPC Code as a type of categorical variable.

[2]:

from woodwork.logical_types import LogicalType

class UPCCode(LogicalType):
    """Represents Logical Types that contain 12-digit UPC Codes."""
    pandas_dtype = 'category'
    backup_dtype = 'str'
    standard_tags = {'category', 'upc_code'}


When defining the UPCCode LogicalType class, three class attributes were set, along with a docstring. All of these are optional. Any attribute that is not set defaults to the value defined on the LogicalType class.

• pandas_dtype: This value specifies how the data will be stored. If the column in the underlying dataframe is not already of this type, Woodwork will convert the data to this dtype. This should be specified as a string that represents a valid pandas dtype. If not specified, this will default to 'string'.

• backup_dtype: This is primarily useful when working with Koalas dataframes. backup_dtype specifies the dtype to use if Woodwork is unable to convert to the primary dtype specified by pandas_dtype. In our example, we set this to 'str' since Koalas does not currently support the 'category' dtype.

• standard_tags: This is a set of semantic tags to apply to any column that is set with the specified LogicalType. If not specified, standard_tags will default to an empty set.

• docstring: Adding a docstring for the class is optional, but if specified, this text will be used for adding a description of the type in the list of available types returned by ww.list_logical_types().

Note

Behind the scenes, Woodwork uses the category and numeric semantic tags to determine whether a column is categorical or numeric, respectively. If the new LogicalType you define represents a categorical or numeric type, you should include the appropriate tag in the set of tags specified for standard_tags.

Now that you have created the new LogicalType, you can register it with the Woodwork type system so you can use it. All modifications to the type system are performed by calling the appropriate method on the ww.type_system object.

[3]:

ww.type_system.add_type(UPCCode, parent='Categorical')


If you once again list the available LogicalTypes, you will see the new type you created was added to the list, including the values for description, physical_type and standard_tags specified when defining the UPCCode LogicalType.

[4]:

ww.list_logical_types()

[4]:

| | name | type_string | description | physical_type | standard_tags | is_default_type | is_registered | parent_type |
|---|---|---|---|---|---|---|---|---|
| 0 | Boolean | boolean | Represents Logical Types that contain binary v... | boolean | {} | True | True | None |
| 1 | Categorical | categorical | Represents Logical Types that contain unordere... | category | {category} | True | True | None |
| 2 | CountryCode | country_code | Represents Logical Types that contain categori... | category | {category} | True | True | Categorical |
| 3 | Datetime | datetime | Represents Logical Types that contain date and... | datetime64[ns] | {} | True | True | None |
| 4 | Double | double | Represents Logical Types that contain positive... | float64 | {numeric} | True | True | None |
| 6 | Filepath | filepath | Represents Logical Types that specify location... | string | {} | True | True | NaturalLanguage |
| 7 | FullName | full_name | Represents Logical Types that may contain firs... | string | {} | True | True | NaturalLanguage |
| 9 | Integer | integer | Represents Logical Types that contain positive... | Int64 | {numeric} | True | True | None |
| 10 | LatLong | lat_long | Represents Logical Types that contain latitude... | object | {} | True | True | None |
| 11 | NaturalLanguage | natural_language | Represents Logical Types that contain text or ... | string | {} | True | True | None |
| 12 | Ordinal | ordinal | Represents Logical Types that contain ordered ... | category | {category} | True | True | Categorical |
| 13 | PhoneNumber | phone_number | Represents Logical Types that contain numeric ... | string | {} | True | True | NaturalLanguage |
| 14 | SubRegionCode | sub_region_code | Represents Logical Types that contain codes re... | category | {category} | True | True | Categorical |
| 15 | Timedelta | timedelta | Represents Logical Types that contain values s... | timedelta64[ns] | {} | True | True | None |
| 16 | UPCCode | upc_code | Represents Logical Types that contain 12-digit... | category | {upc_code, category} | False | True | Categorical |
| 17 | URL | url | Represents Logical Types that contain URLs, wh... | string | {} | True | True | NaturalLanguage |
| 18 | ZIPCode | zip_code | Represents Logical Types that contain a series... | category | {category} | True | True | Categorical |

### Logical Type Relationships

When adding a new type to the type system, you can specify an optional parent LogicalType, as done above. During type inference, a given set of data might match multiple LogicalTypes. Woodwork uses the parent-child relationship defined when registering a type to determine which type to infer in this case.

When multiple matches are found, Woodwork will return the most specific type match found. By setting the parent type to Categorical when registering the UPCCode LogicalType, you are telling Woodwork that if a data column matches both Categorical and UPCCode during inference, the column should be considered as UPCCode as this is more specific than Categorical. Woodwork always assumes that a child type is a more specific version of the parent type.
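The idea of choosing the most specific match can be sketched with a toy model in plain Python. This is only an illustration under stated assumptions: the PARENTS dictionary and the depth and most_specific helpers are hypothetical names invented here, and they are not Woodwork's internal data structures or its actual resolution algorithm.

```python
# Toy model only: the dictionary and functions below are hypothetical and are
# not Woodwork's internal data structures or algorithm.

# Map each type name to its parent (None marks a root-level type).
PARENTS = {
    "Categorical": None,
    "NaturalLanguage": None,
    "UPCCode": "Categorical",
}


def depth(type_name, parents):
    """Count how many ancestors a type has in the relationship graph."""
    steps = 0
    while parents[type_name] is not None:
        type_name = parents[type_name]
        steps += 1
    return steps


def most_specific(matches, parents):
    """Return the matching type that sits deepest in the graph."""
    return max(matches, key=lambda name: depth(name, parents))


# A column that matches both Categorical and UPCCode resolves to the child.
print(most_specific(["Categorical", "UPCCode"], PARENTS))
```

In this sketch, a child always has a greater depth than its parent, so the child wins whenever both match.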

## Working with Custom LogicalTypes

Next, you will create a small sample DataFrame to demonstrate use of the new custom type. This sample DataFrame includes an id column, a column with valid UPC Codes, and a column that should not be considered UPC Codes because it contains non-numeric values.

[5]:

import pandas as pd
dataframe = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'code': ['012345412359', '122345712358', '012345412359', '022323413459'],
    'not_upc': ['abcdefghijkl', '122345712358', '012345412359', '022323413459']
})


Before using this dataframe, update Woodwork’s default threshold for differentiating between NaturalLanguage and Categorical columns so that Woodwork correctly recognizes the code column as Categorical. After setting the threshold, create a new DataTable and verify that Woodwork has identified the code column as Categorical.

[6]:

ww.config.set_option('natural_language_threshold', 12)
dt = ww.DataTable(dataframe)
dt

[6]:

| Data Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| id | Int64 | Integer | ['numeric'] |
| code | category | Categorical | ['category'] |
| not_upc | category | Categorical | ['category'] |

Woodwork did not identify the code column as having the UPCCode LogicalType because you have not yet defined an inference function for this type. The inference function is what tells Woodwork how to match columns to specific LogicalTypes.

Even without the inference function, you can manually tell Woodwork that the code column should be of type UPCCode. This will set the physical type properly and apply the standard semantic tags you have defined.

[7]:

dt = ww.DataTable(dataframe, logical_types={'code': 'UPCCode'})
dt

[7]:

| Data Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| id | Int64 | Integer | ['numeric'] |
| code | category | UPCCode | ['upc_code', 'category'] |
| not_upc | category | Categorical | ['category'] |

Next, add a new inference function and allow Woodwork to automatically set the correct type for the code column.

## Defining Custom Inference Functions

The first step in adding an inference function for the UPCCode LogicalType is to define an appropriate function. Inference functions always accept a single parameter, a pandas.Series. The function should return True if the series is a match for the LogicalType with which the function is associated, or False if the series is not a match.

For the UPCCode LogicalType, define a function to check that all of the values in a column are 12 character strings that contain only numbers. Note, this function is for demonstration purposes only and may not catch all cases that need to be considered for properly identifying a UPC Code.

[8]:

def infer_upc_code(series):
    # Make sure series contains only strings:
    if not series.apply(type).eq(str).all():
        return False
    # Check that all items are 12 characters long
    if all(series.str.len() == 12):
        # Try to convert to a number
        try:
            series.astype('int')
            return True
        except (ValueError, TypeError):
            # At least one value could not be parsed as an integer
            return False
    return False
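
Because the check above relies on integer conversion, it may accept strings that are not made up purely of digits, such as a 12-character value with a leading plus sign. A stricter variant can match each value against a digits-only pattern instead. The following is a minimal sketch, not part of Woodwork: the name infer_upc_code_strict is hypothetical, and it assumes a pandas version that provides Series.str.fullmatch (1.1 or later).

```python
import pandas as pd


def infer_upc_code_strict(series):
    """Hypothetical stricter check: every value is a string of exactly 12 ASCII digits."""
    # Reject the column if any value is not a Python string (ints, NaN, ...)
    if not series.map(lambda value: isinstance(value, str)).all():
        return False
    # fullmatch requires the whole string to match the pattern
    return bool(series.str.fullmatch(r"[0-9]{12}").all())


codes = pd.Series(['012345412359', '122345712358'])
signed = pd.Series(['+12345678901', '122345712358'])
print(infer_upc_code_strict(codes), infer_upc_code_strict(signed))
```

Like the original function, this only checks the shape of the values and does not validate a UPC check digit.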


After defining the new UPC Code inference function, add it to the Woodwork type system so it can be used when inferring column types.

[9]:

ww.type_system.update_inference_function('UPCCode', inference_function=infer_upc_code)


After updating the inference function, you can create a new DataTable from the same DataFrame. Notice that Woodwork has correctly identified the code column to have a LogicalType of UPCCode and has correctly set the physical type and added the standard tags to the semantic tags for that column.

Also note that the not_upc column was identified as Categorical. Even though this column contains 12-character strings, one of the values contains letters, and our inference function correctly told Woodwork this was not valid for the UPCCode LogicalType.

[10]:

dt = ww.DataTable(dataframe)
dt

[10]:

| Data Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| id | Int64 | Integer | ['numeric'] |
| code | category | UPCCode | ['upc_code', 'category'] |
| not_upc | category | Categorical | ['category'] |

## Overriding Default Inference Functions

Overriding the default inference functions is done with the update_inference_function TypeSystem method. Simply pass in the LogicalType for which you want to override the function, along with the new function to use.

For example, you can tell Woodwork to use the new infer_upc_code function for the built-in Categorical LogicalType.

[11]:

ww.type_system.update_inference_function('Categorical', inference_function=infer_upc_code)


If you create a new DataTable after updating the Categorical function, you can see that the not_upc column is no longer identified as a Categorical column, but is rather set to the default NaturalLanguage LogicalType. This is because the letters in the first row of the not_upc column cause our inference function to return False for this column, while the default Categorical function will allow non-numeric values to be present.

[12]:

dt = ww.DataTable(dataframe)
dt

[12]:

| Data Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| id | Int64 | Integer | ['numeric'] |
| code | category | UPCCode | ['upc_code', 'category'] |
| not_upc | string | NaturalLanguage | [] |

## Updating LogicalType Relationships

If you need to change the parent for a registered LogicalType, you can do this with the update_relationship method. Update the new UPCCode LogicalType to be a child of NaturalLanguage instead.

[13]:

ww.type_system.update_relationship('UPCCode', parent='NaturalLanguage')


The parent for a logical type can also be set to None to indicate this is a root-level LogicalType that is not related to any other existing LogicalType.

[14]:

ww.type_system.update_relationship('UPCCode', parent=None)


Setting the proper parent-child relationships between logical types is important. Because Woodwork will return the most specific LogicalType match found during inference, improper inference can occur if the relationships are not set correctly.

As an example, if you create a new DataTable after setting the UPCCode LogicalType to have a parent of None, you will now see that the UPC Code column is inferred as Categorical instead of UPCCode. After setting the parent to None, UPCCode and Categorical are now siblings in the relationship graph instead of having a parent-child relationship as they did previously. When Woodwork finds multiple matches on the same level in the relationship graph, the first match is returned, which in this case is Categorical. Without proper parent-child relationships set, Woodwork is unable to determine which LogicalType is most specific.

[15]:

dt = ww.DataTable(dataframe)
dt

[15]:

| Data Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| id | Int64 | Integer | ['numeric'] |
| code | category | Categorical | ['category'] |
| not_upc | string | NaturalLanguage | [] |

## Removing a LogicalType

If a LogicalType is no longer needed, or is unwanted, it can be removed from the type system with the remove_type method. If a LogicalType that has children is removed, all of the children types will have their parent set to the parent of the LogicalType that is being removed, assuming a parent was defined.
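
The re-parenting rule described above can be sketched with a plain dictionary standing in for the relationship graph. This is a toy model under stated assumptions, not Woodwork's implementation: the remove_type helper here is a standalone hypothetical function, and RetailUPC is an invented child type used only for illustration. When the removed type has no parent, this sketch leaves its children as root-level types, which is an assumption rather than documented behavior.

```python
# Toy model only: a plain dict stands in for the type system's relationship
# graph; the names and function are hypothetical, not Woodwork internals.

def remove_type(parents, removed):
    """Return a new child -> parent map with `removed` taken out.

    Children of the removed type are attached to the removed type's parent.
    """
    grandparent = parents[removed]
    return {
        child: (grandparent if parent == removed else parent)
        for child, parent in parents.items()
        if child != removed
    }


parents = {
    "NaturalLanguage": None,
    "Categorical": None,
    "UPCCode": "Categorical",
    "RetailUPC": "UPCCode",  # hypothetical child type, for illustration only
}

# RetailUPC is re-attached to Categorical once UPCCode is removed.
print(remove_type(parents, "UPCCode"))
```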

Remove the custom UPCCode type and confirm the removal by listing the available LogicalTypes. UPCCode still appears in the list, but its is_registered value is now False.

[16]:

ww.type_system.remove_type('UPCCode')
ww.list_logical_types()

[16]:

| | name | type_string | description | physical_type | standard_tags | is_default_type | is_registered | parent_type |
|---|---|---|---|---|---|---|---|---|
| 0 | Boolean | boolean | Represents Logical Types that contain binary v... | boolean | {} | True | True | None |
| 1 | Categorical | categorical | Represents Logical Types that contain unordere... | category | {category} | True | True | None |
| 2 | CountryCode | country_code | Represents Logical Types that contain categori... | category | {category} | True | True | Categorical |
| 3 | Datetime | datetime | Represents Logical Types that contain date and... | datetime64[ns] | {} | True | True | None |
| 4 | Double | double | Represents Logical Types that contain positive... | float64 | {numeric} | True | True | None |
| 6 | Filepath | filepath | Represents Logical Types that specify location... | string | {} | True | True | NaturalLanguage |
| 7 | FullName | full_name | Represents Logical Types that may contain firs... | string | {} | True | True | NaturalLanguage |
| 9 | Integer | integer | Represents Logical Types that contain positive... | Int64 | {numeric} | True | True | None |
| 10 | LatLong | lat_long | Represents Logical Types that contain latitude... | object | {} | True | True | None |
| 11 | NaturalLanguage | natural_language | Represents Logical Types that contain text or ... | string | {} | True | True | None |
| 12 | Ordinal | ordinal | Represents Logical Types that contain ordered ... | category | {category} | True | True | Categorical |
| 13 | PhoneNumber | phone_number | Represents Logical Types that contain numeric ... | string | {} | True | True | NaturalLanguage |
| 14 | SubRegionCode | sub_region_code | Represents Logical Types that contain codes re... | category | {category} | True | True | Categorical |
| 15 | Timedelta | timedelta | Represents Logical Types that contain values s... | timedelta64[ns] | {} | True | True | None |
| 16 | UPCCode | upc_code | Represents Logical Types that contain 12-digit... | category | {upc_code, category} | False | False | None |
| 17 | URL | url | Represents Logical Types that contain URLs, wh... | string | {} | True | True | NaturalLanguage |
| 18 | ZIPCode | zip_code | Represents Logical Types that contain a series... | category | {category} | True | True | Categorical |

## Resetting Type System to Defaults

Finally, if you made multiple changes to the default Woodwork type system and would like to reset everything back to the default state, you can use the reset_defaults method as shown below. This unregisters any new types you have registered, resets all relationships to their default values and sets all inference functions back to their default functions.

[17]:

ww.type_system.reset_defaults()


## Overriding Default LogicalTypes

There may be times when you would like to override Woodwork’s default LogicalTypes. An example might be if the Int64 dtype that the Integer LogicalType uses is incompatible with your data and you would like it to use int64. In this case, you want to stop Woodwork from inferring the default Integer LogicalType and have a compatible Logical Type inferred instead. You may solve this issue in one of two ways.

First, you can create an entirely new LogicalType with its own name, MyInteger, and register it in the TypeSystem. To have it inferred in place of the normal Integer LogicalType, remove Integer from the type system and use Integer’s default inference function for MyInteger. MyInteger will then be inferred anywhere Integer would have been previously.

[18]:

from woodwork.logical_types import LogicalType

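class MyInteger(LogicalType):
    pandas_dtype = 'int64'
    standard_tags = {'numeric'}

# Keep a reference to Integer's default inference function before removing it
int_inference_fn = ww.type_system.inference_functions[ww.logical_types.Integer]

ww.type_system.remove_type(ww.logical_types.Integer)
# Register MyInteger so it is inferred wherever Integer would have been
ww.type_system.add_type(MyInteger, inference_function=int_inference_fn)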

dt = ww.DataTable(dataframe)
dt

[18]:

| Data Column | Physical Type | Logical Type | Semantic Tag(s) |
|---|---|---|---|
| id | int64 | MyInteger | ['numeric'] |
| code | category | Categorical | ['category'] |
| not_upc | category | Categorical | ['category'] |

Above, you can see that the id column, which was previously inferred as Integer, is now inferred as MyInteger with the int64 physical type. In the full list of Logical Types returned by ww.list_logical_types(), Integer and MyInteger will now both be present, but is_registered will be False for Integer and True for MyInteger.

The second option for overriding the default Logical Types allows you to create a new LogicalType with the same name as an existing one. This might be desirable because it will allow Woodwork to interpret the string 'Integer' as your new LogicalType, allowing previous code that might have selected 'Integer' to be used without updating references to a new LogicalType like MyInteger.

Before adding a LogicalType whose name already exists into the TypeSystem, you must first unregister the default LogicalType.

To avoid a local name collision between the two Integer classes, it is recommended to reference Woodwork’s default LogicalType as ww.logical_types.Integer.

[19]:

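ww.type_system.reset_defaults()

class Integer(LogicalType):
    pandas_dtype = 'int64'
    standard_tags = {'numeric'}

int_inference_fn = ww.type_system.inference_functions[ww.logical_types.Integer]

# Unregister the default Integer before adding the replacement with the same name
ww.type_system.remove_type(ww.logical_types.Integer)
ww.type_system.add_type(Integer, inference_function=int_inference_fn)

dt = ww.DataTable(dataframe)
dt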

Notice how id now gets inferred as an Integer Logical Type that has int64 as its Physical Type!