Understanding Types and Tags

Using Woodwork effectively requires a good understanding of physical types, logical types and semantic tags, concepts which are core to Woodwork. This guide provides a detailed overview of types and tags as well as how to work with them when using Woodwork.

Definitions of Types and Tags

Woodwork has been designed to allow users to easily specify additional information about data contained in a DataTable, while providing interfaces to access the underlying data based on this additional information. Because a single DataTable may store various types of data such as numbers, text or dates in different columns, the additional information is defined on a per-column basis.

There are three main ways that Woodwork stores additional information about user data:

  • Physical Type: defines how the data is stored on disk or in memory

  • Logical Type: defines how the data should be parsed or interpreted

  • Semantic Tag(s): provides additional data about the meaning of the data or how it should be used

Each of these are discussed in more detail throughout this guide.

Physical Types

Physical types define how the data is stored on disk or in memory. You may also see the physical type for a column referred to as the column’s dtype.

For example, typical Pandas dtypes often used include object, int64, float64 and datetime64[ns], although there are many more. Within Woodwork, there are seven different physical types that are used, each corresponding to a Pandas dtype. When a DataTable is created, the dtype of the underlying data will be converted to one of these values, if it is not already one of these types:

  • boolean

  • category

  • datetime64[ns]

  • float64

  • Int64

  • string

  • timedelta64[ns]

The physical type conversion is done based on the LogicalType that has been specified or inferred for a given column.

Logical Types

Logical types define how data should be interpreted or parsed. Logical types provide an additional level of detail beyond the physical type. Some columns might share the same physical type, but might have different parsing requirements depending on the information that is stored in the column.

For example, email addresses and phone numbers would typically both be stored in a data column with a physical type of string. However, when reading and validating these two types of information, very different rules apply. For email addresses, the presence of the @ symbol is important. For phone numbers, you may want to confirm that only a certain number of digits are present and special characters might be restricted to +, -, ( or ). In this particular example Woodwork defines two different logical types to separate these parsing needs: EmailAddress and PhoneNumber.

There are many different logical types defined within Woodwork. To get a complete list of all the available logical types, you can use the list_logical_types function as shown below.

[1]:
from woodwork import list_logical_types
list_logical_types()
[1]:
name type_string description physical_type standard_tags
0 Boolean boolean Represents Logical Types that contain binary v... boolean {}
1 Categorical categorical Represents Logical Types that contain unordere... category {category}
2 CountryCode country_code Represents Logical Types that contain categori... category {category}
3 Datetime datetime Represents Logical Types that contain date and... datetime64[ns] {}
4 Double double Represents Logical Types that contain positive... float64 {numeric}
5 Integer integer Represents Logical Types that contain positive... Int64 {numeric}
6 EmailAddress email_address Represents Logical Types that contain email ad... string {}
7 Filepath filepath Represents Logical Types that specify location... string {}
8 FullName full_name Represents Logical Types that may contain firs... string {}
9 IPAddress ip_address Represents Logical Types that contain IP addre... string {}
10 LatLong lat_long Represents Logical Types that contain latitude... string {}
11 NaturalLanguage natural_language Represents Logical Types that contain text or ... string {}
12 Ordinal ordinal Represents Logical Types that contain ordered ... category {category}
13 PhoneNumber phone_number Represents Logical Types that contain numeric ... string {}
14 SubRegionCode sub_region_code Represents Logical Types that contain codes re... category {category}
15 Timedelta timedelta Represents Logical Types that contain values s... timedelta64[ns] {}
16 URL url Represents Logical Types that contain URLs, wh... string {}
17 ZIPCode zip_code Represents Logical Types that contain a series... category {category}

In the table of logical types shown above, you will notice that each logical type has a specific pandas_dtype value associated with it. Within Woodwork, any time a logical type is set for a column, the physical type of the underlying data will be converted to the type shown in the pandas_dtype column. Within Woodwork, there is only one physical type associated with each logical type.

Semantic Tags

Semantic tags define additional context about the meaning of a data column. This could directly impact how the information contained in the column is interpreted. Unlike physical types and logical types, semantic tags are much less restrictive. A column may contain many semantic tags, or none at all. However, when assigning semantic tags, users should take care to not assign tags that have conflicting meanings.

As an example of how semantic tags can be useful, let’s consider a data set with two date columns: a signup date and a user birth date. Both of these columns will have the same physical type (datetime64[ns]) and both will have the same logical type (Datetime). However, semantic tags could be used to differentiate these columns. For example you might want to add the date_of_birth semantic tag to the user birth date column to indicate this column has special meaning and could be used to compute a user’s age. Computing an age from the signup date column would not make sense, so the semantic tag can be used to differentiate between what the dates in these columns really mean.

Standard Semantic Tags

As you can see from the table that was generated with the list_logical_types function above, Woodwork has some standard tags that are be applied to certain columns by default. Woodwork will add a standard set of semantic tags to columns with LogicalTypes that fall under certain, predefined categories.

The standard tags are as follows:

  • 'numeric' - The tag applied to numeric Logical Types

    • Integer

    • Double

  • 'category' - The tag applied to Logical Types that represent categorical variables

    • Categorical

    • CountryCode

    • Ordinal

    • SubRegionCode

    • ZIPCode

There are also two tags that get added to index columns. If no index columns have been specified, these tags are not present:

  • 'index' - on the index column, when specified

  • 'time_index' on the time index column, when specified

The application of standard tags, excluding the index and time_index tags, which have special meaning, can be controlled by the user. This will be discussed in more detail in the Working with Semantic Tags section below. There are a few different semantic tags defined within Woodwork. To get a list of the standard, index, and time index tags, you can use the list_semantic_tags function as shown below.

[2]:
from woodwork import list_semantic_tags
list_semantic_tags()
[2]:
name is_standard_tag valid_logical_types
0 category True [Categorical, CountryCode, Ordinal, SubRegionC...
1 numeric True [Double, Integer]
2 index False [Integer, Double, Categorical, Datetime]
3 time_index False [Datetime]
4 date_of_birth False [Datetime]

Working with Logical Types

When creating a DataTable, users have the option to specify the logical types for all, some, or none of the columns in the underlying dataframe. If logical types are defined for all of the columns, these logical types will be used directly, provided the data is compatible with the specified logical type. You cannot, for example, use a logical type of Integer on a column that contains text values that cannot be converted to integers.

If users do not supply any logical type information during creation of the DataTable, Woodwork will infer the logical types based on the physical type of the column and the information contained in the columns. If the user passes information for some of the columns, the logical types will be inferred for any columns not specified.

These scenarios are illustrated below. First, we will create a simple dataframe to use for this example.

[3]:
import pandas as pd
from woodwork import DataTable

data = pd.DataFrame({
    'integers': [-2, 30, 20],
    'bools': [True, False, True],
    'names': ["Jane Doe", "Bill Smith", "John Hancock"]
})
data
[3]:
integers bools names
0 -2 True Jane Doe
1 30 False Bill Smith
2 20 True John Hancock

Now that we have created the data to use for the example, we will create a DataTable object, assigning logical type values for each of the columns. We can then view the types stored for each column by using the DataTable.types property.

[4]:
logical_types = {
    'integers': 'Integer',
    'bools': 'Boolean',
    'names': 'FullName'
}

dt = DataTable(data, logical_types=logical_types)
dt.types
[4]:
Physical Type Logical Type Semantic Tag(s)
Data Column
integers Int64 Integer {numeric}
bools boolean Boolean {}
names string FullName {}

As you can see, the logical types that we specified have been assigned to each of the columns. Now let’s try this by assigning only one logical type value, and letting Woodwork infer the types for the other columns:

[5]:
logical_types = {
    'names': 'FullName',
}
dt = DataTable(data, logical_types=logical_types)
dt.types
[5]:
Physical Type Logical Type Semantic Tag(s)
Data Column
integers Int64 Integer {numeric}
bools boolean Boolean {}
names string FullName {}

With this input, we get the same results. Woodwork used the FullName logical type we assigned to the names column, and then correctly inferred the logical types for the integers and bools columns.

Next, we will look at what happens if we do not specify any logical types.

[6]:
dt = DataTable(data)
dt.types
[6]:
Physical Type Logical Type Semantic Tag(s)
Data Column
integers Int64 Integer {numeric}
bools boolean Boolean {}
names category Categorical {category}

In this case, Woodwork correctly inferred type for the integers and bools columns, but failed to recognize the names column should have a logical type of FullName. In situations like this, Woodwork provides users the ability to change the logical type.

Let’s update the logical type of the names column to be FullName:

[7]:
dt = dt.set_types(logical_types={'names': 'FullName'})
dt.types
[7]:
Physical Type Logical Type Semantic Tag(s)
Data Column
integers Int64 Integer {numeric}
bools boolean Boolean {}
names string FullName {}

If we look carefully at the output, we will see that several things happened to the names column. First of all, the correct FullName logical type is now applied. Second, the physical type of the column has been changed from category to string to match the standard physical type for the FullName logical type. Finally, the standard tag of category that was previously set for the names column has been removed, as it no longer applies.

When setting the LogicalType for a column, the type can be specified by passing a string representing the camel-case name of the LogicalType class as we have done above. Alternatively, users can pass the class directly instead of a string, or the snake-case name of the string. All of these would be valid values to use for setting the FullName Logical type: FullName, "FullName" or "full_name". Note, in order to use the class name the class must first be imported.

Working with Semantic Tags

Woodwork provides several methods for working with semantic types. Users can add and remove specific tags, as well as reset the tags to their default values. In this section, we will demonstrate these methods.

Note

When calling any of the methods that modify the logical type or semantic tags for a column, a new DataTable object will be returned and the original DataTable will be left unchanged. If you need to use the modified table, be sure to assign the returned DataTable to a variable to allow for proper access to it.

Standard Tags

As mentioned above, by default Woodwork will apply default semantic tags to columns, based on the logical type that was specified or inferred. If this behavior is undesirable, it can be controlled by setting the parameter use_standard_tags to False when creating the DataTable:

[8]:
dt = DataTable(data, use_standard_tags=False)
dt.types
[8]:
Physical Type Logical Type Semantic Tag(s)
Data Column
integers Int64 Integer {}
bools boolean Boolean {}
names category Categorical {}

As can be seen in the table above, when creating a DataTable with the use_standard_tags set to False, all semantic tags will be empty. The only exception to this is if the index or time index column were set, and this will be discussed in more detail below.

Now, lets create a new table with the standard tags, and specify some additional user-defined semantic tags during creation:

[9]:
semantic_tags = {
    'bools': 'user_status',
    'names': 'legal_name'
}
dt = DataTable(data, semantic_tags=semantic_tags)
dt.types
[9]:
Physical Type Logical Type Semantic Tag(s)
Data Column
integers Int64 Integer {numeric}
bools boolean Boolean {user_status}
names category Categorical {legal_name, category}

Woodwork has applied the tags we specified along with any standard tags to the columns in our DataTable.

After creating the table, maybe we change our mind and decide we do not like the tag of user_status that was applied to the bools column and we want to remove it. We can do that with the remove_semantic_tags method.

[10]:
dt = dt.remove_semantic_tags({'bools':'user_status'})
dt.types
[10]:
Physical Type Logical Type Semantic Tag(s)
Data Column
integers Int64 Integer {numeric}
bools boolean Boolean {}
names category Categorical {legal_name, category}

As you can see, the user_status tag has now been removed.

Multiple tags can also be added to a column at once by passing a list of tags to add, instead of a single tag. Similarly, multiple tags can be removed at once, by passing a list of tags.

[11]:
dt = dt.add_semantic_tags({'bools':['tag1', 'tag2']})
dt.types
[11]:
Physical Type Logical Type Semantic Tag(s)
Data Column
integers Int64 Integer {numeric}
bools boolean Boolean {tag1, tag2}
names category Categorical {legal_name, category}
[12]:
dt = dt.remove_semantic_tags({'bools':['tag1', 'tag2']})
dt.types
[12]:
Physical Type Logical Type Semantic Tag(s)
Data Column
integers Int64 Integer {numeric}
bools boolean Boolean {}
names category Categorical {legal_name, category}

Finally, all tags can be reset to their default values by using the reset_semantic_tags methods. If use_standard_tags is True, the tags will be reset to the standard tags. Otherwise, the tags will be reset to be empty sets.

[13]:
dt = dt.reset_semantic_tags()
dt.types
[13]:
Physical Type Logical Type Semantic Tag(s)
Data Column
integers Int64 Integer {numeric}
bools boolean Boolean {}
names category Categorical {category}

In this case, since we created our DataTable with the default behavior of using standard tags, calling reset_semantic_tags resulted in all of our semantic tags being reset to the standard tags for each column.

Index and Time Index Tags

When creating a DataTable users can optionally specify which column represents the index and which column represents the time index. If these columns are specified, semantic tags of index and time_index will be applied to the specified column. Behind the scenes, Woodwork is performing additional validation checks on the columns to make sure they are appropriate. For example, index columns must be unique, and time index columns must contain datetime values or values that can be converted to datetimes.

Because of the need for these validation checks, users cannot set the index or time_index tags directly on a column. In order to designate a column as the index, the set_index method should be used. Similarly, in order to set the time index column, the set_time_index method should be used. Optionally, these can be specified when initially creating the DataTable.

Setting or changing the index

Let’s create a new sample data set that contains columns that can be used as index and time index columns by using the index or time_index parameters.

[14]:
data = pd.DataFrame({
    'index': [0, 1, 2],
    'id': [1, 2, 3],
    'times': pd.to_datetime(['2020-09-01', '2020-09-02', '2020-09-03']),
    'numbers': [10, 20, 30]
})

dt = DataTable(data)
dt.types
[14]:
Physical Type Logical Type Semantic Tag(s)
Data Column
index Int64 Integer {numeric}
id Int64 Integer {numeric}
times datetime64[ns] Datetime {}
numbers Int64 Integer {numeric}

Without specifying an index or time index column during creation of the DataTable, Woodwork has inferred that the index and id columns are whole numbers and the numeric semantic tag has been applied. We can now set the index column with the set_index method:

[15]:
dt = dt.set_index('index')
dt.types
[15]:
Physical Type Logical Type Semantic Tag(s)
Data Column
index Int64 Integer {index}
id Int64 Integer {numeric}
times datetime64[ns] Datetime {}
numbers Int64 Integer {numeric}

Inspecting the types now reveals that the index semantic tag has been added to the index column, and the numeric standard tag has been removed. We can also check that the index has been set correctly by checking the value of the DataTable.index attribute:

[16]:
dt.index
[16]:
'index'

Perhaps we wanted to change the index column to be the id column instead. We can do this simply with another call to set_index.

[17]:
dt = dt.set_index('id')
dt.types
[17]:
Physical Type Logical Type Semantic Tag(s)
Data Column
index Int64 Integer {}
id Int64 Integer {index}
times datetime64[ns] Datetime {}
numbers Int64 Integer {numeric}

Now we can see that the index tag has been removed from the index column and added to the id column. Also, of note, the numeric standard tag that was originally present on the index column has not been added back. If this tag is desired, the user must currently add it back using the add_semantic_tags method.

Setting or changing the time index

Setting the time index works similarly to setting the index. We can now set the time index for the DataTable with the set_time_index method.

[18]:
dt = dt.set_time_index('times')
dt.types
[18]:
Physical Type Logical Type Semantic Tag(s)
Data Column
index Int64 Integer {numeric}
id Int64 Integer {index}
times datetime64[ns] Datetime {time_index}
numbers Int64 Integer {numeric}

As you can see, after calling set_time_index, the time_index semantic tag has been added to the times column of the DataTable.