Global Configuration Options#

Woodwork contains global configuration options that you can use to control the behavior of certain aspects of Woodwork. This guide provides an overview of working with those options, including viewing the current settings and updating the config values.

Viewing Config Settings#

To view the current configuration options, import Woodwork and evaluate ww.config, as shown below.

[1]:
import woodwork as ww

ww.config
[1]:
Woodwork Global Config Settings
-------------------------------
categorical_threshold: 0.2
numeric_categorical_threshold: None
email_inference_regex: (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
url_inference_regex: (http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
ipv4_inference_regex: (^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)
ipv6_inference_regex: (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
phone_inference_regex: (?:\+?(0{2})?1[-.\s●]?)?\(?([2-9][0-9]{2})\)?[-\.\s●]?([2-9][0-9]{2})[-\.\s●]?([0-9]{4})$
postal_code_inference_regex: ^[0-9]{5}(?:-[0-9]{4})?$
nan_values: ['', ' ', None, nan, NaT, 'None', 'NONE', 'none', 'NULL', 'Null', 'null', 'NAN', 'NaN', 'Nan', 'nan', 'NA', 'na', 'N/A', 'n/a', 'n/A', 'N/a', '<NA>', '<N/A>', '<n/a>', '<na>']
frequence_inference_window_length: 15
frequence_inference_threshold: 0.9
correlation_metrics: ['mutual_info', 'pearson', 'spearman', 'max', 'all']
medcouple_threshold: 0.3
medcouple_sample_size: 10000
boolean_inference_strings: {frozenset({'no', 'yes'}), frozenset({'y', 'n'}), frozenset({'false', 'true'}), frozenset({'f', 't'})}
boolean_transform_mappings: {'yes': True, 'no': False, 'y': True, 'n': False, 'true': True, 'false': False, 't': True, 'f': False}
boolean_inference_ints: {}

The output of ww.config lists each available config variable followed by its current setting. In the output above, for example, categorical_threshold is set to 0.2 and numeric_categorical_threshold is set to None.

Updating Config Settings#

To update a config variable, call the ww.config.set_option function. This function requires two arguments: the name of the config variable to update and the new value to set.

As an example, update the categorical_threshold config variable to have a value of 0.5 instead of the default value.

[2]:
ww.config.set_option("categorical_threshold", 0.5)
ww.config
[2]:
Woodwork Global Config Settings
-------------------------------
categorical_threshold: 0.5
numeric_categorical_threshold: None
email_inference_regex: (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
url_inference_regex: (http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
ipv4_inference_regex: (^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)
ipv6_inference_regex: (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
phone_inference_regex: (?:\+?(0{2})?1[-.\s●]?)?\(?([2-9][0-9]{2})\)?[-\.\s●]?([2-9][0-9]{2})[-\.\s●]?([0-9]{4})$
postal_code_inference_regex: ^[0-9]{5}(?:-[0-9]{4})?$
nan_values: ['', ' ', None, nan, NaT, 'None', 'NONE', 'none', 'NULL', 'Null', 'null', 'NAN', 'NaN', 'Nan', 'nan', 'NA', 'na', 'N/A', 'n/a', 'n/A', 'N/a', '<NA>', '<N/A>', '<n/a>', '<na>']
frequence_inference_window_length: 15
frequence_inference_threshold: 0.9
correlation_metrics: ['mutual_info', 'pearson', 'spearman', 'max', 'all']
medcouple_threshold: 0.3
medcouple_sample_size: 10000
boolean_inference_strings: {frozenset({'no', 'yes'}), frozenset({'y', 'n'}), frozenset({'false', 'true'}), frozenset({'f', 't'})}
boolean_transform_mappings: {'yes': True, 'no': False, 'y': True, 'n': False, 'true': True, 'false': False, 't': True, 'f': False}
boolean_inference_ints: {}

As you can see from the output above, the value for the categorical_threshold config variable has been updated to 0.5.

Temporarily Updating Config Settings#

Settings can also be temporarily updated in the context of a with block by using ww.config.with_options:

[3]:
with ww.config.with_options(categorical_threshold=0.7):
    # Do something
    print("Temporary settings:\n")
    print(repr(ww.config), "\n")

print("Restored settings:\n")
print(repr(ww.config))
Temporary settings:

Woodwork Global Config Settings
-------------------------------
categorical_threshold: 0.7
numeric_categorical_threshold: None
email_inference_regex: (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
url_inference_regex: (http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
ipv4_inference_regex: (^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)
ipv6_inference_regex: (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
phone_inference_regex: (?:\+?(0{2})?1[-.\s●]?)?\(?([2-9][0-9]{2})\)?[-\.\s●]?([2-9][0-9]{2})[-\.\s●]?([0-9]{4})$
postal_code_inference_regex: ^[0-9]{5}(?:-[0-9]{4})?$
nan_values: ['', ' ', None, nan, NaT, 'None', 'NONE', 'none', 'NULL', 'Null', 'null', 'NAN', 'NaN', 'Nan', 'nan', 'NA', 'na', 'N/A', 'n/a', 'n/A', 'N/a', '<NA>', '<N/A>', '<n/a>', '<na>']
frequence_inference_window_length: 15
frequence_inference_threshold: 0.9
correlation_metrics: ['mutual_info', 'pearson', 'spearman', 'max', 'all']
medcouple_threshold: 0.3
medcouple_sample_size: 10000
boolean_inference_strings: {frozenset({'no', 'yes'}), frozenset({'y', 'n'}), frozenset({'false', 'true'}), frozenset({'f', 't'})}
boolean_transform_mappings: {'yes': True, 'no': False, 'y': True, 'n': False, 'true': True, 'false': False, 't': True, 'f': False}
boolean_inference_ints: {}

Restored settings:

Woodwork Global Config Settings
-------------------------------
categorical_threshold: 0.5
numeric_categorical_threshold: None
email_inference_regex: (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
url_inference_regex: (http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
ipv4_inference_regex: (^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)
ipv6_inference_regex: (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
phone_inference_regex: (?:\+?(0{2})?1[-.\s●]?)?\(?([2-9][0-9]{2})\)?[-\.\s●]?([2-9][0-9]{2})[-\.\s●]?([0-9]{4})$
postal_code_inference_regex: ^[0-9]{5}(?:-[0-9]{4})?$
nan_values: ['', ' ', None, nan, NaT, 'None', 'NONE', 'none', 'NULL', 'Null', 'null', 'NAN', 'NaN', 'Nan', 'nan', 'NA', 'na', 'N/A', 'n/a', 'n/A', 'N/a', '<NA>', '<N/A>', '<n/a>', '<na>']
frequence_inference_window_length: 15
frequence_inference_threshold: 0.9
correlation_metrics: ['mutual_info', 'pearson', 'spearman', 'max', 'all']
medcouple_threshold: 0.3
medcouple_sample_size: 10000
boolean_inference_strings: {frozenset({'no', 'yes'}), frozenset({'y', 'n'}), frozenset({'false', 'true'}), frozenset({'f', 't'})}
boolean_transform_mappings: {'yes': True, 'no': False, 'y': True, 'n': False, 'true': True, 'false': False, 't': True, 'f': False}
boolean_inference_ints: {}
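
The save-and-restore behavior shown above can be illustrated with a small standalone sketch. ToyConfig and its methods are hypothetical stand-ins written for this example; they are not Woodwork's implementation, and whether Woodwork restores settings when the block raises an exception is an assumption of this sketch rather than something this guide states.

```python
from contextlib import contextmanager


class ToyConfig:
    """Hypothetical stand-in for a global config object (not Woodwork's class)."""

    def __init__(self, defaults):
        self._data = dict(defaults)

    def get_option(self, key):
        return self._data[key]

    def set_option(self, key, value):
        if key not in self._data:
            raise KeyError(f"Unknown option: {key}")
        self._data[key] = value

    @contextmanager
    def with_options(self, **options):
        # Remember the current values before changing anything.
        previous = {key: self.get_option(key) for key in options}
        try:
            for key, value in options.items():
                self.set_option(key, value)
            yield
        finally:
            # Restore the earlier values, even if the block raised.
            for key, value in previous.items():
                self.set_option(key, value)


config = ToyConfig({"categorical_threshold": 0.2})
config.set_option("categorical_threshold", 0.5)

with config.with_options(categorical_threshold=0.7):
    inside = config.get_option("categorical_threshold")

after = config.get_option("categorical_threshold")
print(inside, after)  # 0.7 0.5
```

Note that the value restored is the one in effect before the with block (0.5 here), not the library default, which matches the "Restored settings" output above.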

Get Value for a Specific Config Variable#

To read the value of a single config variable, call the ww.config.get_option function with the name of that variable.

[4]:
ww.config.get_option("categorical_threshold")
[4]:
0.5

Resetting to Default Values#

Config variables can be reset to their default values using the ww.config.reset_option function, passing in the name of the variable to reset.

As an example, reset the categorical_threshold config variable to its default value.

[5]:
ww.config.reset_option("categorical_threshold")
ww.config
[5]:
Woodwork Global Config Settings
-------------------------------
categorical_threshold: 0.2
numeric_categorical_threshold: None
email_inference_regex: (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
url_inference_regex: (http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
ipv4_inference_regex: (^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)
ipv6_inference_regex: (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
phone_inference_regex: (?:\+?(0{2})?1[-.\s●]?)?\(?([2-9][0-9]{2})\)?[-\.\s●]?([2-9][0-9]{2})[-\.\s●]?([0-9]{4})$
postal_code_inference_regex: ^[0-9]{5}(?:-[0-9]{4})?$
nan_values: ['', ' ', None, nan, NaT, 'None', 'NONE', 'none', 'NULL', 'Null', 'null', 'NAN', 'NaN', 'Nan', 'nan', 'NA', 'na', 'N/A', 'n/a', 'n/A', 'N/a', '<NA>', '<N/A>', '<n/a>', '<na>']
frequence_inference_window_length: 15
frequence_inference_threshold: 0.9
correlation_metrics: ['mutual_info', 'pearson', 'spearman', 'max', 'all']
medcouple_threshold: 0.3
medcouple_sample_size: 10000
boolean_inference_strings: {frozenset({'no', 'yes'}), frozenset({'y', 'n'}), frozenset({'false', 'true'}), frozenset({'f', 't'})}
boolean_transform_mappings: {'yes': True, 'no': False, 'y': True, 'n': False, 'true': True, 'false': False, 't': True, 'f': False}
boolean_inference_ints: {}

Available Config Settings#

This section provides an overview of the current config options that can be set within Woodwork.

Categorical Threshold#

The categorical_threshold config variable helps control the distinction between Categorical and other logical types during type inference. More specifically, this threshold represents the maximum acceptable ratio of unique value count to total value count (excluding NaN values from both counts) in a series for that series to be inferred as categorical. In other words, if the values in a series are fully accounted for by a relatively small collection of unique values, then the series is categorical. The categorical_threshold config variable defaults to 0.2. This means that, by default, a series whose unique value count is at most 20% of its total value count can be inferred as categorical.
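
The ratio described above can be worked through by hand. The following is a minimal sketch in plain Python; the helper names unique_ratio and passes_categorical_threshold are hypothetical, and the code only illustrates the arithmetic, not Woodwork's actual inference routine (which may apply further checks).

```python
import math


def unique_ratio(values):
    """Ratio of unique non-missing values to total non-missing values."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not present:
        return None
    return len(set(present)) / len(present)


def passes_categorical_threshold(values, threshold=0.2):
    """True if the unique ratio is at or below the threshold."""
    ratio = unique_ratio(values)
    return ratio is not None and ratio <= threshold


# 10 non-missing values drawn from 2 unique values: ratio 2 / 10 = 0.2
colors = ["red", "blue", "red", "red", "blue", None,
          "red", "blue", "red", "blue", "red"]
# 10 values, all distinct: ratio 10 / 10 = 1.0
ids = [f"id_{i}" for i in range(10)]

print(unique_ratio(colors))                  # 0.2
print(passes_categorical_threshold(colors))  # True
print(passes_categorical_threshold(ids))     # False
```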

Numeric Categorical Threshold#

Woodwork provides the option to infer numeric columns as the Categorical logical type if they have few enough unique values. The numeric_categorical_threshold config variable controls this behavior. Its default value is None, meaning that by default numeric columns are never inferred as categorical. If the setting is given a float between 0 and 1, it behaves in the same manner as the categorical_threshold setting, except that it applies only to columns with a numeric dtype (float or integer).
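
As a rough illustration of the None default, the sketch below treats None as "disabled" and otherwise applies the same unique-ratio test. The function name numeric_passes_threshold is hypothetical and this is not Woodwork's code.

```python
def numeric_passes_threshold(values, threshold=None):
    """Apply a unique-ratio test to numeric values; None disables the test."""
    if threshold is None:
        return False  # default: numeric data is never treated as categorical
    present = [v for v in values if v is not None]
    if not present:
        return False
    return len(set(present)) / len(present) <= threshold


# 20 ratings drawn from 3 unique values: ratio 3 / 20 = 0.15
ratings = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1] * 2

print(numeric_passes_threshold(ratings))                 # False
print(numeric_passes_threshold(ratings, threshold=0.2))  # True
```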

Email Inference Regex#

Woodwork provides the option to infer string columns as the EmailAddress logical type if a representative sample of valid (non-missing) rows all match a given regular expression. The email_inference_regex config variable allows users to set the regular expression that is used during this matching process. The default regex is r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)" (taken from https://emailregex.com/).
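
To get a feel for what the default pattern accepts, it can be tried directly with Python's re module. This is only a sketch: the helper all_match is hypothetical, and how Woodwork samples rows and applies the pattern internally is not shown here.

```python
import re

# Default pattern, copied from the config output above.
EMAIL_REGEX = r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"


def all_match(values, pattern):
    """True if every string in values matches the pattern."""
    compiled = re.compile(pattern)
    return all(compiled.match(v) is not None for v in values)


emails = ["ada@example.com", "grace.hopper@navy.mil"]
mixed = ["ada@example.com", "not an email"]

print(all_match(emails, EMAIL_REGEX))  # True
print(all_match(mixed, EMAIL_REGEX))   # False
```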

URL Inference Regex#

Woodwork provides the option to infer string columns as the URL logical type if a representative sample of valid (non-missing) rows all match a given regular expression. The url_inference_regex config variable allows users to set the regular expression that is used during this matching process. The default regex is r"(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)" (taken from https://urlregex.com/).
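
The default URL pattern can likewise be tried with Python's re module. The helper looks_like_url is hypothetical and the use of re.match is an assumption of this sketch, not a statement about Woodwork's internals.

```python
import re

# Default pattern, copied from the config output above.
URL_REGEX = (
    r"(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]"
    r"|(?:%[0-9a-fA-F][0-9a-fA-F]))+)"
)


def looks_like_url(value):
    """True if the string starts with something the pattern accepts."""
    return re.match(URL_REGEX, value) is not None


print(looks_like_url("https://example.com/path?x=1"))  # True
print(looks_like_url("ftp://example.com"))             # False
print(looks_like_url("example.com"))                   # False
```

Unlike the email pattern, this pattern has no ^ or $ anchors, so with re.match it only requires the string to begin with an http:// or https:// URL.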

IP Address Inference Regex#

Woodwork provides the option to infer string columns as the IPAddress logical type if a representative sample of valid (non-missing) rows all match a given regular expression. The ipv4_inference_regex and ipv6_inference_regex config variables allow users to set the regular expressions that are used during this matching process. The default for ipv4_inference_regex is taken from https://ipregex.com/ and the default for ipv6_inference_regex is taken from https://ihateregex.io/expr/ipv6/.
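
As a quick check of the default IPv4 pattern, the sketch below applies it with Python's re module. The helper is_ipv4 is hypothetical; the IPv6 pattern from the config output could be tested the same way.

```python
import re

# Default IPv4 pattern, copied from the config output above.
IPV4_REGEX = (
    r"(^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}"
    r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)"
)


def is_ipv4(value):
    """True if the whole string is a dotted-quad IPv4 address."""
    return re.match(IPV4_REGEX, value) is not None


print(is_ipv4("192.168.0.1"))  # True
print(is_ipv4("256.1.1.1"))    # False (256 is out of range for an octet)
print(is_ipv4("10.0.0"))       # False (only three octets)
```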

Frequency Inference Window Length#

Woodwork provides the option to infer frequency on columns which have temporal logical types. The frequence_inference_window_length config variable sets the length of the sliding window that is used for inference. The window needs to be long enough to capture the pattern of a frequency such as business days (“B”). The default value is 15.
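
To make the idea of a sliding window concrete, here is a generic sketch that splits a list of timestamps into overlapping windows of a fixed length. The function sliding_windows is hypothetical, and the way Woodwork actually constructs and evaluates its windows is not described in this guide, so treat this purely as an illustration of the term.

```python
from datetime import datetime, timedelta


def sliding_windows(values, window_length=15):
    """Return every run of window_length consecutive values."""
    last_start = len(values) - window_length
    return [values[i:i + window_length] for i in range(last_start + 1)]


start = datetime(2023, 1, 1)
timestamps = [start + timedelta(hours=i) for i in range(20)]

windows = sliding_windows(timestamps)
print(len(windows))     # 6, because 20 - 15 + 1 = 6
print(len(windows[0]))  # 15
```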

Frequency Inference Threshold#

Woodwork provides the option to infer frequency on columns which have temporal logical types. The frequence_inference_threshold config variable sets the fraction of windows, out of all windows, that must satisfy a given frequency for that frequency to be inferred. For example, if 91 of 100 windows adhere to the frequency “H”, the fraction is 0.91, so we can confidently assume the correct frequency for this data is “H”. The default value is 0.9.
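
The arithmetic in the example above can be sketched as follows. The function meets_frequency_threshold is hypothetical, and treating the threshold as inclusive (>=) is an assumption of this sketch.

```python
def meets_frequency_threshold(matching_windows, total_windows, threshold=0.9):
    """True if the share of matching windows reaches the threshold."""
    if total_windows == 0:
        return False
    return matching_windows / total_windows >= threshold


print(meets_frequency_threshold(91, 100))  # True  (0.91 >= 0.9)
print(meets_frequency_threshold(85, 100))  # False (0.85 < 0.9)
```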