Global Configuration Options#
Woodwork provides global configuration options that control certain aspects of its behavior. This guide provides an overview of working with those options, including viewing the current settings and updating the config values.
Viewing Config Settings#
After you’ve imported Woodwork, you can view the current configuration options with ww.config, as shown below.
[1]:
import woodwork as ww
ww.config
[1]:
Woodwork Global Config Settings
-------------------------------
categorical_threshold: 0.2
numeric_categorical_threshold: None
email_inference_regex: (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
url_inference_regex: (http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
ipv4_inference_regex: (^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)
ipv6_inference_regex: (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
phone_inference_regex: (?:\+?(0{2})?1[-.\s●]?)?\(?([2-9][0-9]{2})\)?[-\.\s●]?([2-9][0-9]{2})[-\.\s●]?([0-9]{4})$
postal_code_inference_regex: ^[0-9]{5}(?:-[0-9]{4})?$
nan_values: ['', ' ', None, nan, NaT, 'None', 'NONE', 'none', 'NULL', 'Null', 'null', 'NAN', 'NaN', 'Nan', 'nan', 'NA', 'na', 'N/A', 'n/a', 'n/A', 'N/a', '<NA>', '<N/A>', '<n/a>', '<na>']
frequence_inference_window_length: 15
frequence_inference_threshold: 0.9
correlation_metrics: ['mutual_info', 'pearson', 'spearman', 'max', 'all']
medcouple_threshold: 0.3
medcouple_sample_size: 10000
boolean_inference_strings: {frozenset({'no', 'yes'}), frozenset({'y', 'n'}), frozenset({'false', 'true'}), frozenset({'f', 't'})}
boolean_transform_mappings: {'yes': True, 'no': False, 'y': True, 'n': False, 'true': True, 'false': False, 't': True, 'f': False}
boolean_inference_ints: {}
The output of ww.config lists each of the available config variables followed by its current setting. For example, the output above shows that categorical_threshold is set to 0.2 and numeric_categorical_threshold is set to None.
Updating Config Settings#
To update a config variable, call the ww.config.set_option function. This function requires two arguments: the name of the config variable to update and the new value to set.
As an example, update the categorical_threshold config variable to have a value of 0.5 instead of the default value.
[2]:
ww.config.set_option("categorical_threshold", 0.5)
ww.config
[2]:
Woodwork Global Config Settings
-------------------------------
categorical_threshold: 0.5
numeric_categorical_threshold: None
email_inference_regex: (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
url_inference_regex: (http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
ipv4_inference_regex: (^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)
ipv6_inference_regex: (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
phone_inference_regex: (?:\+?(0{2})?1[-.\s●]?)?\(?([2-9][0-9]{2})\)?[-\.\s●]?([2-9][0-9]{2})[-\.\s●]?([0-9]{4})$
postal_code_inference_regex: ^[0-9]{5}(?:-[0-9]{4})?$
nan_values: ['', ' ', None, nan, NaT, 'None', 'NONE', 'none', 'NULL', 'Null', 'null', 'NAN', 'NaN', 'Nan', 'nan', 'NA', 'na', 'N/A', 'n/a', 'n/A', 'N/a', '<NA>', '<N/A>', '<n/a>', '<na>']
frequence_inference_window_length: 15
frequence_inference_threshold: 0.9
correlation_metrics: ['mutual_info', 'pearson', 'spearman', 'max', 'all']
medcouple_threshold: 0.3
medcouple_sample_size: 10000
boolean_inference_strings: {frozenset({'no', 'yes'}), frozenset({'y', 'n'}), frozenset({'false', 'true'}), frozenset({'f', 't'})}
boolean_transform_mappings: {'yes': True, 'no': False, 'y': True, 'n': False, 'true': True, 'false': False, 't': True, 'f': False}
boolean_inference_ints: {}
As you can see from the output above, the value for the categorical_threshold config variable has been updated to 0.5.
Temporarily Updating Config Settings#
Settings can also be temporarily updated within a with block by using ww.config.with_options:
[3]:
with ww.config.with_options(categorical_threshold=0.7):
    # The override applies only inside this block
    print("Temporary settings:\n")
    print(repr(ww.config), "\n")

# Outside the block, the previous value is back in effect
print("Restored settings:\n")
print(repr(ww.config))
Temporary settings:
Woodwork Global Config Settings
-------------------------------
categorical_threshold: 0.7
numeric_categorical_threshold: None
email_inference_regex: (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
url_inference_regex: (http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
ipv4_inference_regex: (^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)
ipv6_inference_regex: (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
phone_inference_regex: (?:\+?(0{2})?1[-.\s●]?)?\(?([2-9][0-9]{2})\)?[-\.\s●]?([2-9][0-9]{2})[-\.\s●]?([0-9]{4})$
postal_code_inference_regex: ^[0-9]{5}(?:-[0-9]{4})?$
nan_values: ['', ' ', None, nan, NaT, 'None', 'NONE', 'none', 'NULL', 'Null', 'null', 'NAN', 'NaN', 'Nan', 'nan', 'NA', 'na', 'N/A', 'n/a', 'n/A', 'N/a', '<NA>', '<N/A>', '<n/a>', '<na>']
frequence_inference_window_length: 15
frequence_inference_threshold: 0.9
correlation_metrics: ['mutual_info', 'pearson', 'spearman', 'max', 'all']
medcouple_threshold: 0.3
medcouple_sample_size: 10000
boolean_inference_strings: {frozenset({'no', 'yes'}), frozenset({'y', 'n'}), frozenset({'false', 'true'}), frozenset({'f', 't'})}
boolean_transform_mappings: {'yes': True, 'no': False, 'y': True, 'n': False, 'true': True, 'false': False, 't': True, 'f': False}
boolean_inference_ints: {}
Restored settings:
Woodwork Global Config Settings
-------------------------------
categorical_threshold: 0.5
numeric_categorical_threshold: None
email_inference_regex: (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
url_inference_regex: (http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
ipv4_inference_regex: (^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)
ipv6_inference_regex: (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
phone_inference_regex: (?:\+?(0{2})?1[-.\s●]?)?\(?([2-9][0-9]{2})\)?[-\.\s●]?([2-9][0-9]{2})[-\.\s●]?([0-9]{4})$
postal_code_inference_regex: ^[0-9]{5}(?:-[0-9]{4})?$
nan_values: ['', ' ', None, nan, NaT, 'None', 'NONE', 'none', 'NULL', 'Null', 'null', 'NAN', 'NaN', 'Nan', 'nan', 'NA', 'na', 'N/A', 'n/a', 'n/A', 'N/a', '<NA>', '<N/A>', '<n/a>', '<na>']
frequence_inference_window_length: 15
frequence_inference_threshold: 0.9
correlation_metrics: ['mutual_info', 'pearson', 'spearman', 'max', 'all']
medcouple_threshold: 0.3
medcouple_sample_size: 10000
boolean_inference_strings: {frozenset({'no', 'yes'}), frozenset({'y', 'n'}), frozenset({'false', 'true'}), frozenset({'f', 't'})}
boolean_transform_mappings: {'yes': True, 'no': False, 'y': True, 'n': False, 'true': True, 'false': False, 't': True, 'f': False}
boolean_inference_ints: {}
Get Value for a Specific Config Variable#
To read the value that is set for a specific config variable, call the ww.config.get_option function, passing in the name of the config variable for which you want the value.
[4]:
ww.config.get_option("categorical_threshold")
[4]:
0.5
Resetting to Default Values#
Config variables can be reset to their default values using the ww.config.reset_option function, passing in the name of the variable to reset.
As an example, reset the categorical_threshold config variable to its default value.
[5]:
ww.config.reset_option("categorical_threshold")
ww.config
[5]:
Woodwork Global Config Settings
-------------------------------
categorical_threshold: 0.2
numeric_categorical_threshold: None
email_inference_regex: (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
url_inference_regex: (http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
ipv4_inference_regex: (^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)
ipv6_inference_regex: (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
phone_inference_regex: (?:\+?(0{2})?1[-.\s●]?)?\(?([2-9][0-9]{2})\)?[-\.\s●]?([2-9][0-9]{2})[-\.\s●]?([0-9]{4})$
postal_code_inference_regex: ^[0-9]{5}(?:-[0-9]{4})?$
nan_values: ['', ' ', None, nan, NaT, 'None', 'NONE', 'none', 'NULL', 'Null', 'null', 'NAN', 'NaN', 'Nan', 'nan', 'NA', 'na', 'N/A', 'n/a', 'n/A', 'N/a', '<NA>', '<N/A>', '<n/a>', '<na>']
frequence_inference_window_length: 15
frequence_inference_threshold: 0.9
correlation_metrics: ['mutual_info', 'pearson', 'spearman', 'max', 'all']
medcouple_threshold: 0.3
medcouple_sample_size: 10000
boolean_inference_strings: {frozenset({'no', 'yes'}), frozenset({'y', 'n'}), frozenset({'false', 'true'}), frozenset({'f', 't'})}
boolean_transform_mappings: {'yes': True, 'no': False, 'y': True, 'n': False, 'true': True, 'false': False, 't': True, 'f': False}
boolean_inference_ints: {}
Available Config Settings#
This section provides an overview of the current config options that can be set within Woodwork.
Categorical Threshold#
The categorical_threshold config variable helps control the distinction between Categorical and other logical types during type inference. More specifically, this threshold represents the maximum acceptable ratio of unique value count to total value count (excluding nan values from either count) in a series for that series to be inferred as categorical. In other words, if the values in a series are fully accounted for by a relatively small collection of unique values, then the series is categorical. The categorical_threshold config variable defaults to 0.2. This indicates that, by default, a series for which the unique value count is at most 20% of the total value count could be inferred as categorical.
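The ratio described above can be worked through by hand. The following is a minimal sketch in plain Python, not Woodwork's inference code; the helper name `unique_ratio` and the sample data are made up for illustration, and treating `None` and float NaN as the only missing values is a simplifying assumption.

```python
import math


def unique_ratio(values):
    """Ratio of distinct non-missing values to all non-missing values.

    Hypothetical helper for illustration; None and NaN count as missing.
    """
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not present:
        return None
    return len(set(present)) / len(present)


threshold = 0.2  # the default categorical_threshold

# 10 non-missing values, 2 distinct: ratio 0.2, within the threshold.
colors = ["red", "blue", "red", "red", "blue", None,
          "red", "blue", "red", "blue", "red"]
# 5 non-missing values, 5 distinct: ratio 1.0, above the threshold.
ids = ["a", "b", "c", "d", "e"]

print(unique_ratio(colors) <= threshold)  # True
print(unique_ratio(ids) <= threshold)     # False
```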
Numeric Categorical Threshold#
Woodwork provides the option to infer numeric columns as the Categorical logical type if they have few enough unique values. The numeric_categorical_threshold config variable controls this behavior. The default value for numeric_categorical_threshold is None, meaning that by default numeric columns are never inferred to be categorical. If the setting is given a float between 0 and 1 as a value, then it behaves in the same manner as the categorical_threshold setting except that it only applies to columns with a numeric dtype (float or integer).
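The effect of the None default can be sketched as a simple gate in front of the ratio check. This is an illustration under stated assumptions rather than Woodwork's code: the function name `numeric_column_is_categorical` and the sample data are hypothetical.

```python
def numeric_column_is_categorical(values, numeric_categorical_threshold=None):
    """Illustrative check only; not Woodwork's implementation."""
    # A threshold of None disables categorical inference for numeric data.
    if numeric_categorical_threshold is None:
        return False
    present = [v for v in values if v is not None]
    if not present:
        return False
    return len(set(present)) / len(present) <= numeric_categorical_threshold


# 20 values, 3 distinct: unique ratio 0.15.
ratings = [1, 2, 3, 1, 2] * 4

print(numeric_column_is_categorical(ratings))       # False (default None)
print(numeric_column_is_categorical(ratings, 0.2))  # True
print(numeric_column_is_categorical(ratings, 0.1))  # False
```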
Email Inference Regex#
Woodwork provides the option to infer string columns as the EmailAddress logical type if a representative sample of valid (non-missing) rows all match a given regular expression. The email_inference_regex config variable allows users to set the regular expression that is used during this matching process. The default regex is r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)" (taken from https://emailregex.com/).
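To get a feel for what the default pattern accepts, it can be tried directly with Python's `re` module. The sketch below only exercises the regular expression; the `all_match` helper is hypothetical, and how Woodwork samples rows and applies the pattern internally is not shown here.

```python
import re

# Default email_inference_regex, copied from the config output above.
EMAIL_REGEX = r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"


def all_match(values, pattern):
    """True if every non-missing value matches (hypothetical helper)."""
    present = [v for v in values if v is not None]
    return bool(present) and all(re.match(pattern, v) for v in present)


emails = ["user@example.com", None, "first.last+tag@mail.example.org"]
mixed = ["user@example.com", "not-an-email"]

print(all_match(emails, EMAIL_REGEX))  # True
print(all_match(mixed, EMAIL_REGEX))   # False
```

A single non-matching value is enough to make the check fail, which is why a stricter or looser custom pattern can change whether a column is inferred as EmailAddress.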
URL Inference Regex#
Woodwork provides the option to infer string columns as the URL logical type if a representative sample of valid (non-missing) rows all match a given regular expression. The url_inference_regex config variable allows users to set the regular expression that is used during this matching process. The default regex is r"(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)" (taken from https://urlregex.com/).
IP Address Inference Regex#
Woodwork provides the option to infer string columns as the IPAddress logical type if a representative sample of valid (non-missing) rows all match a given regular expression. The ipv4_inference_regex and ipv6_inference_regex config variables allow users to set the regular expressions that are used during this matching process. The default for ipv4_inference_regex is taken from https://ipregex.com/ and the default for ipv6_inference_regex is taken from https://ihateregex.io/expr/ipv6/.
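As a quick check of what the default IPv4 pattern enforces, the sketch below runs it against a few strings with Python's `re` module. Only the regular expression itself comes from the config output above; the variable names are illustrative, and the much longer IPv6 pattern is left out for brevity.

```python
import re

# Default ipv4_inference_regex, copied from the config output above.
IPV4_REGEX = (
    r"(^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}"
    r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$)"
)

candidates = ["192.168.0.1", "256.1.1.1", "10.0.0"]

# Each octet must be in the range 0-255, and exactly four are required.
results = {c: bool(re.match(IPV4_REGEX, c)) for c in candidates}
print(results)
```

In this sketch only the first candidate matches: the second has an octet above 255 and the third has just three octets.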
Frequency Inference Window Length#
Woodwork provides the option to infer frequency on columns which have temporal logical types. The frequence_inference_window_length config variable determines the length of the sliding window that is used for inference. The window needs to be long enough to capture the frequency being inferred, such as business days (“B”). The default value is 15.
Frequency Inference Threshold#
Woodwork provides the option to infer frequency on columns which have temporal logical types. The frequence_inference_threshold config variable determines the minimum fraction of windows, out of all windows, that must satisfy a given frequency for that frequency to be inferred. For example, if 91 windows of 100 adhere to the frequency “H”, then we can confidently assume the correct frequency for this data is “H”. The default value is 0.9.
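The interplay between window length and threshold can be illustrated with a deliberately simplified sketch. It is not Woodwork's frequency inference algorithm: the function `fraction_of_matching_windows` is hypothetical, and it treats a window as matching only when every consecutive gap equals a fixed step, which is an assumption made for this example.

```python
from datetime import datetime, timedelta


def fraction_of_matching_windows(timestamps, step, window_length):
    """Share of sliding windows whose consecutive gaps all equal `step`."""
    windows = [
        timestamps[i:i + window_length]
        for i in range(len(timestamps) - window_length + 1)
    ]
    if not windows:
        return 0.0
    matching = sum(
        all(b - a == step for a, b in zip(w, w[1:])) for w in windows
    )
    return matching / len(windows)


hour = timedelta(hours=1)
start = datetime(2023, 1, 1)

# 40 hourly observations with one irregular value at the very end.
late_glitch = [start + hour * i for i in range(40)]
late_glitch[39] += timedelta(minutes=30)

# The same series, but with the irregular value in the middle.
mid_glitch = [start + hour * i for i in range(40)]
mid_glitch[20] += timedelta(minutes=30)

threshold = 0.9  # default frequence_inference_threshold
window = 15      # default frequence_inference_window_length

late = fraction_of_matching_windows(late_glitch, hour, window)
mid = fraction_of_matching_windows(mid_glitch, hour, window)
print(late >= threshold)  # True: 25 of 26 windows match
print(mid >= threshold)   # False: 11 of 26 windows match
```

In this sketch a single irregular observation in the middle of the series breaks every window that contains it, so a longer window makes the check stricter and a lower threshold makes it more tolerant.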