woodwork.utils.read_file
- woodwork.utils.read_file(filepath=None, content_type=None, name=None, index=None, time_index=None, semantic_tags=None, logical_types=None, use_standard_tags=True, column_origins=None, replace_nan=False, validate=True, **kwargs)
Read data from the specified file and return a DataFrame with initialized Woodwork typing information.
- Note:
Because the fastparquet engine cannot handle nullable pandas dtypes, pyarrow is used when reading Parquet and Arrow files.
- Parameters:
filepath (str) – A valid string path to the file to read.
content_type (str) – Content type of the file to read.
name (str, optional) – Name used to identify the DataFrame.
index (str, optional) – Name of the index column.
time_index (str, optional) – Name of the time index column.
semantic_tags (dict, optional) – Dictionary mapping column names in the dataframe to the semantic tags for the column. The keys in the dictionary should be strings that correspond to columns in the underlying dataframe. Each value can take one of two forms: a single string (str) if only one semantic tag is being set, or a list or set of strings (list[str] or set[str]) if multiple tags are being set. Semantic tags will be set to an empty set for any column not included in the dictionary.
logical_types (dict[str -> LogicalType], optional) – Dictionary mapping column names in the dataframe to the LogicalType for the column. LogicalTypes will be inferred for any columns not present in the dictionary.
use_standard_tags (bool, optional) – If True, will add standard semantic tags to columns based on the inferred or specified logical type for the column. Defaults to True.
column_origins (str or dict[str -> str], optional) – Origin of each column. If a string is supplied, it is used as the origin for all columns. A dictionary can be used to set origins for individual columns.
replace_nan (bool, optional) – Whether to replace empty string values and string representations of NaN values (“nan”, “<NA>”) with np.nan or pd.NA values based on column dtype. Defaults to False.
validate (bool, optional) – Whether parameter and data validation should occur. Defaults to True. Warning: Should be set to False only when parameters and data are known to be valid. Any errors resulting from skipping validation with invalid inputs may not be easily understood.
**kwargs – Additional keyword arguments to pass to the underlying pandas read file function. For more information on available keywords refer to the pandas documentation.
- Returns:
DataFrame created from the specified file with Woodwork typing information initialized.
- Return type:
pd.DataFrame