In this guide, you walk through an example where you create a Woodwork DataTable. Along the way, you learn how to update and remove logical types and semantic tags. You also learn how to use the typing information to select subsets of data.
Woodwork relies heavily on the concepts of physical types, logical types and semantic tags. These concepts are covered in detail in Understanding Types and Tags, but we provide brief definitions here for reference:
Physical Type: defines how the data is stored on disk or in memory.
Logical Type: defines how the data should be parsed or interpreted.
Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.
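To make the three terms concrete, here is a minimal plain-Python sketch that models the typing information for a single column. The `ColumnTyping` class and its field names are hypothetical and are not part of Woodwork; the values mirror the `total` column as it is displayed later in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnTyping:
    """Hypothetical container for one column's typing info (not a Woodwork class)."""
    name: str
    physical_type: str   # how the values are stored, for example a dtype name
    logical_type: str    # how the values should be parsed or interpreted
    semantic_tags: set = field(default_factory=set)  # extra meaning or intended use


# Values copied from the `total` column shown later in this guide.
total = ColumnTyping(
    name="total",
    physical_type="float64",
    logical_type="Double",
    semantic_tags={"numeric", "currency"},
)

print(total.physical_type, total.logical_type, sorted(total.semantic_tags))
```

The point of the sketch is only that the three pieces of information are separate: the same `float64` storage could carry a different logical type or a different set of tags.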
Start learning how to use Woodwork by creating a dataframe that contains retail sales data.
[1]:
import woodwork as ww

data = ww.demo.load_retail(nrows=100, return_dataframe=True)
data.head(5)
As you can see, this is a dataframe containing several different data types, including dates, categorical values, numeric values, and natural language descriptions. Next, use Woodwork to create a DataTable from this data.
Creating a Woodwork DataTable is as simple as passing in a dataframe with the data of interest during initialization. An optional name parameter can be specified to label the DataTable.
[2]:
dt = ww.DataTable(data, name="retail")
dt
With this single call, Woodwork inferred the logical types present in the data by analyzing the dataframe dtypes as well as the information contained in the columns. Woodwork also added semantic tags to some of the columns based on the inferred logical types.
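The idea of inferring a logical type from both the storage type and the values, then attaching tags based on the result, can be illustrated with a toy sketch in plain Python. The rules, the `categorical_ratio` threshold, the `Unknown` fallback, and the sample values for `quantity` and `description` are all assumptions made for illustration; this is not Woodwork's inference logic.

```python
# Assumed mapping from logical type to automatically added tags, based on the
# tags that appear in this guide's output ('numeric', 'category').
STANDARD_TAGS = {
    "Integer": {"numeric"},
    "Double": {"numeric"},
    "Categorical": {"category"},
}


def infer_logical_type(values, categorical_ratio=0.5):
    """Toy inference: check the Python types first, then inspect the values."""
    if not values:
        return "Unknown"
    if all(isinstance(v, bool) for v in values):
        return "Boolean"
    if all(isinstance(v, int) for v in values):
        return "Integer"
    if all(isinstance(v, (int, float)) for v in values):
        return "Double"
    if all(isinstance(v, str) for v in values):
        # Few distinct strings relative to the column length suggests categories.
        if len(set(values)) / len(values) <= categorical_ratio:
            return "Categorical"
        return "NaturalLanguage"
    return "Unknown"


# Made-up sample columns loosely shaped like the retail data.
columns = {
    "quantity": [6, 6, 8, 6, 6],
    "total": [25.245, 33.561, 36.3, 33.561, 33.561],
    "country": ["United Kingdom"] * 5,
    "description": ["white lantern", "cream cupid", "red bottle", "blue teacup", "glass star"],
}

inferred = {name: infer_logical_type(vals) for name, vals in columns.items()}
tags = {name: set(STANDARD_TAGS.get(lt, set())) for name, lt in inferred.items()}

for name in columns:
    print(name, inferred[name], sorted(tags[name]))
```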
You can also view the typing information along with the first few rows of data.
[3]:
dt.head()
If the initial inference is not what you want, you can change the logical type to a more appropriate value. To illustrate this process, set the logical type for the quantity, customer_name, and country columns to Categorical.
[4]:
dt = dt.set_types(logical_types={
    'quantity': 'Categorical',
    'customer_name': 'Categorical',
    'country': 'Categorical'
})
dt
Inspect the information in the types output. There, you can see that the Logical type for the three columns has been updated with the Categorical logical type you specified.
Now that you’ve prepared logical types, you can select a subset of the columns based on their logical types. Select only the columns that have a logical type of Integer or Double.
[5]:
numeric_dt = dt.select(['Integer', 'Double'])
numeric_dt
This selection process has returned a new DataTable containing only the columns that match the logical types you specified. After you have selected the columns you want, you can also access a dataframe containing just those columns if you need it for additional analysis.
[6]:
numeric_dt.to_dataframe()
100 rows × 4 columns
Note
Accessing the dataframe associated with a DataTable by using dt.to_dataframe() returns a reference to the dataframe. Modifications to the returned dataframe can cause unexpected results. If you need to modify the dataframe, you should use dt.to_dataframe().copy() to return a copy of the stored dataframe that can be safely modified without impacting the DataTable behavior.
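The difference between working on a reference and working on a copy can be shown without any dataframe library. In this sketch, `ToyTable` is a hypothetical stand-in for a table wrapper and a dict of lists stands in for the dataframe; neither is Woodwork code, and `copy.deepcopy` plays the role that `.copy()` plays in the note above.

```python
import copy


class ToyTable:
    """Hypothetical wrapper that hands out its stored data by reference."""

    def __init__(self, data):
        self._data = data

    def to_dataframe(self):
        # Returns the stored object itself, not a copy.
        return self._data


table = ToyTable({"total": [25.245, 33.561, 36.3]})

# Editing a copy leaves the data held by the table untouched.
safe = copy.deepcopy(table.to_dataframe())
safe["total"][0] = 0.0
unchanged_after_copy = table.to_dataframe()["total"][0]

# Editing the returned reference silently changes the table's own data.
ref = table.to_dataframe()
ref["total"][0] = 0.0
changed_after_ref = table.to_dataframe()["total"][0]

print(unchanged_after_copy, changed_after_ref)
```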
Next, let’s add semantic tags to some of the columns. Add the tag of product_details to the description column, and tag the total column with currency.
[7]:
dt = dt.set_types(semantic_tags={'description': 'product_details', 'total': 'currency'})
dt
Select columns based on a semantic tag. Only select the columns tagged with category.
[8]:
category_dt = dt.select('category')
category_dt
Select columns using multiple semantic tags or a mixture of semantic tags and logical types.
[9]:
category_numeric_dt = dt.select(['numeric', 'category'])
category_numeric_dt
[10]:
mixed_dt = dt.select(['Boolean', 'product_details'])
mixed_dt
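One way to picture selection with a mixture of logical types and semantic tags is as a simple membership test: a column is kept when its logical type, or any one of its tags, appears in the requested list. The sketch below models that rule in plain Python. The `select` function here is a hypothetical helper, not Woodwork's implementation, and the typing assigned to each column is assumed for illustration.

```python
def select(columns, include):
    """Toy selection: keep a column if its logical type or any tag is requested."""
    if isinstance(include, str):
        include = [include]
    wanted = set(include)
    return [
        name
        for name, (logical_type, tags) in columns.items()
        if logical_type in wanted or bool(tags & wanted)
    ]


# column name -> (logical type, semantic tags); assumed values for illustration
columns = {
    "quantity": ("Categorical", {"category"}),
    "description": ("NaturalLanguage", {"product_details"}),
    "unit_price": ("Double", {"numeric"}),
    "total": ("Double", {"numeric", "currency"}),
    "cancelled": ("Boolean", set()),
}

print(select(columns, "category"))
print(select(columns, ["numeric", "category"]))
print(select(columns, ["Boolean", "product_details"]))
```

Treating logical type names and tag names as one pool of selectors keeps the call simple, at the cost of requiring that a tag never shares a name with a logical type.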
To select an individual column, specify the column name. You can then get access to the data in the DataColumn using the to_series method.
[11]:
dc = dt['total']
dc
<DataColumn: total (Physical Type = float64) (Logical Type = Double) (Semantic Tags = {'numeric', 'currency'})>
[12]:
dc.to_series()
0     25.245
1     33.561
2     36.300
3     33.561
4     33.561
       ...
95    25.245
96    83.160
97    21.780
98    21.780
99    21.780
Name: total, Length: 100, dtype: float64
Access multiple columns by supplying a list of column names.
[13]:
multiple_cols_dt = dt[['product_id', 'total', 'unit_price']]
multiple_cols_dt
Remove specific semantic tags from a column if they are no longer needed. In this example, remove the product_details tag from the description column.
[14]:
dt = dt.remove_semantic_tags({'description': 'product_details'})
dt
Notice how the product_details tag has been removed from the description column. If you want to remove all user-added semantic tags from all columns, you can do that, too.
[15]:
dt = dt.reset_semantic_tags()
dt
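The distinction between removing one tag and resetting all user-added tags can be sketched in plain Python. This toy model assumes that each logical type contributes a fixed set of standard tags and that a reset returns every column to that set. The helper functions and the `STANDARD_TAGS` mapping are hypothetical and only mirror the behavior described above; they are not Woodwork's code.

```python
# Assumed standard tags per logical type, based on the tags shown in this guide.
STANDARD_TAGS = {
    "Integer": {"numeric"},
    "Double": {"numeric"},
    "Categorical": {"category"},
}


def remove_semantic_tags(tags_by_column, to_remove):
    """Return a new mapping with the given tags removed from the given columns."""
    updated = {col: set(tags) for col, tags in tags_by_column.items()}
    for col, tag in to_remove.items():
        updated[col] -= {tag} if isinstance(tag, str) else set(tag)
    return updated


def reset_semantic_tags(logical_types):
    """Return a mapping that holds only the standard tags for each column."""
    return {col: set(STANDARD_TAGS.get(lt, set())) for col, lt in logical_types.items()}


logical_types = {"description": "NaturalLanguage", "total": "Double"}
tags = {"description": {"product_details"}, "total": {"numeric", "currency"}}

after_remove = remove_semantic_tags(tags, {"description": "product_details"})
after_reset = reset_semantic_tags(logical_types)

print(after_remove)
print(after_reset)
```

Both helpers return a new mapping instead of mutating their input, which matches the reassignment pattern (`dt = dt.remove_semantic_tags(...)`) used in the cells above.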
At any point, you can designate certain columns as the DataTable’s index or time_index with the methods set_index and set_time_index. These methods can be used to assign these columns for the first time or to change the column being used as the index or time index.
Index and time index columns contain index and time_index semantic tags, respectively.
[16]:
dt = dt.set_index('order_product_id')
dt.index
'order_product_id'
[17]:
dt = dt.set_time_index('order_date')
dt.time_index
'order_date'
[18]:
dt
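Because the index and time index are recorded as semantic tags, assigning or changing them can be thought of as moving a tag from one column to another. The sketch below models only that tag movement in plain Python and ignores any other effects that setting an index may have. The `set_tagged_column` helper is hypothetical, and the starting tags are assumed for illustration.

```python
def set_tagged_column(tags_by_column, column, tag):
    """Return a new mapping in which `column` is the only one carrying `tag`."""
    if column not in tags_by_column:
        raise KeyError(column)
    # Drop the tag everywhere first so that a previous index column loses it.
    updated = {col: set(tags) - {tag} for col, tags in tags_by_column.items()}
    updated[column].add(tag)
    return updated


tags = {
    "order_product_id": set(),
    "order_id": set(),
    "order_date": set(),
    "total": {"numeric"},
}

tags = set_tagged_column(tags, "order_product_id", "index")
tags = set_tagged_column(tags, "order_date", "time_index")

# Changing the index later moves the tag to the new column.
moved = set_tagged_column(tags, "order_id", "index")

print(tags)
print(moved)
```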
Retrieve all the logical types available in Woodwork. The listing is useful for understanding each logical type and how it is interpreted.
[19]:
from woodwork.type_sys.utils import list_logical_types

list_logical_types()