Calculates statistics for data contained in the DataFrame.
include (list[str or LogicalType], optional) – Filter for which columns to include in the
statistics returned. Can be a list of column names, semantic tags, logical types, or any
combination of the three. When specifications overlap, the broadest one applies: logical types
take precedence over semantic tags, and semantic tags over column names. If no matching columns
are found, an empty DataFrame will be returned.
callback (callable, optional) – Function to be called with incremental updates. Has the following parameters:
update (int): change in progress since last call
progress (int): the progress so far in the calculations
total (int): the total number of calculations to do
unit (str): unit of measurement for progress/total
time_elapsed (float): total time in seconds elapsed since start of call
extra_stats (bool) – If True, will calculate a histogram for numeric columns, top values
for categorical columns, and value counts for the most recent values in datetime columns. Will also
calculate value counts within the range of values present for integer columns if the range of
values present is less than or equal to the number of bins used to compute the histogram.
Output can be controlled by the bins, top_x and recent_x parameters.
bins (int) – Number of bins to use when calculating histogram for numeric columns. Defaults to 10.
Will be ignored unless extra_stats=True.
top_x (int) – Number of items to return when getting the most frequently occurring values for categorical
columns. Defaults to 10. Will be ignored unless extra_stats=True.
recent_x (int) – Number of values to return when calculating value counts for the most recent dates in
datetime columns. Defaults to 10. Will be ignored unless extra_stats=True.
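The callback parameters listed above can be illustrated with a small progress logger. This is a minimal sketch, not part of the library: the helper name make_progress_logger and the simulated calls at the end are hypothetical, and only the five parameter names are taken from the description above.

```python
def make_progress_logger():
    """Return a callback plus the list it records updates into (hypothetical helper)."""
    history = []

    def callback(update, progress, total, unit, time_elapsed):
        # Parameter names and order follow the callback description above.
        percent = 100.0 * progress / total if total else 0.0
        history.append((update, progress, total, unit, time_elapsed))
        print(f"{progress}/{total} {unit} ({percent:.0f}%) after {time_elapsed:.2f}s")

    return callback, history


# Simulated invocations with made-up values; in real use the callback would be
# passed through the callback parameter and invoked during the calculation.
callback, history = make_progress_logger()
callback(2, 2, 4, "calculations", 0.01)
callback(2, 4, 4, "calculations", 0.02)
```

Keeping the recorded updates in a list makes the callback easy to inspect afterwards, for example to confirm that the final progress value matches total.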
A dictionary with a key for each column in the data, or for each column
matching the logical types, semantic tags, or column names specified in include. Each key
is paired with a dictionary of relevant statistics for that column.
Dict[str -> dict]
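Because the return value is a dictionary of per-column dictionaries, it can be traversed with ordinary dict operations. The sketch below uses a hand-written sample in that shape; the column names, the statistic keys nan_count and mean, and the helper columns_with_missing are illustrative assumptions rather than output produced by the method.

```python
def columns_with_missing(description):
    """Return the sorted names of columns whose statistics report missing values."""
    return sorted(
        name
        for name, stats in description.items()
        # "nan_count" is an assumed statistic key used only for illustration.
        if stats.get("nan_count", 0) > 0
    )


# Hand-written sample in the Dict[str -> dict] shape described above.
sample_description = {
    "age": {"nan_count": 1, "mean": 31.5},
    "signup_date": {"nan_count": 0},
}

missing = columns_with_missing(sample_description)
```

Using stats.get with a default keeps the helper tolerant of columns whose statistics dictionary lacks the assumed key.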