Release Notes#

Future Release#

  • Enhancements

  • Fixes

  • Changes
  • Documentation Changes

  • Testing Changes

Thanks to the following people for contributing to this release: @thehomebrewnerd

v0.31.0 May 13, 2024#

  • Enhancements
    • Add support for Python 3.12 GH#1855

  • Changes
    • Drop support for using Woodwork with Dask or Pyspark dataframes GH#1857

    • Use filter arg in call to tarfile.extractall to safely deserialize DataFrames GH#1862

Thanks to the following people for contributing to this release: @thehomebrewnerd

Breaking Changes#

  • With this release, Woodwork can no longer be used with Dask or Pyspark dataframes. The behavior when using pandas dataframes remains unchanged.

v0.30.0 Apr 10, 2024#

Warning

Support for use with Dask and Pyspark dataframes is planned for removal in an upcoming release of Woodwork.

  • Changes
    • Temporarily restrict Dask version GH#1837

    • Updates for compatibility with Dask 2024.4.1 GH#1843

  • Testing Changes
    • Fix serialization test to work with pytest 8.1.1 GH#1837

Thanks to the following people for contributing to this release: @thehomebrewnerd

v0.29.0 Feb 26, 2024#

  • Changes
    • Remove numpy upper bound restriction in pyproject.toml GH#1819

    • Bump min version of python-dateutil for pandas 2.0 compatibility GH#1825

  • Testing Changes
    • Update release.yaml to use trusted publisher for PyPI releases GH#1819

    • Update latest dependency CI runs to include run with only core requirements GH#1822

Thanks to the following people for contributing to this release: @thehomebrewnerd

v0.28.0 Feb 5, 2024#

Warning

This release of Woodwork does not support Python 3.8.

  • Changes
    • Upgraded numpy to < 2.0.0 GH#1799

  • Documentation Changes
    • Added dask string storage note to “Other Limitations” in Dask documentation GH#1799

  • Testing Changes

Thanks to the following people for contributing to this release: @cp2boston, @gsheni, @tamargrey

v0.27.0 Dec 12, 2023#

  • Fixes
    • Removed warning due to deprecated infer_datetime_format argument in pandas (GH#1785)

    • Fix GitHub Actions to kick off EvalML and Featuretools unit tests (GH#1795)

  • Changes
    • Temporarily restrict pyarrow version due to serialization issues (GH#1768)

    • Update pandas categorical type call and replace black with ruff formatter (GH#1794)

  • Testing Changes
    • Removed old performance testing workflow (GH#1776)

Thanks to the following people for contributing to this release: @eccabay, @gsheni, @thehomebrewnerd, @petejanuszewski1

v0.26.0 Aug 22, 2023#

  • Enhancements
    • Optimized Boolean inference by removing generation of mappings and sets of boolean values (GH#1713)

    • Speed up Boolean and Integer inference by caching results of corresponding nullable type inference (GH#1733)

  • Fixes
    • Update s3 bucket for docs image (GH#1749)

  • Documentation Changes
    • Update readthedocs config to use build.os (GH#1753)

    • Fix PyPI badge not showing on README.md (GH#1755)

Thanks to the following people for contributing to this release: @gsheni, @sbadithe, @simha104

v0.25.1 Jul 18, 2023#

  • Fixes
    • Restrict numpy version to resolve boolean inference issue with v1.25.0 GH#1735

Thanks to the following people for contributing to this release: @thehomebrewnerd

v0.25.0 Jul 17, 2023#

  • Enhancements
    • Force datetime guesser input to be string GH#1724

    • Add support for pandas v2.0.0 GH#1729

  • Changes
    • Remove upper bound restriction on dask version GH#1729

  • Testing Changes
    • Remove autouse=True from latlong dataframe fixtures GH#1729

Thanks to the following people for contributing to this release: @christopherbunn, @thehomebrewnerd

v0.24.0 May 24, 2023#

  • Enhancements
    • Removed repeated sorting for numeric data in _get_describe_dict to improve performance (GH#1682)

    • Improved inference for URL, EmailAddress, and other logical types by defining new parent-child relationships (GH#1702)

    • Added an include_time_index argument when calculating dependence measures (GH#1698)

  • Changes
    • Stopped calculating top_values for Double columns with integer values (GH#1692)

  • Testing Changes
    • Add Python 3.11 markers, add 3.11 for unit tests & install test (GH#1678)

    • Run looking glass performance tests on merge via Airflow (GH#1695)

Thanks to the following people for contributing to this release: @bchen1116, @gsheni, @ParthivNaresh, @petejanuszewski1, @simha104, @tamargrey

v0.23.0 April 12, 2023#

  • Fixes
    • Updated Datetime format inference to include formats with two digit year dates along with timezones (GH#1666)

  • Changes
    • Updated add_type and remove_type to include a treatment argument (GH#1661)

    • Limit pandas <2.0.0 for core requirements (GH#1668)

    • Upgrade minimum dask to 2022.11.1 and minimum pandas to 1.4.3 (GH#1671)

Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @ParthivNaresh, @simha104

v0.22.0 March 13, 2023#

  • Enhancements
    • Improved inference for numeric logical types to handle incoming object dtype data (GH#1645)

    • Updated datetime format inference to handle years represented by 2 digits (GH#1632)

    • Updated dependence_dict to handle boolean columns (GH#1652)

  • Changes
    • Pin for jupyter-client to 7.4.9 for documentation (GH#1624)

    • Remove jupyter-client documentation requirement (GH#1627)

    • Separate Makefile command for core requirements, test requirements and dev requirements (GH#1658)

  • Testing Changes
    • Add ruff for linting and replace isort/flake8 (GH#1614)

    • Specify black and ruff config arguments (GH#1620)

    • Add codecov token for unit tests workflow (GH#1630)

    • Add GitHub Actions cache to speed up workflows (GH#1631)

    • Add pull request check for linked issues to CI workflow (GH#1633, GH#1636)

    • Run lint fix on latest dependency update pull requests (GH#1640, GH#1641)

Thanks to the following people for contributing to this release: @bchen1116, @gsheni, @ParthivNaresh

v0.21.2 January 11, 2023#

  • Changes
    • Bump scipy and scikit-learn min versions for compatibility with numpy 1.24.0 (GH#1606)

    • Add is_natural_language method to ColumnSchema object (GH#1610)

    • Changed the transform function for the Boolean logical type to improve runtime (GH#1612)

Thanks to the following people for contributing to this release: @ParthivNaresh, @sbadithe, @thehomebrewnerd

v0.21.1 December 16, 2022#

  • Fixes
    • Fix importlib DeprecationWarning in inference_functions.py (GH#1584)

    • Schema now maintains column order after renaming a column (GH#1594)

    • Fixed logic to not set config during boolean transform (GH#1601)

  • Changes
    • Rename backup_dtype to pyspark_dtype (GH#1593)

    • Removed inference for ["0", "1"], ["0.0", "1.0"], and [0, 1] as Boolean logical types, but maintained forced inference of such values (GH#1600)

Thanks to the following people for contributing to this release: @bchen1116, @sbadithe

v0.21.0 December 1, 2022#

  • Enhancements
    • Improved Boolean and BooleanNullable inference to detect common string representations of boolean values (GH#1549)

    • Added the get_outliers and medcouple_dict functions to WoodworkColumnAccessor so that the medcouple statistic can be used for outlier detection (GH#1547)

  • Fixes
    • Resolved FutureWarning in _get_box_plot_info_for_column (GH#1563)

    • Fixed error message in validate method in logical_types.py (GH#1565)

    • Fixed IntegerNullable inference by checking values are within valid Int64 bounds (GH#1572)

    • Update demo dataset links to point to new endpoint (GH#1570)

    • Fix DivisionByZero error in type_system.py (GH#1571)

    • Fix Categorical dtype inference for PostalCode logical type (GH#1574)

    • Fixed issue where forcing a Boolean logical type on a column of 0.0s and 1.0s caused incorrect transformation (GH#1576)

  • Changes
    • Unpin dask dependency (GH#1561)

    • Changed the sampling strategy for type inference from head to random (GH#1566)

  • Documentation Changes
    • Updated documentation to include the get_outliers and medcouple_dict (GH#1547)

  • Testing Changes
    • Run looking glass performance tests on merge (GH#1567)

Thanks to the following people for contributing to this release: @bchen1116, @gsheni, @ParthivNaresh, @sbadithe, @simha104

Breaking Changes#

  • GH#1549 will automatically infer more values as Boolean or BooleanNullable, including, but not limited to, [0, 1], ['yes', 'no'], and ["True", "False"].
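
To make the effect of this breaking change concrete, here is a minimal sketch of matching a column against boolean-like value pairs. It is not the inference function Woodwork uses: `BOOLEAN_PAIRS` and `looks_boolean` are hypothetical names, and the pairs are limited to the examples listed above, whereas the note says the real inference recognizes more.

```python
# Hypothetical boolean-like pairs, taken only from the examples in the note;
# the actual inference is described as covering more representations.
BOOLEAN_PAIRS = [
    {"0", "1"},
    {"yes", "no"},
    {"true", "false"},
]


def looks_boolean(values):
    """Return True if the non-null values all fall within one boolean-like pair."""
    normalized = {str(v).strip().lower() for v in values if v is not None}
    if not normalized:
        # A column with no usable values gives no evidence either way.
        return False
    return any(normalized <= pair for pair in BOOLEAN_PAIRS)


numeric_flags = looks_boolean([0, 1, 1, 0])
yes_no = looks_boolean(["yes", "no", "Yes"])
with_nulls = looks_boolean(["True", "False", None])
free_text = looks_boolean(["yes", "maybe"])
```

Columns that previously fell through to a categorical or integer type can therefore come back as Boolean or BooleanNullable after upgrading.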

v0.20.0 October 31, 2022#

  • Enhancements
    • Replace use of deprecated append method for dataframes and series with concat method (GH#1533)

  • Fixes
    • Fixed bug relating to dependence calculations to ensure columns exist in dataframe (GH#1534)

    • Small typo fix in select docstring (GH#1544)

    • Fix TypeValidationError message (GH#1557)

    • Set dask version below 2022.10.1 (GH#1558)

Thanks to the following people for contributing to this release: @bchen1116, @sbadithe

v0.19.0 September 27, 2022#

  • Enhancements
    • Added Spearman Correlation to options for dependence calculations (GH#1523)

    • Added ignore_zeros as an argument for box_plot_dict to allow for calculations of outliers without 0 values (GH#1524)

    • Added target_col argument to dependence and dependence_dict to calculate correlations between features and target_col (GH#1531)

  • Fixes
    • Fix datetime pivot point to be set at current year + 10 rather than the default for two-digit years when datetime_format provided (GH#1512)

  • Changes
    • Added ignore_columns as an argument when initializing a dataframe (GH#1504)

    • Remove dask[dataframe] version restriction (GH#1527)

  • Testing Changes
    • Add kickoff for create conda forge pull request from release (GH#1515)

Thanks to the following people for contributing to this release: @bchen1116, @gsheni, @ParthivNaresh, @thehomebrewnerd
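
The two-digit year fix in this release (GH#1512) places the pivot at the current year plus 10. The sketch below shows one way such a sliding pivot can be applied. It is an assumption about the general technique, not Woodwork's code, and `resolve_two_digit_year` is a hypothetical helper.

```python
from datetime import date


def resolve_two_digit_year(two_digit_year, today=None, window=10):
    """Map a two-digit year to a four-digit year using a sliding pivot.

    The result is the latest year ending in those two digits that does not
    exceed `current year + window`; later candidates move back a century.
    """
    if not 0 <= two_digit_year <= 99:
        raise ValueError("expected a value between 0 and 99")
    today = today or date.today()
    pivot = today.year + window
    # Start in the pivot's century, then step back if we overshoot the pivot.
    candidate = (pivot - pivot % 100) + two_digit_year
    if candidate > pivot:
        candidate -= 100
    return candidate


release_day = date(2022, 9, 27)  # pivot year is 2032
near_future = resolve_two_digit_year(30, today=release_day)
at_pivot = resolve_two_digit_year(32, today=release_day)
past_pivot = resolve_two_digit_year(33, today=release_day)
```

With a fixed default pivot, the same input can resolve to a different century than a user expects; tying the pivot to the current year keeps recent and near-future dates in the current century.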

v0.18.0 August 31, 2022#

  • Enhancements
    • Updated dependence_dict and mutual_information to drop Categorical columns with a large number of unique values during mutual information calculation (non-Dask only) (GH#1501)

  • Fixes
    • Fix applying LatLong.transform to empty dask data (GH#1507)

  • Changes
    • Transition from setup.cfg to pyproject.toml (GH#1506, GH#1508)

    • Added a check to see if a series dtype has changed prior to using _replace_nans (GH#1502)

  • Testing Changes
    • Update development requirements and use latest for documentation (GH#1499)

Thanks to the following people for contributing to this release: @bchen1116, @gsheni, @jeff-hernandez, @ParthivNaresh, @rwedge

v0.17.2 August 5, 2022#

  • Fixes
    • Updated concat_columns to work with dataframes with mismatched indices or different shapes (GH#1485)

  • Documentation Changes
    • Add instructions to add new users to woodwork feedstock (GH#1483)

  • Testing Changes
    • Add create feedstock PR workflow (GH#1489)

Thanks to the following people for contributing to this release: @chukarsten, @cmancuso, @gsheni

v0.17.1 July 29, 2022#

  • Testing Changes
    • Allow for manual kickoff for minimum dependency checker (GH#1476)

Thanks to the following people for contributing to this release: @bchen1116, @gsheni

v0.17.0 July 14, 2022#

Warning

This release of Woodwork does not support Python 3.7.

  • Enhancements
    • Added ability to null invalid values for Double logical type (GH#1449)

    • Added ability to null invalid values for BooleanNullable logical type (GH#1455)

    • Added ability to null invalid values for IntegerNullable logical type (GH#1456)

    • Added ability to null invalid values for EmailAddress logical type (GH#1457)

    • Added ability to null invalid values for URL logical type (GH#1459)

    • Added ability to null invalid values for PhoneNumber logical type (GH#1460)

    • Added ability to null invalid values for AgeFractional and AgeNullable logical types (GH#1462)

    • Added ability to null invalid values for LatLong logical type (GH#1465)

    • Added ability to null invalid values for PostalCode logical type (US only) (GH#1467)

    • Added smarter inference for IntegerNullable and BooleanNullable types (GH#1458)

  • Fixes
    • Fixed inference of all null string values as Unknown instead of Datetime (GH#1458)

  • Changes
    • Set the minimum acceptable version of pandas to 1.4.0 for woodwork and 1.4.3 for spark add-on (GH#1461)

    • Dropped support for Python 3.7 (GH#1461)

    • Add pre-commit hooks for linting (GH#1470)

Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @ParthivNaresh

v0.16.4 Jun 23, 2022#

  • Fixes
    • Fix concatenation of invalid logical type values (GH#1437)

    • Fix validation for numeric postal codes (GH#1439)

  • Changes
    • Restrict pyspark below 3.3.0 (GH#1450)

  • Documentation Changes
    • Add slack icon to footer in docs (GH#1432)

    • Update contributing.md to add pandoc (GH#1443)

  • Testing Changes
    • Use codecov action v3 (GH#1422)

    • Added tests to test minimum dependencies of minimum dependencies (GH#1440)

    • Add workflow to kickoff EvalML unit tests on commit to main (GH#1424, GH#1426)

    • Rename yml to yaml for GitHub Actions (GH#1428, GH#1429)

Thanks to the following people for contributing to this release: @bchen1116, @gsheni, @jeff-hernandez, @ParthivNaresh

v0.16.3 May 4, 2022#

  • Fixes
    • Fixed col_is_datetime inference function to not infer numeric dtypes as datetime (GH#1413)

  • Changes
    • Delete setup.py, MANIFEST.in and move configuration to pyproject.toml (GH#1409)

  • Documentation Changes
  • Testing Changes
    • Add workflow to kickoff Featuretools unit tests with Woodwork main (GH#1400)

    • Add workflow for testing Woodwork without test dependencies (GH#1414)

Thanks to the following people for contributing to this release: @bchen1116, @gsheni, @ParthivNaresh

v0.16.2 Apr 25, 2022#

  • Fixes
    • Fixed import issues regarding pyarrow and made python-dateutil>=2.8.1 a required dependency (GH#1397)

Thanks to the following people for contributing to this release: @ParthivNaresh

v0.16.1 Apr 25, 2022#

  • Fixes
    • Reverting string[pyarrow] until fix can be found for pandas issue (GH#1391)

Thanks to the following people for contributing to this release: @ParthivNaresh

v0.16.0 Apr 21, 2022#

  • Enhancements
    • Added the ability to provide a callback function to TableAccessor.describe() to get intermediate results (GH#1387)

    • Add pearson_correlation and dependence methods to TableAccessor (GH#1265)

    • Uses string[pyarrow] instead of string dtype to save memory (GH#1360)

    • Added a better error message when dataframe and schema have different columns (GH#1366)

    • Stores timezones in Datetime logical type (GH#1376)

    • Added type inference for phone numbers (GH#1357)

    • Added type inference for zip code (GH#1378)

  • Fixes
  • Changes
    • Change underlying logic of TableAccessor.mutual_information (GH#1265)

    • Added from_disk as a convenience function to deserialize a WW table (GH#1363)

    • Allow attr version in setup.cfg (GH#1361)

    • Raise error if files already exist during serialization (GH#1356)

    • Improve exception handling in col_is_datetime (GH#1365)

    • Store typing info in parquet file header during serialization (GH#1377)

  • Documentation Changes
    • Upgrade nbconvert and remove jinja2 dependency (GH#1362)

    • Add M1 installation instructions to docs and contributing guide (GH#1367)

    • Update README text to Alteryx (GH#1381, GH#1382)

  • Testing Changes
    • Separate testing matrix to speed up GitHub Actions Linux tests for latest dependencies GH#1380

Thanks to the following people for contributing to this release: @bchen1116, @gsheni, @jeff-hernandez, @ParthivNaresh, @rwedge, @thehomebrewnerd

v0.15.0 Mar 24, 2022#

  • Enhancements
    • Added CurrencyCode to logical types (GH#1348)

    • Added Datetime Frequency Inference V2 (GH#1281)

  • Fixes
    • Updated __str__ output for Ordinal logical types (GH#1340)

  • Changes
    • Updated lint check to only run on Python 3.10 (GH#1345)

    • Transition to use pyproject.toml and setup.cfg (moving away from setup.py) (GH#1346)

  • Documentation Changes
    • Update release.md with correct version updating info (GH#1358)

  • Testing Changes
    • Updated scheduled workflows to only run on Alteryx owned repos (GH#1351)

Thanks to the following people for contributing to this release: @bchen1116, @dvreed77, @jeff-hernandez, @ParthivNaresh, @thehomebrewnerd

v0.14.0 Mar 15, 2022#

  • Fixes
    • Preserve custom semantic tags when changing column logical type (GH#1300)

  • Changes
    • Calculate nunique for Unknown columns in _get_describe_dict (GH#1322)

    • Refactor serialization and deserialization for improved modularity (GH#1325)

    • Replace Koalas with the pandas API on Spark (GH#1331)

  • Documentation Changes
    • Update copy and paste button to remove syntax signs (GH#1313)

    • Move LatLong and Ordinal logical type validation logic to LogicalType.validate methods (GH#1315)

    • Add backport release support (GH#1321)

    • Add get_subset_schema to API reference (GH#1335)

  • Testing Changes

Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd, @mingdavidqi

Breaking Changes#

  • GH#1325: The following serialization functions have been removed from the API: woodwork.serialize.write_dataframe, woodwork.serialize.write_typing_info and woodwork.serialize.write_woodwork_table. Also, the function woodwork.serialize.typing_info_to_dict has been moved to woodwork.serializers.serializer_base.typing_info_to_dict.

v0.13.0 Feb 16, 2022#

Warning

Woodwork may not support Python 3.7 in next non-bugfix release.

  • Enhancements
    • Add validation to EmailAddress logical type (GH#1247)

    • Add validation to URL logical type (GH#1285)

    • Add validation to Age, AgeFractional, and AgeNullable logical types (GH#1289)

  • Fixes
    • Check range length in table stats without producing overflow error (GH#1287)

    • Fixes issue with initializing Woodwork Series with LatLong values (GH#1299)

  • Changes
    • Remove framework for unused woodwork CLI (GH#1288)

    • Add back support for Python 3.7 (GH#1292)

    • Nested statistical utility functions into directory (GH#1295)

  • Documentation Changes
    • Updating contributing doc with PATH and JAVA_HOME instructions (GH#1273)

    • Better install page with new Sphinx extensions for copying and in-line tabs (GH#1280, GH#1282)

    • Update README.md with Alteryx link (GH#1291)

  • Testing Changes
    • Replace mock with unittest.mock (GH#1304)

Thanks to the following people for contributing to this release: @dvreed77, @gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd

v0.12.0 Jan 27, 2022#

  • Enhancements
    • Add Slack link to GitHub issue creation templates (GH#1242)

  • Fixes
    • Fixed issue with tuples being incorrectly inferred as EmailAddress (GH#1253)

    • Set high and low bounds to the max and min values if no outliers are present in box_plot_dict (GH#1269)

  • Changes
    • Prevent setting index that contains null values (GH#1239)

    • Allow tuple NaN LatLong values (GH#1255)

    • Update ipython to 7.31.1 (GH#1258)

    • Temporarily restrict pandas and koalas max versions (GH#1261)

    • Update to drop Python 3.7 support and add support for pandas version 1.4.0 (GH#1264)

  • Testing Changes
    • Change auto approve workflow to use PR number (GH#1240, GH#1241)

    • Update auto approve workflow to delete branch and change on trigger (GH#1251)

    • Fix permissions issue with S3 deserialization test (GH#1238)

Thanks to the following people for contributing to this release: @dvreed77, @gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd

v0.11.2 Jan 28, 2022#

  • Fixes
    • Set high and low bounds to the max and min values if no outliers are present in box_plot_dict (backport of GH#1269)

Thanks to the following people for contributing to this release: @tamargrey

Note#

  • The pandas version for Koalas has been restricted, and a change was made to a pandas replace call to account for the recent pandas 1.4.0 release.

v0.11.1 Jan 4, 2022#

  • Changes
    • Update inference process to only check for NaturalLanguage if no other type matches are found first (GH#1234)

  • Documentation Changes
    • Updating contributing doc with Spark installation instructions (GH#1232)

  • Testing Changes

Thanks to the following people for contributing to this release: @gsheni, @thehomebrewnerd, @willsmithorg

v0.11.0 Dec 22, 2021#

  • Enhancements
    • Add type inference for natural language (GH#1210)

  • Changes
    • Make public method get_subset_schema (GH#1218)

Thanks to the following people for contributing to this release: @jeff-hernandez, @thehomebrewnerd, @tuethan1999

v0.10.0 Nov 30, 2021#

  • Enhancements
    • Allow frequency inference on temporal (Datetime, Timedelta) columns of Woodwork DataFrame (GH#1202)

    • Update describe_dict to compute top_values for double columns that contain only integer values (GH#1206)

  • Changes
    • Return histogram bins as a list of floats instead of a pandas.Interval object (GH#1207)

Thanks to the following people for contributing to this release: @tamargrey, @thehomebrewnerd

Breaking Changes#

  • GH#1207: The behavior of describe_dict has changed when using extra_stats=True. Previously, the histogram bins were returned as pandas.Interval objects. Now each histogram bin is represented as a two-element list of floats, with the first element being the left edge of the bin and the second element being the right edge.
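
As an illustration of the new bin representation, the sketch below builds equal-width bins in plain Python and stores each one as a `[left, right]` list of floats. It is not how describe_dict computes its histogram: the function name `histogram` and the dictionary keys `bins` and `frequency` are assumptions chosen for the example.

```python
def histogram(values, n_bins=4):
    """Count values into equal-width bins, each described as [left, right]."""
    low, high = min(values), max(values)
    if low == high:
        # Degenerate case: every value is identical, so use a single bin.
        return [{"bins": [float(low), float(high)], "frequency": len(values)}]
    width = (high - low) / n_bins
    result = [
        {"bins": [float(low + i * width), float(low + (i + 1) * width)], "frequency": 0}
        for i in range(n_bins)
    ]
    for value in values:
        # Clamp the index so the maximum value lands in the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        result[index]["frequency"] += 1
    return result


bins = histogram([0, 1, 2, 3, 4, 5, 6, 7, 8], n_bins=4)
edges = [entry["bins"] for entry in bins]
counts = [entry["frequency"] for entry in bins]
```

Plain lists of floats serialize directly to JSON, which interval objects do not, so code that previously read `.left` and `.right` attributes needs to index the list instead.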

v0.9.1 Nov 19, 2021#

  • Fixes
    • Fix bug that causes mutual_information to fail with certain index types (GH#1199)

  • Changes
    • Update pip to 21.3.1 for test requirements (GH#1196)

  • Documentation Changes
    • Update install page with updated minimum optional dependencies (GH#1193)

Thanks to the following people for contributing to this release: @gsheni, @thehomebrewnerd

v0.9.0 Nov 11, 2021#

  • Enhancements
    • Added read_file parameter for replacing empty string values with NaN values (GH#1161)

  • Fixes
    • Set a maximum version for pyspark until we understand why GH#1169 failed (GH#1179)

    • Require newer dask version (GH#1180)

  • Changes
    • Make box plot low/high indices/values optional to return in box_plot_dict (GH#1184)

  • Documentation Changes
    • Update docs dependencies (GH#1176)

  • Testing Changes
    • Add black linting package and remove autopep8 (GH#1164, GH#1183)

    • Updated notebook standardizer to standardize python versions (GH#1166)

Thanks to the following people for contributing to this release: @bchen1116, @davesque, @gsheni, @rwedge, @tamargrey, @thehomebrewnerd

v0.8.2 Oct 12, 2021#

  • Fixes
    • Fixed an issue when inferring the format of datetime strings with day of week or meridiem placeholders (GH#1158)

    • Implements change in Datetime.transform to prevent initialization failure in some cases (GH#1162)

  • Testing Changes
    • Update reviewers for minimum and latest dependency checkers (GH#1150)

    • Added notebook standardizer to remove executed outputs (GH#1153)

Thanks to the following people for contributing to this release: @bchen1116, @davesque, @jeff-hernandez, @thehomebrewnerd

v0.8.1 Sep 16, 2021#

  • Changes
    • Update Datetime.transform to use default nrows value when calling _infer_datetime_format (GH#1137)

  • Documentation Changes
    • Hide spark config in Using Dask and Koalas Guide (GH#1139)

Thanks to the following people for contributing to this release: @jeff-hernandez, @simha104, @thehomebrewnerd

v0.8.0 Sep 9, 2021#

  • Enhancements
    • Add support for automatically inferring the URL and IPAddress logical types (GH#1122, GH#1124)

    • Add get_valid_mi_columns method to list columns that have valid logical types for mutual information calculation (GH#1129)

    • Add attribute to check if column has a nullable logical type (GH#1127)

  • Changes
    • Update get_invalid_schema_message to improve performance (GH#1132)

  • Documentation Changes
    • Fix typo in the “Get Started” documentation (GH#1126)

    • Clean up the logical types guide (GH#1134)

Thanks to the following people for contributing to this release: @ajaypallekonda, @davesque, @jeff-hernandez, @thehomebrewnerd

v0.7.1 Aug 25, 2021#

  • Fixes
    • Validate schema’s index if being used in partial schema init (GH#1115)

    • Allow falsy index, time index, and name values to be set along with partial schema at init (GH#1115)

Thanks to the following people for contributing to this release: @tamargrey

v0.7.0 Aug 25, 2021#

  • Enhancements
    • Add 'passthrough' and 'ignore' to tags in list_semantic_tags (GH#1094)

    • Add initialize with partial table schema (GH#1100)

    • Apply ordering specified by the Ordinal logical type to underlying series (GH#1097)

    • Add AgeFractional logical type (GH#1112)

Thanks to the following people for contributing to this release: @davesque, @jeff-hernandez, @tamargrey, @tuethan1999

Breaking Changes#

  • GH#1100: The behavior of init has changed. A full schema contains all of the columns of the DataFrame it describes, whereas a partial schema contains only a subset. A full schema also requires that the schema be valid without any changes to the DataFrame. Previously, only a full schema was permitted by the init method, so passing a partial schema raised an error, and parameters such as logical_types were ignored when a schema was passed. Now, passing a partial schema to the init method calls the init_with_partial_schema method instead of raising an error. Information from keyword arguments overrides information from the partial schema. For example, if column a has the Integer logical type in the partial schema, it is possible to use the logical_types argument to reinfer its logical type by passing {'a': None}, or to force a type by passing {'a': Double}. These changes make Woodwork init less restrictive. If no type inference takes place and no changes are required of the DataFrame at initialization, init_with_full_schema should be used instead of init. init_with_full_schema maintains the same functionality as when a schema was passed to the old init.
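
The precedence described above (keyword arguments first, then the partial schema, then inference) can be sketched without Woodwork itself. Everything in this block is a stand-in: logical types are plain strings, and `infer_type` and `resolve_logical_types` are hypothetical helpers rather than part of the Woodwork API.

```python
def infer_type(column_values):
    """Hypothetical stand-in for type inference on a single column."""
    if all(isinstance(v, int) for v in column_values):
        return "Integer"
    if all(isinstance(v, (int, float)) for v in column_values):
        return "Double"
    return "Unknown"


def resolve_logical_types(data, partial_schema, logical_types=None):
    """Pick a type per column: keyword argument, then partial schema, then inference."""
    logical_types = logical_types or {}
    resolved = {}
    for name, values in data.items():
        if name in logical_types:
            requested = logical_types[name]
            # None asks for the column's type to be inferred again.
            resolved[name] = infer_type(values) if requested is None else requested
        elif name in partial_schema:
            resolved[name] = partial_schema[name]
        else:
            resolved[name] = infer_type(values)
    return resolved


data = {"a": [1, 2, 3], "b": [1.5, 2.0], "c": ["x", "y"]}
partial_schema = {"a": "Double"}  # covers only a subset of the columns

from_schema = resolve_logical_types(data, partial_schema)
forced = resolve_logical_types(data, partial_schema, {"a": "Categorical"})
reinferred = resolve_logical_types(data, partial_schema, {"a": None})
```

Columns missing from the partial schema (`b` and `c` here) fall through to inference, which is why a partial schema can require changes to the DataFrame that a full schema does not.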

v0.6.0 Aug 4, 2021#

  • Fixes
    • Fix bug in _infer_datetime_format with all np.nan input (GH#1089)

  • Changes
    • The criteria for categorical type inference have changed (GH#1065)

    • The meanings of the categorical_threshold and numeric_categorical_threshold settings have changed (GH#1065)

    • Make sampling for type inference more consistent (GH#1083)

    • Accessor logic checking if Woodwork has been initialized moved to decorator (GH#1093)

  • Documentation Changes
    • Fix some release notes that ended up under the wrong release (GH#1082)

    • Add BooleanNullable and IntegerNullable types to the docs (GH#1085)

    • Add guide for saving and loading Woodwork DataFrames (GH#1066)

    • Add in-depth guide on logical types and semantic tags (GH#1086)

  • Testing Changes
    • Add additional reviewers to minimum and latest dependency checkers (GH#1070, GH#1073, GH#1077)

    • Update the sample_df fixture to have more logical_type coverage (GH#1058)

Thanks to the following people for contributing to this release: @davesque, @gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd, @tuethan1999

Breaking Changes#

  • GH#1065: The criteria for categorical type inference have changed. Relatedly, the meanings of the categorical_threshold and numeric_categorical_threshold settings have changed. Now, a categorical match is signaled when a series either has the “categorical” pandas dtype or when the ratio of its unique value count to its total value count (NaN values excluded from both) is less than or equal to some fraction. The value used for this fraction is set by the categorical_threshold setting, which now has a default value of 0.2. If a fraction is set for the numeric_categorical_threshold setting, then series with either a float or integer dtype may be inferred as categorical by applying the same logic with the numeric_categorical_threshold fraction. Otherwise, the numeric_categorical_threshold setting defaults to None, which indicates that series with a numeric dtype should not be inferred as categorical. Users who have overridden either the categorical_threshold or numeric_categorical_threshold settings will need to adjust their settings accordingly.

  • GH#1083: The process of sampling series for logical type inference was updated to be more consistent. Before, initial sampling for inference differed depending on collection type (pandas, dask, or koalas). Also, further randomized subsampling was performed in some cases during categorical inference and in every case during email inference regardless of collection type. Overall, the way sampling was done was inconsistent and unpredictable. Now, the first 100,000 records of a column are sampled for logical type inference regardless of collection type although only records from the first partition of a dask dataset will be used. Subsampling performed by the inference functions of individual types has been removed. The effect of these changes is that inferred types may now be different although in many cases they will be more correct.
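
The two rules above, sampling the first 100,000 records and comparing the unique-to-total ratio against a threshold, can be sketched in plain Python. This illustrates the described logic only and is not Woodwork's inference function; `is_missing` and `looks_categorical` are hypothetical helpers, and the "categorical" dtype shortcut is omitted.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def looks_categorical(values, threshold=0.2, sample_size=100_000):
    """Apply the unique-to-total ratio test to the first `sample_size` records."""
    if threshold is None:
        # Mirrors a threshold of None: never infer categorical.
        return False
    sample = values[:sample_size]
    present = [v for v in sample if not is_missing(v)]
    if not present:
        return False
    return len(set(present)) / len(present) <= threshold


repeated = looks_categorical(["a", "b"] * 10)              # ratio 2 / 20
all_unique = looks_categorical(list(range(20)))            # ratio 20 / 20
at_threshold = looks_categorical(["a", "b"] * 5 + [None])  # ratio 2 / 10, None ignored
disabled = looks_categorical([1, 1, 1, 1, 1, 2], threshold=None)
```

Because only the head of the column is sampled, a column whose early rows are unrepresentative can be typed differently from one that is shuffled.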

v0.5.1 Jul 22, 2021#

  • Enhancements
    • Store inferred datetime format on Datetime logical type instance (GH#1025)

    • Add support for automatically inferring the EmailAddress logical type (GH#1047)

    • Add feature origin attribute to schema (GH#1056)

    • Add ability to calculate outliers and the statistical info required for box and whisker plots to WoodworkColumnAccessor (GH#1048)

    • Add ability to change config settings in a with block with ww.config.with_options (GH#1062)

  • Fixes
    • Raises warning and removes tags when user adds a column with index tags to DataFrame (GH#1035)

  • Changes
    • Entirely null columns are now inferred as the Unknown logical type (GH#1043)

    • Add helper functions that check for whether an object is a koalas/dask series or dataframe (GH#1055)

    • TableAccessor.select method will now maintain dataframe column ordering in TableSchema columns (GH#1052)

  • Documentation Changes
    • Add supported types to metadata docstring (GH#1049)

Thanks to the following people for contributing to this release: @davesque, @frances-h, @jeff-hernandez, @simha104, @tamargrey, @thehomebrewnerd

v0.5.0 Jul 7, 2021#

  • Enhancements
    • Add support for numpy array inputs to Woodwork (GH#1023)

    • Add support for pandas.api.extensions.ExtensionArray inputs to Woodwork (GH#1026)

  • Fixes
    • Add input validation to ww.init_series (GH#1015)

  • Changes
    • Remove lines in LogicalType.transform that raise error if dtype conflicts (GH#1012)

    • Add infer_datetime_format param to speed up to_datetime calls (GH#1016)

    • The default logical type is now the Unknown type instead of the NaturalLanguage type (GH#992)

    • Add pandas 1.3.0 compatibility (GH#987)

Thanks to the following people for contributing to this release: @jeff-hernandez, @simha104, @tamargrey, @thehomebrewnerd, @tuethan1999

Breaking Changes#

  • The default logical type is now the Unknown type instead of the NaturalLanguage type. The global config natural_language_threshold has been renamed to categorical_threshold.

v0.4.2 Jun 23, 2021#

  • Enhancements
    • Pass additional progress information in callback functions (GH#979)

    • Add the ability to generate optional extra stats with DataFrame.ww.describe_dict (GH#988)

    • Add option to read and write orc files (GH#997)

    • Retain schema when calling series.ww.to_frame() (GH#1004)

  • Fixes
    • Raise type conversion error in Datetime logical type (GH#1001)

    • Try collections.abc to avoid deprecation warning (GH#1010)

  • Changes
    • Remove make_index parameter from DataFrame.ww.init (GH#1000)

    • Remove version restriction for dask requirements (GH#998)

  • Documentation Changes
    • Add instructions for installing the update checker (GH#993)

    • Disable pdf format with documentation build (GH#1002)

    • Silence deprecation warnings in documentation build (GH#1008)

    • Temporarily remove update checker to fix docs warnings (GH#1011)

  • Testing Changes

Thanks to the following people for contributing to this release: @frances-h, @gsheni, @jeff-hernandez, @tamargrey, @thehomebrewnerd, @tuethan1999

Breaking Changes#

  • The parameters of progress callback functions have changed. Progress is now reported in the units specified by the unit of measurement parameter instead of as a percentage of the total. Progress callback functions are now expected to accept the following five parameters:

    • progress increment since last call

    • progress units complete so far

    • total units to complete

    • the progress unit of measurement

    • time elapsed since start of calculation

  • DataFrame.ww.init no longer accepts the make_index parameter
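
The five-parameter callback described above can be sketched as follows. This is a minimal illustration, not Woodwork's own code: the parameter names and the `run_with_progress` driver are hypothetical stand-ins, and only the order of the five arguments is taken from the list above.

```python
import time


def print_progress(update, progress, total, unit, time_elapsed):
    """Hypothetical callback; argument order follows the five parameters listed above."""
    percent = 100 * progress / total if total else 0.0
    print(f"+{update} {unit} ({progress}/{total}, {percent:.0f}%) after {time_elapsed:.2f}s")


def run_with_progress(callback, steps):
    """Stand-in for a long-running calculation (NOT part of Woodwork)."""
    start = time.perf_counter()
    total = sum(steps)
    done = 0
    for step in steps:
        done += step
        # increment, units complete so far, total units, unit of measurement, time elapsed
        callback(step, done, total, "calculations", time.perf_counter() - start)
    return done


run_with_progress(print_progress, [2, 3, 5])
```

A callback written against the old percentage-based interface would need to compute the percentage itself, as `print_progress` does here.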

v0.4.1 Jun 9, 2021#

  • Enhancements
    • Add concat_columns util function to concatenate multiple Woodwork objects into one, retaining typing information (GH#932)

    • Add option to pass progress callback function to mutual information functions (GH#958)

    • Add optional automatic update checker (GH#959, GH#970)

  • Fixes
    • Fix issue related to serialization/deserialization of data with whitespace and newline characters (GH#957)

    • Update to allow initializing a ColumnSchema object with an Ordinal logical type without order values (GH#972)

  • Changes
    • Change write_dataframe to only copy dataframe if it contains LatLong (GH#955)

  • Testing Changes
    • Fix bug in test_list_logical_types_default (GH#954)

    • Update minimum unit tests to run on all pull requests (GH#952)

    • Pass token to authorize uploading of codecov reports (GH#969)

Thanks to the following people for contributing to this release: @frances-h, @gsheni, @tamargrey, @thehomebrewnerd

v0.4.0 May 26, 2021#

  • Enhancements
    • Add option to return TableSchema instead of DataFrame from table accessor select method (GH#916)

    • Add option to read and write arrow/feather files (GH#948)

    • Add dropping and renaming columns inplace (GH#920)

    • Add option to pass progress callback function to mutual information functions (GH#943)

  • Fixes
    • Fix bug when setting table name and metadata through accessor (GH#942)

    • Fix bug in which the dtype of category values were not restored properly on deserialization (GH#949)

  • Changes
    • Add logical type method to transform data (GH#915)

  • Testing Changes
    • Update when minimum unit tests will run to include minimum text files (GH#917)

    • Create separate workflows for each CI job (GH#919)

Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @thehomebrewnerd, @tuethan1999

v0.3.1 May 12, 2021#

Warning

This Woodwork release uses a weak reference for maintaining a reference from the accessor to the DataFrame. Because of this, chaining a Woodwork call onto another call that creates a new DataFrame or Series object can be problematic.

Instead of calling pd.DataFrame({'id':[1, 2, 3]}).ww.init(), first store the DataFrame in a new variable and then initialize Woodwork:

import pandas as pd
import woodwork as ww  # importing Woodwork makes the ww namespace available

df = pd.DataFrame({'id':[1, 2, 3]})
df.ww.init()
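
The reason chaining is problematic can be illustrated with the standard library alone. The sketch below does not use Woodwork; `Frame` and `Accessor` are hypothetical stand-ins that only mimic the weak-reference relationship described in the warning.

```python
import gc
import weakref


class Frame:
    """Hypothetical stand-in for a DataFrame."""

    def __init__(self, data):
        self.data = data


class Accessor:
    """Hypothetical accessor that keeps only a weak reference to its frame."""

    def __init__(self, frame):
        self._frame_ref = weakref.ref(frame)

    def frame(self):
        # Returns None once the referenced frame has been garbage collected.
        return self._frame_ref()


# Chained: the temporary Frame has no other strong reference keeping it alive.
chained = Accessor(Frame([1, 2, 3]))
gc.collect()
chained_alive = chained.frame() is not None

# Stored first: the variable `frame` keeps the object alive.
frame = Frame([1, 2, 3])
stored = Accessor(frame)
stored_alive = stored.frame() is frame

print(chained_alive, stored_alive)
```

On CPython the chained accessor is expected to lose its frame as soon as the temporary object goes out of scope, while the stored one keeps working for as long as `frame` is referenced.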
  • Enhancements
    • Add deep parameter to Woodwork Accessor and Schema equality checks (GH#889)

    • Add support for reading from parquet files to woodwork.read_file (GH#909)

  • Changes
    • Remove command line functions for list logical and semantic tags (GH#891)

    • Keep index and time index tags for single column when selecting from a table (GH#888)

    • Update accessors to store weak reference to data (GH#894)

  • Documentation Changes
    • Update nbsphinx version to fix docs build issue (GH#911, GH#913)

  • Testing Changes
    • Use Minimum Dependency Generator GitHub Action and remove tools folder (GH#897)

    • Move all latest and minimum dependencies into 1 folder (GH#912)

Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @tamargrey, @thehomebrewnerd

Breaking Changes#

  • The command line functions python -m woodwork list-logical-types and python -m woodwork list-semantic-tags no longer exist. Please call the underlying Python functions ww.list_logical_types() and ww.list_semantic_tags().

v0.3.0 May 3, 2021#

  • Enhancements
    • Add is_schema_valid and get_invalid_schema_message functions for checking schema validity (GH#834)

    • Add logical type for Age and AgeNullable (GH#849)

    • Add logical type for Address (GH#858)

    • Add generic to_disk function to save Woodwork schema and data (GH#872)

    • Add generic read_file function to read file as Woodwork DataFrame (GH#878)

  • Fixes
    • Raise error when a column is set as the index and time index (GH#859)

    • Allow NaNs in index for schema validation check (GH#862)

    • Fix bug where invalid casting to Boolean would not raise error (GH#863)

  • Changes
    • Consistently use ColumnNotPresentError for mismatches between user input and dataframe/schema columns (GH#837)

    • Raise custom WoodworkNotInitError when accessing Woodwork attributes before initialization (GH#838)

    • Remove check requiring Ordinal instance for initializing a ColumnSchema object (GH#870)

    • Increase koalas min version to 1.8.0 (GH#885)

  • Documentation Changes
    • Improve formatting of release notes (GH#874)

  • Testing Changes
    • Remove unnecessary argument in codecov upload job (GH#853)

    • Change from GitHub Token to regenerated GitHub PAT for dependency checkers (GH#855)

    • Update README.md with non-nullable dtypes in code example (GH#856)

Thanks to the following people for contributing to this release: @frances-h, @gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd

Breaking Changes#

  • Woodwork tables can no longer be saved to disk using df.ww.to_csv, df.ww.to_pickle, or df.ww.to_parquet. Use df.ww.to_disk instead.

  • The read_csv function has been replaced by read_file.

v0.2.0 Apr 20, 2021#

Warning

This Woodwork release does not support Python 3.6

  • Enhancements
    • Add validation control to WoodworkTableAccessor (GH#736)

    • Store make_index value on WoodworkTableAccessor (GH#780)

    • Add optional exclude parameter to WoodworkTableAccessor select method (GH#783)

    • Add validation control to deserialize.read_woodwork_table and ww.read_csv (GH#788)

    • Add WoodworkColumnAccessor.schema and handle copying column schema (GH#799)

    • Allow initializing a WoodworkColumnAccessor with a ColumnSchema (GH#814)

    • Add __repr__ to ColumnSchema (GH#817)

    • Add BooleanNullable and IntegerNullable logical types (GH#830)

    • Add validation control to WoodworkColumnAccessor (GH#833)

  • Changes
    • Rename FullName logical type to PersonFullName (GH#740)

    • Rename ZIPCode logical type to PostalCode (GH#741)

    • Fix issue with smart-open version 5.0.0 (GH#750, GH#758)

    • Update minimum scikit-learn version to 0.22 (GH#763)

    • Drop support for Python version 3.6 (GH#768)

    • Remove ColumnNameMismatchWarning (GH#777)

    • get_column_dict does not use standard tags by default (GH#782)

    • Make logical_type and name params to _get_column_dict optional (GH#786)

    • Rename Schema object and files to match new table-column schema structure (GH#789)

    • Store column typing information in a ColumnSchema object instead of a dictionary (GH#791)

    • TableSchema does not use standard tags by default (GH#806)

    • Store use_standard_tags on the ColumnSchema instead of the TableSchema (GH#809)

    • Move functions in column_schema.py to be methods on ColumnSchema (GH#829)

  • Documentation Changes
  • Testing Changes
    • Add unit tests against minimum dependencies for python 3.6 on PRs and main (GH#743, GH#753, GH#763)

    • Update spark config for test fixtures (GH#787)

    • Separate latest unit tests into pandas, dask, koalas (GH#813)

    • Update latest dependency checker to generate separate core, koalas, and dask dependencies (GH#815, GH#825)

    • Ignore latest dependency branch when checking for updates to the release notes (GH#827)

    • Change from GitHub PAT to auto generated GitHub Token for dependency checker (GH#831)

    • Expand ColumnSchema semantic tag testing coverage and null logical_type testing coverage (GH#832)

Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd

Breaking Changes#

  • The ZIPCode logical type has been renamed to PostalCode

  • The FullName logical type has been renamed to PersonFullName

  • The Schema object has been renamed to TableSchema

  • With the ColumnSchema object, typing information for a column can no longer be accessed with df.ww.columns[col_name]['logical_type']. Instead use df.ww.columns[col_name].logical_type.

  • The Boolean and Integer logical types will no longer work with data that contains null values. The new BooleanNullable and IntegerNullable logical types should be used if null values are present.

v0.1.0 Mar 22, 2021#

  • Enhancements
    • Implement Schema and Accessor API (GH#497)

    • Add Schema class that holds typing info (GH#499)

    • Add WoodworkTableAccessor class that performs type inference and stores Schema (GH#514)

    • Allow initializing Accessor schema with a valid Schema object (GH#522)

    • Add ability to read in a csv and create a DataFrame with an initialized Woodwork Schema (GH#534)

    • Add ability to call pandas methods from Accessor (GH#538, GH#589)

    • Add helpers for checking if a column is one of Boolean, Datetime, numeric, or categorical (GH#553)

    • Add ability to load demo retail dataset with a Woodwork Accessor (GH#556)

    • Add select to WoodworkTableAccessor (GH#548)

    • Add mutual_information to WoodworkTableAccessor (GH#571)

    • Add WoodworkColumnAccessor class (GH#562)

    • Add semantic tag update methods to column accessor (GH#573)

    • Add describe and describe_dict to WoodworkTableAccessor (GH#579)

    • Add init_series util function for initializing a series with dtype change (GH#581)

    • Add set_logical_type method to WoodworkColumnAccessor (GH#590)

    • Add semantic tag update methods to table schema (GH#591)

    • Add warning if additional parameters are passed along with schema (GH#593)

    • Better warning when accessing column properties before init (GH#596)

    • Update column accessor to work with LatLong columns (GH#598)

    • Add set_index to WoodworkTableAccessor (GH#603)

    • Implement loc and iloc for WoodworkColumnAccessor (GH#613)

    • Add set_time_index to WoodworkTableAccessor (GH#612)

    • Implement loc and iloc for WoodworkTableAccessor (GH#618)

    • Allow updating logical types with set_types and make relevant DataFrame changes (GH#619)

    • Allow serialization of WoodworkColumnAccessor to csv, pickle, and parquet (GH#624)

    • Add DaskColumnAccessor (GH#625)

    • Allow deserialization from csv, pickle, and parquet to Woodwork table (GH#626)

    • Add value_counts to WoodworkTableAccessor (GH#632)

    • Add KoalasColumnAccessor (GH#634)

    • Add pop to WoodworkTableAccessor (GH#636)

    • Add drop to WoodworkTableAccessor (GH#640)

    • Add rename to WoodworkTableAccessor (GH#646)

    • Add DaskTableAccessor (GH#648)

    • Add Schema properties to WoodworkTableAccessor (GH#651)

    • Add KoalasTableAccessor (GH#652)

    • Adds __getitem__ to WoodworkTableAccessor (GH#633)

    • Update Koalas min version and add support for more new pandas dtypes with Koalas (GH#678)

    • Adds __setitem__ to WoodworkTableAccessor (GH#669)

  • Fixes
    • Create new Schema object when performing pandas operation on Accessors (GH#595)

    • Fix bug in _reset_semantic_tags causing columns to share same semantic tags set (GH#666)

    • Maintain column order in DataFrame and Woodwork repr (GH#677)

  • Changes
    • Move mutual information logic to statistics utils file (GH#584)

    • Bump min Koalas version to 1.4.0 (GH#638)

    • Preserve pandas underlying index when not creating a Woodwork index (GH#664)

    • Restrict Koalas version to <1.7.0 due to breaking changes (GH#674)

    • Clean up dtype usage across Woodwork (GH#682)

    • Improve error when calling accessor properties or methods before init (GH#683)

    • Remove dtype from Schema dictionary (GH#685)

    • Add include_index param and allow unique columns in Accessor mutual information (GH#699)

    • Include DataFrame equality and use_standard_tags in WoodworkTableAccessor equality check (GH#700)

    • Remove DataTable and DataColumn classes to migrate towards the accessor approach (GH#713)

    • Change sample_series dtype to not need conversion and remove convert_series util (GH#720)

    • Rename Accessor methods since DataTable has been removed (GH#723)

  • Documentation Changes
    • Update README.md and Get Started guide to use accessor (GH#655, GH#717)

    • Update Understanding Types and Tags guide to use accessor (GH#657)

    • Update docstrings and API Reference page (GH#660)

    • Update statistical insights guide to use accessor (GH#693)

    • Update Customizing Type Inference guide to use accessor (GH#696)

    • Update Dask and Koalas guide to use accessor (GH#701)

    • Update index notebook and install guide to use accessor (GH#715)

    • Add section to documentation about schema validity (GH#729)

    • Update README.md and Get Started guide to use pd.read_csv (GH#730)

    • Make small fixes to documentation formatting (GH#731)

  • Testing Changes
    • Add tests to Accessor/Schema that weren’t previously covered (GH#712, GH#716)

    • Update release branch name in notes update check (GH#719)

Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @johnbridstrup, @tamargrey, @thehomebrewnerd

Breaking Changes#

  • The DataTable and DataColumn classes have been removed and replaced by new WoodworkTableAccessor and WoodworkColumnAccessor classes which are used through the ww namespace available on DataFrames after importing Woodwork.

v0.0.11 Mar 15, 2021#

  • Changes
    • Restrict Koalas version to <1.7.0 due to breaking changes (GH#674)

    • Include unique columns in mutual information calculations (GH#687)

    • Add parameter to include index column in mutual information calculations (GH#692)

  • Documentation Changes
    • Update to remove warning message from statistical insights guide (GH#690)

  • Testing Changes
    • Update branch reference in tests to run on main (GH#641)

    • Make release notes updated check separate from unit tests (GH#642)

    • Update release branch naming instructions (GH#644)

Thanks to the following people for contributing to this release: @gsheni, @tamargrey, @thehomebrewnerd

v0.0.10 Feb 25, 2021#

  • Changes
    • Avoid calculating mutual information for non-unique columns (GH#563)

    • Preserve underlying DataFrame index if index column is not specified (GH#588)

    • Add blank issue template for creating issues (GH#630)

  • Testing Changes
    • Update branch reference in tests workflow (GH#552, GH#601)

    • Fixed text on back arrow on install page (GH#564)

    • Refactor test_datatable.py (GH#574)

Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @johnbridstrup, @tamargrey

v0.0.9 Feb 5, 2021#

  • Enhancements
    • Add Python 3.9 support without Koalas testing (GH#511)

    • Add get_valid_mi_types function to list LogicalTypes valid for mutual information calculation (GH#517)

  • Fixes
    • Handle missing values in Datetime columns when calculating mutual information (GH#516)

    • Support numpy 1.20.0 by restricting version for koalas and changing serialization error message (GH#532)

    • Move Koalas option setting to DataTable init instead of import (GH#543)

  • Documentation Changes
    • Add Alteryx OSS Twitter link (GH#519)

    • Update logo and add new favicon (GH#521)

    • Multiple improvements to Getting Started page and guides (GH#527)

    • Clean up API Reference and docstrings (GH#536)

    • Added Open Graph for Twitter and Facebook (GH#544)

Thanks to the following people for contributing to this release: @gsheni, @tamargrey, @thehomebrewnerd

v0.0.8 Jan 25, 2021#

  • Enhancements
    • Add DataTable.df property for accessing the underlying DataFrame (GH#470)

    • Set index of underlying DataFrame to match DataTable index (GH#464)

  • Fixes
    • Sort underlying series when sorting dataframe (GH#468)

    • Allow setting indices to current index without side effects (GH#474)

  • Changes
    • Fix release document with Github Actions link for CI (GH#462)

    • Don’t allow registered LogicalTypes with the same name (GH#477)

    • Move str_to_logical_type to TypeSystem class (GH#482)

    • Remove pyarrow from core dependencies (GH#508)

Thanks to the following people for contributing to this release: @gsheni, @tamargrey, @thehomebrewnerd

v0.0.7 Dec 14, 2020#

  • Enhancements
    • Allow for user-defined logical types and inference functions in TypeSystem object (GH#424)

    • Add __repr__ to DataTable (GH#425)

    • Allow initializing DataColumn with numpy array (GH#430)

    • Add drop to DataTable (GH#434)

    • Migrate CI tests to Github Actions (GH#417, GH#441, GH#451)

    • Add metadata to DataColumn for user-defined metadata (GH#447)

  • Fixes
    • Update DataColumn name when using setitem on column with no name (GH#426)

    • Don’t allow pickle serialization for Koalas DataFrames (GH#432)

    • Check DataTable metadata in equality check (GH#449)

    • Propagate all attributes of DataTable in _new_dt_including (GH#454)

  • Changes
    • Update links to use alteryx org Github URL (GH#423)

    • Support column names of any type allowed by the underlying DataFrame (GH#442)

    • Use object dtype for LatLong columns for easy access to latitude and longitude values (GH#414)

    • Restrict dask version to prevent 2020.12.0 release from being installed (GH#453)

    • Lower minimum requirement for numpy to 1.15.4, and set pandas minimum requirement to 1.1.1 (GH#459)

  • Testing Changes
    • Fix missing test coverage (GH#436)

Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @tamargrey, @thehomebrewnerd

v0.0.6 Nov 30, 2020#

  • Enhancements
    • Add support for creating DataTable from Koalas DataFrame (GH#327)

    • Add ability to initialize DataTable with numpy array (GH#367)

    • Add describe_dict method to DataTable (GH#405)

    • Add mutual_information_dict method to DataTable (GH#404)

    • Add metadata to DataTable for user-defined metadata (GH#392)

    • Add update_dataframe method to DataTable to update underlying DataFrame (GH#407)

    • Sort dataframe if time_index is specified; bypass sorting with the already_sorted parameter (GH#410)

    • Add description attribute to DataColumn (GH#416)

    • Implement DataColumn.__len__ and DataTable.__len__ (GH#415)

  • Fixes
    • Rename data_column.py to datacolumn.py (GH#386)

    • Rename data_table.py to datatable.py (GH#387)

    • Rename get_mutual_information to mutual_information (GH#390)

  • Changes
    • Lower moto test requirement for serialization/deserialization (GH#376)

    • Make Koalas an optional dependency installable with woodwork[koalas] (GH#378)

    • Remove WholeNumber LogicalType from Woodwork (GH#380)

    • Updates to LogicalTypes to support Koalas 1.4.0 (GH#393)

    • Replace set_logical_types and set_semantic_tags with just set_types (GH#379)

    • Remove copy_dataframe parameter from DataTable initialization (GH#398)

    • Implement DataTable.__sizeof__ to return size of the underlying dataframe (GH#401)

    • Include Datetime columns in mutual info calculation (GH#399)

    • Maintain column order on DataTable operations (GH#406)

  • Testing Changes
    • Add pyarrow, dask, and koalas to automated dependency checks (GH#388)

    • Use new version of pull request Github Action (GH#394)

    • Improve parameterization for test_datatable_equality (GH#409)

Thanks to the following people for contributing to this release: @ctduffy, @gsheni, @tamargrey, @thehomebrewnerd

Breaking Changes#

  • The DataTable.set_semantic_tags method was removed. DataTable.set_types can be used instead.

  • The DataTable.set_logical_types method was removed. DataTable.set_types can be used instead.

  • WholeNumber was removed from LogicalTypes. Columns that were previously inferred as WholeNumber will now be inferred as Integer.

  • The DataTable.get_mutual_information method was renamed to DataTable.mutual_information.

  • The copy_dataframe parameter was removed from DataTable initialization.

v0.0.5 Nov 11, 2020#

  • Enhancements
    • Add __eq__ to DataTable and DataColumn and update LogicalType equality (GH#318)

    • Add value_counts() method to DataTable (GH#342)

    • Support serialization and deserialization of DataTables via csv, pickle, or parquet (GH#293)

    • Add shape property to DataTable and DataColumn (GH#358)

    • Add iloc method to DataTable and DataColumn (GH#365)

    • Add numeric_categorical_threshold config value to allow inferring numeric columns as Categorical (GH#363)

    • Add rename method to DataTable (GH#367)

  • Fixes
    • Catch non numeric time index at validation (GH#332)

  • Changes
    • Support logical type inference from a Dask DataFrame (GH#248)

    • Fix validation checks and make_index to work with Dask DataFrames (GH#260)

    • Skip validation of Ordinal order values for Dask DataFrames (GH#270)

    • Improve support for datetimes with Dask input (GH#286)

    • Update DataTable.describe to work with Dask input (GH#296)

    • Update DataTable.get_mutual_information to work with Dask input (GH#300)

    • Modify to_pandas function to return DataFrame with correct index (GH#281)

    • Rename DataColumn.to_pandas method to DataColumn.to_series (GH#311)

    • Rename DataTable.to_pandas method to DataTable.to_dataframe (GH#319)

    • Remove UserWarning when no matching columns found (GH#325)

    • Remove copy parameter from DataTable.to_dataframe and DataColumn.to_series (GH#338)

    • Allow pandas ExtensionArrays as inputs to DataColumn (GH#343)

    • Move warnings to a separate exceptions file and call via UserWarning subclasses (GH#348)

    • Make Dask an optional dependency installable with woodwork[dask] (GH#357)

  • Documentation Changes
    • Create a guide for using Woodwork with Dask (GH#304)

    • Add conda install instructions (GH#305, GH#309)

    • Fix README.md badge with correct link (GH#314)

    • Simplify issue templates to make them easier to use (GH#339)

    • Remove extra output cell in Start notebook (GH#341)

  • Testing Changes
    • Parameterize numeric time index tests (GH#288)

    • Add DockerHub credentials to CI testing environment (GH#326)

    • Fix removing files for serialization test (GH#350)

Thanks to the following people for contributing to this release: @ctduffy, @gsheni, @tamargrey, @thehomebrewnerd

Breaking Changes#

  • The DataColumn.to_pandas method was renamed to DataColumn.to_series.

  • The DataTable.to_pandas method was renamed to DataTable.to_dataframe.

  • copy is no longer a parameter of DataTable.to_dataframe or DataColumn.to_series.

v0.0.4 Oct 21, 2020#

  • Enhancements
    • Add optional include parameter for DataTable.describe() to filter results (GH#228)

    • Add make_index parameter to DataTable.__init__ to enable optional creation of a new index column (GH#238)

    • Add support for setting ranking order on columns with Ordinal logical type (GH#240)

    • Add list_semantic_tags function and CLI to get dataframe of woodwork semantic_tags (GH#244)

    • Add support for numeric time index on DataTable (GH#267)

    • Add pop method to DataTable (GH#289)

    • Add entry point to setup.py to run CLI commands (GH#285)

  • Fixes
    • Allow numeric datetime time indices (GH#282)

  • Changes
    • Remove redundant methods DataTable.select_ltypes and DataTable.select_semantic_tags (GH#239)

    • Make results of get_mutual_information more clear by sorting and removing self calculation (GH#247)

    • Lower minimum scikit-learn version to 0.21.3 (GH#297)

  • Documentation Changes
    • Add guide for dt.describe and dt.get_mutual_information (GH#245)

    • Update README.md with documentation link (GH#261)

    • Add footer to doc pages with Alteryx Open Source (GH#258)

    • Add types and tags one-sentence definitions to Understanding Types and Tags guide (GH#271)

    • Add issue and pull request templates (GH#280, GH#284)

  • Testing Changes
    • Add automated process to check latest dependencies (GH#268)

    • Add test for setting a time index with specified string logical type (GH#279)

Thanks to the following people for contributing to this release: @ctduffy, @gsheni, @tamargrey, @thehomebrewnerd

v0.0.3 Oct 9, 2020#

  • Enhancements
    • Implement setitem on DataTable to create/overwrite an existing DataColumn (GH#165)

    • Add to_pandas method to DataColumn to access the underlying series (GH#169)

    • Add list_logical_types function and CLI to get dataframe of woodwork LogicalTypes (GH#172)

    • Add describe method to DataTable to generate statistics for the underlying data (GH#181)

    • Add optional return_dataframe parameter to load_retail to return either DataFrame or DataTable (GH#189)

    • Add get_mutual_information method to DataTable to generate mutual information between columns (GH#203)

    • Add read_csv function to create DataTable directly from CSV file (GH#222)

  • Fixes
    • Fix bug causing incorrect values for quartiles in DataTable.describe method (GH#187)

    • Fix bug in DataTable.describe that could cause an error if certain semantic tags were applied improperly (GH#190)

    • Fix bug with instantiated LogicalTypes breaking when used with issubclass (GH#231)

  • Changes
    • Remove unnecessary add_standard_tags attribute from DataTable (GH#171)

    • Remove standard tags from index column and do not return stats for index column from DataTable.describe (GH#196)

    • Update DataColumn.set_semantic_tags and DataColumn.add_semantic_tags to return new objects (GH#205)

    • Update various DataTable methods to return new objects rather than modifying in place (GH#210)

    • Move datetime_format to Datetime LogicalType (GH#216)

    • Do not calculate mutual info with index column in DataTable.get_mutual_information (GH#221)

    • Move setting of underlying physical types from DataTable to DataColumn (GH#233)

  • Documentation Changes
    • Remove unused code from sphinx conf.py, update with Github URL (GH#160, GH#163)

    • Update README and docs with new Woodwork logo, with better code snippets (GH#161, GH#159)

    • Add DataTable and DataColumn to API Reference (GH#162)

    • Add docstrings to LogicalType classes (GH#168)

    • Add Woodwork image to index, clear outputs of Jupyter notebook in docs (GH#173)

    • Update contributing.md, release.md with all instructions (GH#176)

    • Add section for setting index and time index to start notebook (GH#179)

    • Rename changelog to Release Notes (GH#193)

    • Add section for standard tags to start notebook (GH#188)

    • Add Understanding Types and Tags user guide (GH#201)

    • Add missing docstring to list_logical_types (GH#202)

    • Add Woodwork Global Configuration Options guide (GH#215)

  • Testing Changes
    • Add tests that confirm dtypes are as expected after DataTable init (GH#152)

    • Remove unused none_df test fixture (GH#224)

    • Add test for LogicalType.__str__ method (GH#225)

Thanks to the following people for contributing to this release: @gsheni, @tamargrey, @thehomebrewnerd

v0.0.2 Sep 28, 2020#

  • Fixes
    • Fix formatting issue when printing global config variables (GH#138)

  • Changes
    • Change add_standard_tags to use_standard_tags to better describe behavior (GH#149)

    • Change access of underlying dataframe to be through to_pandas with ._dataframe field on class (GH#146)

    • Remove replace_none parameter to DataTables (GH#146)

  • Documentation Changes
    • Add working code example to README and create Using Woodwork page (GH#103)

Thanks to the following people for contributing to this release: @gsheni, @tamargrey, @thehomebrewnerd

v0.1.0 Sep 24, 2020#

  • Add natural_language_threshold global config option used for Categorical/NaturalLanguage type inference (GH#135)

  • Add global config options and add datetime_format option for type inference (GH#134)

  • Fix bug with Integer and WholeNumber inference in column with pd.NA values (GH#133)

  • Add DataTable.ltypes property to return series of logical types (GH#131)

  • Add ability to create new datatable from specified columns with dt[[columns]] (GH#127)

  • Handle setting and tagging of index and time index columns (GH#125)

  • Add combined tag and ltype selection (GH#124)

  • Add changelog, and update changelog check to CI (GH#123)

  • Implement reset_semantic_tags (GH#118)

  • Implement DataTable getitem (GH#119)

  • Add remove_semantic_tags method (GH#117)

  • Add semantic tag selection (GH#106)

  • Add github action, rename to woodwork (GH#113)

  • Add license to setup.py (GH#112)

  • Reset semantic tags on logical type change (GH#107)

  • Add standard numeric and category tags (GH#100)

  • Change semantic_types to semantic_tags, a set of strings (GH#100)

  • Update dataframe dtypes based on logical types (GH#94)

  • Add select_logical_types to DataTable (GH#96)

  • Add pygments to dev-requirements.txt (GH#97)

  • Add replacing None with np.nan in DataTable init (GH#87)

  • Refactor DataColumn to make semantic_types and logical_type private (GH#86)

  • Add pandas_dtype to each Logical Type, and remove dtype attribute on DataColumn (GH#85)

  • Add set_semantic_types methods on both DataTable and DataColumn (GH#75)

  • Support passing camel case or snake case strings for setting logical types (GH#74)

  • Improve flexibility when setting semantic types (GH#72)

  • Add Whole Number Inference of Logical Types (GH#66)

  • Add dtypes property to DataTables and repr for DataColumn (GH#61)

  • Allow specification of semantic types during DataTable creation (GH#69)

  • Implements set_logical_types on DataTable (GH#65)

  • Add init files to tests to fix code coverage (GH#60)

  • Add AutoAssign bot (GH#59)

  • Add logical types validation in DataTables (GH#49)

  • Fix working_directory in CI (GH#57)

  • Add infer_logical_types for DataColumn (GH#45)

  • Fix README library name, and code coverage badge (GH#56)

  • Add code coverage (GH#51)

  • Improve and refactor the validation checks during initialization of a DataTable (GH#40)

  • Add dataframe attribute to DataTable (GH#39)

  • Update README with minor usage details (GH#37)

  • Add License (GH#34)

  • Rename from datatables to datatables (GH#4)

  • Add Logical Types, DataTable, DataColumn (GH#3)

  • Add Makefile, setup.py, requirements.txt (GH#2)

  • Initial Release (GH#1)

Thanks to the following people for contributing to this release: @gsheni, @tamargrey, @thehomebrewnerd