Releases: tensorflow/datasets
v4.3.0
API:
• Add `dataset.info.splits['train'].num_shards` to expose the number of shards to the user
• Add `tfds.features.Dataset` to have a field containing sub-datasets (e.g. used in RL datasets)
• Add `dtype` and `tf.uint16` support for `tfds.features.Video`
• Add `DatasetInfo.license` field to record redistribution information
• Better `tfds.benchmark(ds)` (compatible with any iterator, not just `tf.data`; better Colab representation)
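The new `num_shards` field surfaces how many shard files back a split. As a rough pure-Python sketch (the helper name is hypothetical and this is not the TFDS implementation), the count can also be read off the standard `-XXXXX-of-YYYYY` shard-filename suffix:

```python
import re

# Matches the trailing shard suffix of a TFDS-style shard filename,
# e.g. 'mnist-train.tfrecord-00003-of-00016'.
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})$")

def num_shards_from_filename(filename: str) -> int:
    """Return the total shard count encoded in a shard filename (illustrative)."""
    match = SHARD_RE.search(filename)
    if match is None:
        raise ValueError(f"Not a sharded filename: {filename!r}")
    return int(match.group(2))

print(num_shards_from_filename("mnist-train.tfrecord-00003-of-00016"))  # -> 16
```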
Other:
• Faster `tfds.as_numpy()` (avoids an extra `tf.Tensor` <> `np.array` copy)
• Better `tfds.as_dataframe` visualization (`Sequence`, ragged tensors, semantic masks with `use_colormap`)
• (experimental) Community dataset support, to allow dynamically importing datasets defined outside the TFDS repository
• (experimental) Add a Hugging Face compatibility wrapper to use Hugging Face datasets directly in TFDS
• (experimental) Riegeli format support
• (experimental) Add `DatasetInfo.disable_shuffling` to force examples to be read in generation order
• Add `.copy`, `.format` methods to `GPath` objects
• Many bug fixes
Testing:
• Supports custom `BuilderConfig` in `DatasetBuilderTest`
• `DatasetBuilderTest` now has a `dummy_data` class property which can be used in `setUpClass`
• Add `add_tfds_id` and cardinality support to `tfds.testing.mock_data`
And of course, many new datasets and dataset updates.
We would like to thank all the TFDS contributors!
v4.2.0
API:
- Add `tfds build` to the CLI. See documentation.
- `DownloadManager` now returns pathlib-like objects.
- Datasets returned by `tfds.as_numpy` are compatible with `len(ds)`.
- New `tfds.features.Dataset` to represent nested datasets.
- Add `tfds.ReadConfig(add_tfds_id=True)` to add a unique id to each example, `ex['tfds_id']` (e.g. `b'train.tfrecord-00012-of-01024__123'`).
- Add `num_parallel_calls` option to `tfds.ReadConfig` to overwrite the default `AUTOTUNE` option.
- `tfds.ImageFolder` now supports `tfds.decode.SkipDecoding`.
- Add multichannel audio support to `tfds.features.Audio`.
- Better `tfds.as_dataframe` visualization (ffmpeg video if installed, bounding boxes,...).
- Add `try_gcs` to `tfds.builder(..., try_gcs=True)`.
- Simpler `BuilderConfig` definition: class-level `VERSION` and `RELEASE_NOTES` are applied to all `BuilderConfig`s. Config description is now optional.
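The `tfds_id` mentioned above combines a shard name with the example's index inside that shard. A minimal, hypothetical sketch of how such an id could be assembled (the helper is illustrative, not TFDS internals):

```python
def make_tfds_id(split: str, shard_index: int, num_shards: int,
                 example_index: int) -> bytes:
    """Build an id like b'train.tfrecord-00012-of-01024__123' (illustrative only)."""
    shard = f"{split}.tfrecord-{shard_index:05d}-of-{num_shards:05d}"
    return f"{shard}__{example_index}".encode()

print(make_tfds_id("train", 12, 1024, 123))
# b'train.tfrecord-00012-of-01024__123'
```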
Breaking compatibility changes:
- Removed configs for all text datasets. Only the plain-text version is kept. For example: `multi_nli/plain_text` -> `multi_nli`.
- To guarantee determinism, new validations are performed on the keys when creating a dataset (to avoid filenames as keys, which are non-deterministic, and to restrict keys to `str`, `bytes` and `int`). New errors most likely indicate an issue in the dataset implementation.
- `tfds.core.benchmark` now returns a `pd.DataFrame` (instead of a `dict`).
- `tfds.units` is no longer visible from the public API.
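The key restriction above can be illustrated with a small, hypothetical validator (a sketch of the kind of check described, not the actual TFDS code):

```python
def validate_key(key):
    """Reject key types that can make generation order non-deterministic.

    Illustrative sketch: only str, bytes and int keys are accepted,
    mirroring the restriction described in the release note above.
    """
    if not isinstance(key, (str, bytes, int)):
        raise TypeError(
            f"Invalid key {key!r} of type {type(key).__name__}: "
            "keys must be str, bytes or int."
        )
    return key

validate_key("image_0001")  # ok
validate_key(42)            # ok
```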
Bug fixes:
- Support 0-length sequences with images of dynamic shape (fixes #2616).
- Progress bar now updates correctly when copying files.
- Many bug fixes (`GPath` consistency with `pathlib`, S3 compatibility, TQDM visual artifacts, GCS crash on Windows, re-download when checksums are updated,...).
- Better debugging and error messages (e.g. human-readable sizes,...).
- Allow `max_examples_per_splits=0` in `tfds build --max_examples_per_splits=0` to test `_split_generators` only (without `_generate_examples`).
And of course, many new datasets and dataset updates.
Thank you to the community for the many valuable contributions and for supporting us in this project!
v4.1.0
- When generating a dataset, if the download fails for any reason, it is now possible to manually download the data. See doc.
- Simplification of the dataset creation API:
  - We've made it easier to create datasets outside the TFDS repository (see our updated dataset creation guide).
  - `_split_generators` should now return `{'split_name': self._generate_examples(), ...}` (current datasets remain backward compatible).
  - All datasets inherit from `tfds.core.GeneratorBasedBuilder`. Converting a dataset to Beam now only requires changing `_generate_examples` (see example and doc).
  - `tfds.core.SplitGenerator` and `tfds.core.BeamBasedBuilder` are deprecated and will be removed in a future version.
- Better `pathlib.Path`, `os.PathLike` compatibility:
  - `dl_manager.manual_dir` now returns a pathlib-like object. Example:
    `text = (dl_manager.manual_dir / 'downloaded-text.txt').read_text()`
  - Note: other `dl_manager.download`, `.extract`,... calls will return pathlib-like objects in future versions.
  - `FeatureConnector`,... and most functions should accept `PathLike` objects. Let us know if some functions you need are missing.
  - Add `tfds.core.as_path` to create pathlib.Path-like objects compatible with GCS (e.g. `tfds.core.as_path('gs://my-bucket/labels.csv').read_text()`).
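Since `manual_dir` is pathlib-like, standard pathlib idioms apply. A self-contained sketch using a plain `pathlib.Path` in a temporary directory as a stand-in for `dl_manager.manual_dir`:

```python
import pathlib
import tempfile

# Stand-in for dl_manager.manual_dir (the real object is pathlib-like).
manual_dir = pathlib.Path(tempfile.mkdtemp())
(manual_dir / "downloaded-text.txt").write_text("hello tfds")

# Same idiom as the release-note example above.
text = (manual_dir / "downloaded-text.txt").read_text()
print(text)  # hello tfds
```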
- Other bug fixes and improvements, e.g.:
  - Add `verify_ssl=` option to `tfds.download.DownloadConfig` to disable SSL certificate validation during download.
  - `BuilderConfig`s are now compatible with Beam datasets (#2348).
  - `--record_checksums` now assumes the new dataset-as-folder model.
  - `tfds.features.Image` can accept encoded `bytes` images directly (useful when used with `img_name, img_bytes = dl_manager.iter_archive('images.zip')`).
  - The API docs now show deprecated methods; abstract methods to overwrite are now documented.
  - You can generate `imagenet2012` with only a single split (e.g. only the validation data). Other splits will be skipped if not present.
And of course, new datasets.
Thank you to all our contributors for improving TFDS!
v4.0.1
- Fix `tfds.load` when generation code isn't present.
- Improve GCS compatibility.
Thanks @carlthome for reporting and fixing the issue.
v4.0.0
API changes, new features:
- Dataset-as-folder: a dataset can now be a self-contained module in a folder with its checksums, dummy data,... This simplifies implementing datasets outside the TFDS repository.
- `tfds.load` can now load datasets without using the generation class. So `tfds.load('my_dataset:1.0.0')` can work even if `MyDataset.VERSION == '2.0.0'` (see #2493).
- Add a new TFDS CLI (see https://www.tensorflow.org/datasets/cli for details).
- `tfds.testing.mock_data` does not require metadata files anymore!
- Add `tfds.as_dataframe(ds, ds_info)` with custom visualization (example).
- Add `tfds.even_splits` to generate subsplits (e.g. `tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]`).
- Add new `DatasetBuilder.RELEASE_NOTES` property.
- `tfds.features.Image` now supports PNG with 4 channels.
- `tfds.ImageFolder` now supports custom shape and dtype.
- Downloaded URLs are available through `MyDataset.url_infos`.
- Add `skip_prefetch` option to `tfds.ReadConfig`.
- `as_supervised=True` support for `tfds.show_examples` and `tfds.as_dataframe`.
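The `tfds.even_splits` behavior can be sketched in pure Python (an illustration of the percent arithmetic, assuming boundaries are rounded to whole percents; this is not the TFDS internals):

```python
def even_splits(split: str, n: int):
    """Sketch: divide a split into n roughly-equal percent-based subsplits."""
    bounds = [round(i * 100 / n) for i in range(n + 1)]
    return [f"{split}[{lo}%:{hi}%]" for lo, hi in zip(bounds[:-1], bounds[1:])]

print(even_splits("train", 3))
# ['train[0%:33%]', 'train[33%:67%]', 'train[67%:100%]']
```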
Breaking compatibility changes:
- `tfds.as_numpy()` now returns an iterable which can be iterated over multiple times. To migrate: `next(ds)` -> `next(iter(ds))`.
- Rename `tfds.features.text.Xyz` -> `tfds.deprecated.text.Xyz`.
- Remove `DatasetBuilder.IN_DEVELOPMENT` property.
- Remove `tfds.core.disallow_positional_args` (use the Py3 `*,` syntax instead).
- `tfds.features` can now be saved/loaded; you may have to overwrite `FeatureConnector.from_json_content` and `FeatureConnector.to_json_content` to support this feature.
- Stop testing against TF 1.15. Requires Python 3.6.8+.
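The `next(ds)` -> `next(iter(ds))` migration is needed because the returned object is now an iterable, not an iterator. A toy stand-in makes the distinction concrete (the class is hypothetical, for illustration only):

```python
class ReiterableDataset:
    """Toy stand-in: can be iterated many times, but is not an iterator itself."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __iter__(self):
        # Each call starts a fresh pass over the examples.
        return iter(self._examples)

ds = ReiterableDataset([1, 2, 3])
first = next(iter(ds))  # new style: works
again = next(iter(ds))  # a fresh pass each time, so also the first example
# next(ds) would raise TypeError, because ds is not an iterator.
```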
Other bug fixes:
- Better archive extension detection for `dl_manager.download_and_extract`
- Fix `tfds.__version__` in TFDS nightly to be PEP 440 compliant
- Fix crash when GCS is not available
- Script to detect dead URLs
- Improved open-source workflow, contributor guide, documentation
- Many other internal cleanups, bug fixes, dead-code removal, py2->py3 cleanup, pytype annotations,...
And of course, new datasets and dataset updates.
A gigantic thanks to our community, which has helped us debug issues and implement many features, especially vijayphoenix@ for being a major contributor.
v3.2.1
- Fix an issue with GCS on Windows.
v3.2.0
Future breaking change:
- The `tfds.features.text` encoding API is deprecated. Please use tensorflow_text instead.
New features
API:
- Add `tfds.ImageFolder` and `tfds.TranslateFolder` to easily create custom datasets from your own data.
- Add `tfds.ReadConfig(input_context=)` to shard datasets, for better multi-worker compatibility (#1426).
- The default `data_dir` can be controlled by the `TFDS_DATA_DIR` environment variable.
- Better usability when developing datasets outside TFDS:
  - Downloads are always cached.
  - Checksums are optional.
- Add `tfds.show_statistics(ds_info)` to display the FACETS OVERVIEW. Note: this requires the dataset to have been generated with the statistics.
- Open-source various scripts to help deployment/documentation (generate catalog documentation, export all metadata files,...).
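The `TFDS_DATA_DIR` lookup can be sketched as follows (a hypothetical helper illustrating the precedence: explicit argument, then environment variable, then a default; the default path shown is an assumption):

```python
import os
import pathlib

def resolve_data_dir(explicit_dir=None):
    """Sketch of the lookup order: argument > TFDS_DATA_DIR env var > default."""
    if explicit_dir is not None:
        return pathlib.Path(explicit_dir)
    env_dir = os.environ.get("TFDS_DATA_DIR")
    if env_dir:
        return pathlib.Path(env_dir)
    # Illustrative default, not necessarily the exact one TFDS uses.
    return pathlib.Path.home() / "tensorflow_datasets"

os.environ["TFDS_DATA_DIR"] = "/tmp/my_tfds_data"
print(resolve_data_dir())  # /tmp/my_tfds_data
```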
Documentation:
- Catalog displays images (example).
- Catalog shows which datasets have been recently added and are only available in `tfds-nightly` (nights_stay).
Breaking compatibility change:
- Fix deterministic example order on Windows when a path was used as key (this only impacts a few datasets). Example order should now be the same on all platforms.
- Remove `tfds.load('image_label_folder')` in favor of the more user-friendly `tfds.ImageFolder`.
Other:
- Various performance improvements for both generation and reading (e.g. use `__slots__`, fix a parallelization bug in `tf.data.TFRecordReader`,...).
- Various fixes (typos, type annotations, better error messages, dead links, better Windows compatibility,...).
Thanks to all our contributors who help improve the state of datasets for the entire research community!
v3.1.0
Breaking compatibility changes:
- Rename `tfds.core.NamedSplit`, `tfds.core.SplitBase` -> `tfds.Split`. Now `tfds.Split.TRAIN`,... are instances of `tfds.Split`.
- Remove the deprecated `num_shards` argument from `tfds.core.SplitGenerator`. This argument was ignored, as shards are computed automatically.
Future breaking compatibility changes:
- Rename `interleave_parallel_reads` -> `interleave_cycle_length` for `tfds.ReadConfig`.
- Invert the `ds`, `ds_info` argument order for `tfds.show_examples`.
- The `tfds.features.text` encoding API is deprecated. Please use `tensorflow_text` instead.
Other changes:
- Testing: add support for custom decoders in `tfds.testing.mock_data`.
- Documentation: show which datasets are only present in `tfds-nightly`.
- Documentation: display images for supported datasets.
- API: add `tfds.builder_cls(name)` to access a `DatasetBuilder` class by name.
- API: add `info.splits['train'].filenames` for access to the tf-record files.
- API: add `tfds.core.add_data_dir` to register an additional data dir.
- Remove most `ds.with_options` calls which were applied by TFDS. Now use the `tf.data` defaults.
- Other bug fixes and improvements (better error messages, Windows compatibility,...).
Thank you all for your contributions, and helping us make TFDS better for everyone!
v3.0.0
Breaking changes:
- Legacy mode `tfds.experiment.S3` has been removed.
- New `image_classification` section. Some datasets have been moved there from `images`.
- The `in_memory` argument has been removed from `as_dataset`/`tfds.load` (small datasets are now auto-cached).
- `DownloadConfig` does not append the dataset name anymore (manual data should be in `<manual_dir>/` instead of `<manual_dir>/<dataset_name>/`).
- Tests now check that all `dl_manager.download` urls have registered checksums. To opt out, add `SKIP_CHECKSUMS = True` to your `DatasetBuilderTestCase`.
- `tfds.load` now always returns `tf.compat.v2.Dataset`. If you're still using `tf.compat.v1`:
  - Use `tf.compat.v1.data.make_one_shot_iterator(ds)` rather than `ds.make_one_shot_iterator()`.
  - Use `isinstance(ds, tf.compat.v2.Dataset)` instead of `isinstance(ds, tf.data.Dataset)`.
- `tfds.Split.ALL` has been removed from the API.
Future breaking change:
- The `tfds.features.text` encoding API is deprecated. Please use tensorflow_text instead.
- The `num_shards` argument of `tfds.core.SplitGenerator` is currently ignored and will be removed in the next version.
Features:
- `DownloadManager` is now picklable (can be used inside Beam pipelines).
- `tfds.features.Audio`:
  - Supports float as returned value.
  - Exposes the sample rate through `info.features['audio'].sample_rate`.
  - Supports encoding audio features from file objects.
- Various bug fixes, better error messages, documentation improvements
- More datasets
Thank you to all our contributors for helping us make TFDS better for everyone!
v2.1.0
New features:
- Datasets expose `info.dataset_size` and `info.download_size`. All datasets generated with 2.1.0 cannot be loaded with previous versions (previous datasets can still be read with 2.1.0, however).
- Auto-caching of small datasets. The `in_memory` argument is deprecated and will be removed in a future version.
- Datasets expose their cardinality: `num_examples = tf.data.experimental.cardinality(ds)` (requires tf-nightly or TF >= 2.2.0).
- Get the number of examples in a sub-split with: `info.splits['train[70%:]'].num_examples`.
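The sub-split count above can be sketched with simple percent arithmetic (a hypothetical helper; TFDS's exact boundary rounding may differ):

```python
def subsplit_num_examples(total: int, start_pct: int, end_pct: int) -> int:
    """Sketch of how many examples fall in a percent slice like 'train[70%:]'.

    Each percent boundary is rounded to the nearest example index;
    the exact rounding used by TFDS may differ.
    """
    start = round(total * start_pct / 100)
    end = round(total * end_pct / 100)
    return end - start

print(subsplit_num_examples(1000, 70, 100))  # 300
```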