Dataset Definition
==================

A dataset definition in datamaestro combines declarative metadata with imperative
data processing logic. This page explains how to create your own dataset definitions.

Components of a Dataset
-----------------------

Every dataset definition includes:

1. **ID**: Unique identifier determined by module location and class/function name
2. **Meta-information**: Tags, tasks, URL, description
3. **Resources**: What files/data to fetch (defined as class attributes)
4. **Data access**: How to structure the data in Python

Class-based Datasets (Preferred)
================================

The preferred way to define datasets uses class-based definitions where
resources are declared as class attributes. The framework automatically
detects resources and builds a dependency DAG.

Basic Example
-------------

.. code-block:: python
    :caption: File: ``datamaestro_image/config/com/lecun.py``

    from datamaestro_image.data import ImageClassification, LabelledImages
    from datamaestro.data.tensor import IDX
    from datamaestro.download.single import FileDownloader
    from datamaestro.definitions import Dataset, dataset

    @dataset(url="http://yann.lecun.com/exdb/mnist/")
    class MNIST(Dataset):
        """The MNIST database of handwritten digits."""

        TRAIN_IMAGES = FileDownloader(
            "train_images.idx",
            "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz",
        )
        TRAIN_LABELS = FileDownloader(
            "train_labels.idx",
            "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
        )
        TEST_IMAGES = FileDownloader(
            "test_images.idx",
            "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz",
        )
        TEST_LABELS = FileDownloader(
            "test_labels.idx",
            "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
        )

        def config(self) -> ImageClassification:
            return ImageClassification.C(
                train=LabelledImages(
                    images=IDX(path=self.TRAIN_IMAGES.path),
                    labels=IDX(path=self.TRAIN_LABELS.path),
                ),
                test=LabelledImages(
                    images=IDX(path=self.TEST_IMAGES.path),
                    labels=IDX(path=self.TEST_LABELS.path),
                ),
            )

Advantages of class-based definitions:

1. **Explicit pipeline** --- dependencies between resources are visible
2. **Transient intermediaries** --- intermediate files can be deleted after processing
3. **Auto-naming** --- resource names are auto-detected from class attribute names
4. **Two-path safety** --- incomplete downloads never appear at the final path

Dataset Utilities
-----------------

``Dataset.data_path``
    A class property that returns the ``Path`` where this dataset's data is
    stored on disk. This is a shortcut for ``MyDataset.__dataset__.datapath``:

    .. code-block:: python

        @dataset(url="http://example.com")
        class Documents(Dataset):
            ...

        @dataset(url="http://example.com")
        class MyTask(Dataset):
            DOCS = reference(Documents)

            def config(self) -> TaskData:
                # Use Documents.data_path to locate sibling data
                store_path = Documents.data_path / "docstore"
                return TaskData.C(path=store_path)

Resource Pipelines
------------------

Resources can depend on other resources, forming a processing pipeline:

.. code-block:: python

    @dataset(url="http://example.com")
    class ProcessedDataset(MyData):
        # Raw download — deleted after processing completes
        RAW = FileDownloader(
            "raw.gz", "http://example.com/data.gz",
            transient=True,
        )
        # Processed output — kept permanently
        PROCESSED = MyProcessor.from_file(RAW)

        @classmethod
        def __create_dataset__(cls, dataset: AbstractDataset):
            return cls.C(path=cls.PROCESSED.path)

The ``transient=True`` flag tells the framework to delete intermediate data
once all downstream resources are COMPLETE.

Dataset ID Naming
-----------------

The dataset ID is derived from:

1. **Module path**: ``datamaestro_image.config.com.lecun`` → ``com.lecun``
2. **Class/function name**: CamelCase is converted to dot-separated lowercase
   (e.g., ``MNIST`` → ``mnist``, ``ProcessedMNIST`` → ``processed.mnist``,
   ``TrainTexttripleFull`` → ``train.texttriple.full``)
3. **Final ID**: ``com.lecun.mnist``

The convention follows reversed domain names (like Java packages):

- ``com.lecun.mnist`` for http://yann.lecun.com/exdb/mnist/
- ``org.trec.robust04`` for https://trec.nist.gov/ ROBUST04 track
- ``io.huggingface.squad`` for HuggingFace datasets

The ``@dataset`` Annotation
============================

The ``@dataset`` decorator is the main annotation for defining datasets.

.. autoclass:: datamaestro.definitions.dataset

Parameters
----------

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Parameter
     - Description
   * - ``base``
     - The base data type class (e.g., ``ImageClassification``). Can be inferred from the class hierarchy.
   * - ``id``
     - Override the automatic ID. Use a ``"."`` prefix to append to the module path (e.g., ``".8.topics"`` becomes ``module.path.8.topics``).
   * - ``url``
     - URL to the dataset's homepage.
   * - ``doi``
     - DOI of the associated paper.
   * - ``timestamp``
     - Version timestamp for evolving datasets.
   * - ``size``
     - Dataset size (for documentation).
   * - ``as_prepare`` (deprecated)
     - If True, the function receives the dataset object for manual resource handling.

ID Override Examples
--------------------

.. code-block:: python

    # Full ID override
    @dataset(MyType, id="org.example.custom")
    class IgnoredName(MyType):
        ...

    # Append suffix to module path (in module gov.nist.trec.adhoc)
    @dataset(MyType, id=".8.topics")  # Results in gov.nist.trec.adhoc.8.topics
    class Trec8Topics(MyType):
        ...

    # Single component suffix (in module com.example)
    @dataset(MyType, id=".v2")  # Results in com.example.v2
    class Original(MyType):
        ...

    # Empty string uses module path only
    @dataset(MyType, id="")  # Results in com.example (no class name)
    class Main(MyType):
        ...

Resources
=========

Resources are defined as class attributes on dataset classes. See the
:doc:`api/download` for the full resource API reference and all available
resource types.

Single Files
------------

Use :py:class:`~datamaestro.download.single.FileDownloader` for single file downloads:

.. code-block:: python

    from datamaestro.download.single import FileDownloader

    @dataset(url="http://example.com")
    class MyDataset(CSVData):
        DATA = FileDownloader("data.csv", "http://example.com/data.csv")

Archives
--------

Use :py:class:`~datamaestro.download.archive.ZipDownloader` or
:py:class:`~datamaestro.download.archive.TarDownloader` for archives:

.. code-block:: python

    from datamaestro.download.archive import ZipDownloader, TarDownloader

    @dataset(url="http://example.com")
    class ZippedDataset(MyData):
        DATA = ZipDownloader("data", "http://example.com/archive.zip")

    @dataset(url="http://example.com")
    class TarDataset(MyData):
        DATA = TarDownloader(
            "data", "http://example.com/archive.tar.gz",
            subpath="archive/subdir",
        )

HuggingFace Integration
-----------------------

.. code-block:: python

    from datamaestro.download.huggingface import HFDownloader

    @dataset(url="https://huggingface.co/datasets/squad")
    class Squad(QADataset):
        HF_DATA = HFDownloader("squad_data", "squad")

Referencing Other Datasets
--------------------------

Use :py:class:`~datamaestro.download.reference` to declare a dependency on
another dataset class. The referenced dataset is prepared automatically when
needed:

.. code-block:: python

    from datamaestro.download import reference

    @dataset(url="http://example.com")
    class MyTask(TaskData):
        DOCUMENTS = reference(DocumentsDataset)

        def config(self) -> TaskData:
            return TaskData.C(
                documents=self.DOCUMENTS.config(),
            )

Call ``.config()`` on the reference resource to obtain the referenced dataset's
configuration object (this is equivalent to ``.prepare()``).

.. note::

    ``reference()`` accepts the dataset class as its first positional argument.
    The older keyword form ``reference(reference=DocumentsDataset)`` still works
    but is no longer necessary.

Links to Other Datasets (by ID)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use :py:func:`~datamaestro.download.links.links` to reference datasets by their
string ID rather than by class:

.. code-block:: python

    from datamaestro.download.links import links

    @dataset(url="http://example.com")
    class ExtendedDataset(ExtendedData):
        BASE = links("base", "com.example.base_dataset")

Data Types
==========

Data types define the structure of returned data. They inherit from
:py:class:`datamaestro.data.Base` and use experimaestro's configuration system.
See the :doc:`api/data` for full reference.

Built-in Types
--------------

- :py:class:`datamaestro.data.Base` - Base class for all data types
- :py:class:`datamaestro.data.File` - Single file reference
- :py:class:`datamaestro.data.csv.Generic` - CSV file
- :py:class:`datamaestro.data.csv.Matrix` - CSV with numeric data
- :py:class:`datamaestro.data.tensor.IDX` - IDX tensor format (MNIST)
- :py:class:`datamaestro.data.ml.Supervised` - Supervised learning data

Custom Data Types
-----------------

Create custom data types by inheriting from :py:class:`~datamaestro.data.Base`.
Use ``Param`` from experimaestro to define typed parameters:

.. code-block:: python

    from experimaestro import Config, Param
    from datamaestro.data import Base

    class MyCustomData(Base):
        """My custom data type"""
        path: Param[Path]
        """Path to the data file"""

        num_classes: Param[int] = 10
        """Number of classes"""

        def load(self):
            """Load the data"""
            import pandas as pd
            return pd.read_csv(self.path)

Tags and Tasks
==============

Add semantic metadata with :py:func:`~datamaestro.definitions.datatags` and
:py:func:`~datamaestro.definitions.datatasks` decorators:

.. code-block:: python

    from datamaestro.definitions import dataset, datatags, datatasks

    @datatags("benchmark", "classification", "vision")
    @datatasks("image-classification", "digit-recognition")
    @dataset(url="http://example.com")
    class MNIST(ImageClassification):
        """Dataset description"""
        TRAIN_IMAGES = FileDownloader(...)
        ...

Tags and tasks are searchable via the CLI:

.. code-block:: bash

    datamaestro search tag:benchmark
    datamaestro search task:classification

Metadatasets
============

Use :py:class:`~datamaestro.definitions.metadataset` to share common metadata
across related datasets:

.. code-block:: python

    from datamaestro.definitions import metadataset, dataset

    @datatags("trec", "information-retrieval")
    @metadataset(IRDataset)
    class TRECBase:
        """Common base for TREC datasets"""
        pass

    @dataset(TRECBase, url="https://trec.nist.gov/...")
    class Robust04(IRDataset):
        ...

    @dataset(TRECBase, url="https://trec.nist.gov/...")
    class Robust05(IRDataset):
        ...

File Validation
===============

Use :py:class:`~datamaestro.utils.HashCheck` to validate downloaded files with checksums:

.. code-block:: python

    from datamaestro.download.single import FileDownloader
    from datamaestro.utils import HashCheck

    DATA = FileDownloader(
        "data.csv",
        "http://example.com/data.csv",
        checker=HashCheck("sha256", "abc123...")
    )

Testing Datasets
================

Test your dataset definitions:

.. code-block:: python

    def test_my_dataset():
        from datamaestro import prepare_dataset

        ds = prepare_dataset("com.example.mydataset")
        assert ds.train is not None
        assert ds.test is not None

Use ``pytest`` with the ``--datamaestro-download`` flag to actually download
during tests (otherwise downloads are skipped).


.. _deprecated-decorator-datasets:

Deprecated: Decorator-based Datasets
=====================================

.. deprecated::
   The decorator-based API still works but emits deprecation warnings.
   Migrate to the class-based approach described above.

The legacy approach uses function decorators to define datasets:

.. code-block:: python
    :caption: DEPRECATED — use class-based approach instead

    from datamaestro.download.single import filedownloader
    from datamaestro.definitions import dataset

    @filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
    @filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
    @filedownloader("test_images.idx", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")
    @filedownloader("test_labels.idx", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")
    @dataset(
        ImageClassification,
        url="http://yann.lecun.com/exdb/mnist/",
    )
    def MNIST(train_images, train_labels, test_images, test_labels):
        """The MNIST database"""
        return {
            "train": LabelledImages(
                images=IDX(path=train_images),
                labels=IDX(path=train_labels)
            ),
            "test": LabelledImages(
                images=IDX(path=test_images),
                labels=IDX(path=test_labels)
            ),
        }

In this legacy pattern:

- Download decorators are stacked above ``@dataset``
- File paths are passed as arguments to the dataset function
- The function returns a dict or data object

See :ref:`deprecated-download-decorators` for the full list of deprecated
download decorator patterns.