Dataset Definition

A dataset definition in datamaestro combines declarative metadata with imperative data processing logic. This page explains how to create your own dataset definitions.

Components of a Dataset

Every dataset definition includes:

  1. ID: Unique identifier determined by module location and class/function name

  2. Meta-information: Tags, tasks, URL, description

  3. Resources: What files/data to fetch (defined as class attributes)

  4. Data access: How to structure the data in Python

Class-based Datasets (Preferred)

The preferred way to define datasets uses class-based definitions where resources are declared as class attributes. The framework automatically detects resources and builds a dependency DAG.
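
To make the idea of detecting resources from class attributes concrete, here is a minimal standalone sketch in plain Python. It does not use datamaestro: the `Resource` class and the `resource_order` helper are hypothetical names invented for this illustration, and the real framework may work differently. The sketch relies on the standard `__set_name__` hook to learn each attribute name and on `graphlib` to order the dependency DAG.

```python
from graphlib import TopologicalSorter


class Resource:
    """Toy resource (hypothetical) that learns its name from its class attribute."""

    def __init__(self, *dependencies: "Resource"):
        self.name = None
        self.dependencies = dependencies

    def __set_name__(self, owner, name):
        # Python calls this hook for each class attribute when the class is created
        self.name = name


def resource_order(cls) -> list[str]:
    """Return resource names so that every dependency precedes its dependents."""
    resources = [v for v in vars(cls).values() if isinstance(v, Resource)]
    graph = {r.name: {d.name for d in r.dependencies} for r in resources}
    return list(TopologicalSorter(graph).static_order())


class Example:
    RAW = Resource()
    UNPACKED = Resource(RAW)
    INDEX = Resource(UNPACKED)


print(resource_order(Example))  # ['RAW', 'UNPACKED', 'INDEX']
```

Because each resource receives its dependencies as constructor arguments, the graph can be read directly from the class body without any extra registration step.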

Basic Example

# File: datamaestro_image/config/com/lecun.py
from datamaestro_image.data import ImageClassification, LabelledImages
from datamaestro.data.tensor import IDX
from datamaestro.download.single import FileDownloader
from datamaestro.definitions import Dataset, dataset

@dataset(url="http://yann.lecun.com/exdb/mnist/")
class MNIST(Dataset):
    """The MNIST database of handwritten digits."""

    TRAIN_IMAGES = FileDownloader(
        "train_images.idx",
        "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz",
    )
    TRAIN_LABELS = FileDownloader(
        "train_labels.idx",
        "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    )
    TEST_IMAGES = FileDownloader(
        "test_images.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz",
    )
    TEST_LABELS = FileDownloader(
        "test_labels.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    )

    def config(self) -> ImageClassification:
        return ImageClassification.C(
            train=LabelledImages(
                images=IDX(path=self.TRAIN_IMAGES.path),
                labels=IDX(path=self.TRAIN_LABELS.path),
            ),
            test=LabelledImages(
                images=IDX(path=self.TEST_IMAGES.path),
                labels=IDX(path=self.TEST_LABELS.path),
            ),
        )

Advantages of class-based definitions:

  1. Explicit pipeline — dependencies between resources are visible

  2. Transient intermediaries — intermediate files can be deleted after processing

  3. Auto-naming — resource names are auto-detected from class attribute names

  4. Two-path safety — incomplete downloads never appear at the final path
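
The two-path idea in the last item can be illustrated with the standard library alone. The sketch below is an assumption about the general pattern, not datamaestro's implementation: the function name `write_via_partial` and the ".partial" suffix are hypothetical. Data is written to a temporary sibling path and only renamed to the final path once the transfer has finished.

```python
import os
from pathlib import Path
from typing import Iterable


def write_via_partial(final_path: Path, chunks: Iterable[bytes]) -> Path:
    """Write chunks to a sibling ".partial" file, then rename it into place."""
    partial_path = final_path.with_name(final_path.name + ".partial")
    try:
        with partial_path.open("wb") as out:
            for chunk in chunks:
                out.write(chunk)
        # os.replace is atomic when both paths are on the same filesystem
        os.replace(partial_path, final_path)
    except BaseException:
        # An interrupted transfer leaves nothing behind at either path
        partial_path.unlink(missing_ok=True)
        raise
    return final_path
```

With this pattern, code that checks whether the final path exists never observes a half-written file.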

Dataset Utilities

Dataset.data_path

A class property that returns the Path where this dataset’s data is stored on disk. This is a shortcut for MyDataset.__dataset__.datapath:

from datamaestro.definitions import Dataset, dataset
from datamaestro.download import reference

@dataset(url="http://example.com")
class Documents(Dataset):
    ...

@dataset(url="http://example.com")
class MyTask(Dataset):
    DOCS = reference(Documents)

    def config(self) -> TaskData:
        # Use Documents.data_path to locate sibling data
        store_path = Documents.data_path / "docstore"
        return TaskData.C(path=store_path)

Resource Pipelines

Resources can depend on other resources, forming a processing pipeline:

@dataset(url="http://example.com")
class ProcessedDataset(MyData):
    # Raw download — deleted after processing completes
    RAW = FileDownloader(
        "raw.gz", "http://example.com/data.gz",
        transient=True,
    )
    # Processed output — kept permanently
    PROCESSED = MyProcessor.from_file(RAW)

    @classmethod
    def __create_dataset__(cls, dataset: AbstractDataset):
        return cls.C(path=cls.PROCESSED.path)

The transient=True flag tells the framework to delete intermediate data once all downstream resources are COMPLETE.
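
The cleanup rule can be modelled with a few lines of standalone Python. This is a toy model under stated assumptions, not the framework's code: `Node`, `State`, the `PENDING` state and `deletable` are hypothetical names; only the COMPLETE state and the transient flag come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"  # hypothetical name for "not built yet"
    COMPLETE = "complete"


@dataclass
class Node:
    name: str
    transient: bool = False
    state: State = State.PENDING
    dependents: list["Node"] = field(default_factory=list)


def deletable(nodes: list[Node]) -> list[str]:
    """Names of transient nodes whose downstream resources are all COMPLETE."""
    return [
        n.name
        for n in nodes
        if n.transient
        and n.state is State.COMPLETE
        and n.dependents
        and all(d.state is State.COMPLETE for d in n.dependents)
    ]


processed = Node("PROCESSED")
raw = Node("RAW", transient=True, state=State.COMPLETE, dependents=[processed])

print(deletable([raw, processed]))  # [] because PROCESSED is not built yet
processed.state = State.COMPLETE
print(deletable([raw, processed]))  # ['RAW']
```

The raw download is kept for as long as any resource that consumes it is unfinished, so an interrupted run can resume without downloading again.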

Dataset ID Naming

The dataset ID is derived from:

  1. Module path: datamaestro_image.config.com.lecun → com.lecun

  2. Class/function name: CamelCase is converted to dot-separated lowercase (e.g., MNIST → mnist, ProcessedMNIST → processed.mnist, TrainTexttripleFull → train.texttriple.full)

  3. Final ID: com.lecun.mnist
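
The derivation above can be approximated in a few lines. This is a sketch that reproduces the listed examples, not datamaestro's actual code: the helper names `class_to_id` and `dataset_id` are hypothetical, and stripping everything up to the "config" package is an assumption based on the module path shown in step 1.

```python
import re

# Runs of capitals (acronyms) stay together; otherwise a word starts at each capital
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*|[a-z0-9]+")


def class_to_id(name: str) -> str:
    """Convert a CamelCase class name to dot-separated lowercase."""
    return ".".join(word.lower() for word in _WORD.findall(name))


def dataset_id(module: str, class_name: str) -> str:
    """Join the module path after ".config." with the converted class name."""
    # Assumption: the repository prefix ends with a "config" package
    _, _, module_path = module.partition(".config.")
    return f"{module_path}.{class_to_id(class_name)}"


print(class_to_id("ProcessedMNIST"))  # processed.mnist
print(dataset_id("datamaestro_image.config.com.lecun", "MNIST"))  # com.lecun.mnist
```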

The convention follows reversed domain names (like Java packages); for example, datasets from lecun.com live under com.lecun.

The @dataset Annotation

The @dataset decorator is the main annotation for defining datasets.

class datamaestro.definitions.dataset(base=None, *, timestamp: str | None = None, id: None | str = None, url: None | str = None, size: None | int | str = None, doi: None | str = None, as_prepare: bool = False, variants: Variants | type | None = None)

Dataset decorator

Meta-datasets are not associated with any base type.

Parameters:
  • base – The base type (or None if inferred from type annotation).

  • timestamp – If the dataset evolves, specify its timestamp.

  • id – Dataset ID override. Behavior depends on format:

      - Full ID (e.g., “com.example.data”): used as-is if it has 3+ components

      - Suffix with dot prefix (e.g., “.8.topics”): appended to module path

      - Single component (e.g., “mnist”): replaces the class name in the path

  • url – The URL associated with the dataset.

  • size – The size of the dataset (should be a parsable format).

  • doi – The DOI of the corresponding paper.

  • as_prepare – Resources are set up within the method itself.

  • variants – Optional Variants instance (or subclass) that declares the variant space. When provided, callers select a specific variant via a query-style suffix on the id (e.g. "pkg.id[name=x,streaming=true]"); see datamaestro.variants.
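
To show the shape of the query-style suffix, the following standalone sketch splits such an id into its base and its key/value options. It is only an illustration: `split_variant_id` is a hypothetical helper, values are kept as plain strings, and the parser in datamaestro.variants may accept a richer syntax.

```python
import re

# A base id optionally followed by one bracketed, comma-separated option list
_VARIANT_ID = re.compile(r"^(?P<base>[^\[\]]+)(?:\[(?P<query>[^\[\]]*)\])?$")


def split_variant_id(dataset_id: str) -> tuple[str, dict[str, str]]:
    """Split "pkg.id[a=1,b=2]" into ("pkg.id", {"a": "1", "b": "2"})."""
    match = _VARIANT_ID.match(dataset_id)
    if match is None:
        raise ValueError(f"malformed dataset id: {dataset_id!r}")
    query = match.group("query") or ""
    options = {}
    for item in filter(None, query.split(",")):
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        options[key.strip()] = value.strip()
    return match.group("base"), options


print(split_variant_id("pkg.id[name=x,streaming=true]"))
```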

Parameters

Parameter                Description
base                     The base data type class (e.g., ImageClassification). Can be inferred from the class hierarchy.
id                       Override the automatic ID. Use a "." prefix to append to the module path (e.g., ".8.topics" becomes module.path.8.topics).
url                      URL to the dataset’s homepage.
doi                      DOI of the associated paper.
timestamp                Version timestamp for evolving datasets.
size                     Dataset size (for documentation).
as_prepare (deprecated)  If True, the function receives the dataset object for manual resource handling.

ID Override Examples

# Full ID override
@dataset(MyType, id="org.example.custom")
class IgnoredName(MyType):
    ...

# Append suffix to module path (in module gov.nist.trec.adhoc)
@dataset(MyType, id=".8.topics")  # Results in gov.nist.trec.adhoc.8.topics
class Trec8Topics(MyType):
    ...

# Single component suffix (in module com.example)
@dataset(MyType, id=".v2")  # Results in com.example.v2
class Original(MyType):
    ...

# Empty string uses module path only
@dataset(MyType, id="")  # Results in com.example (no class name)
class Main(MyType):
    ...
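
The override rules illustrated above can be summarised as a small decision function. This is a sketch of the documented behavior, not the library's implementation: `resolve_id` is a hypothetical helper, and rejecting a two-component override without a leading dot is an assumption, since the rules do not cover that case.

```python
def resolve_id(module_path: str, class_id: str, override=None) -> str:
    """Apply the id override rules to a module path and derived class id."""
    if override is None:
        return f"{module_path}.{class_id}"  # automatic ID
    if override == "":
        return module_path  # module path only
    if override.startswith("."):
        return module_path + override  # suffix appended to the module path
    parts = override.split(".")
    if len(parts) == 1:
        return f"{module_path}.{override}"  # replaces the class name
    if len(parts) >= 3:
        return override  # full ID, used as-is
    # Two components without a leading dot are not covered by the rules above
    raise ValueError(f"ambiguous id override: {override!r}")


print(resolve_id("gov.nist.trec.adhoc", "trec8.topics", ".8.topics"))
```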

Resources

Resources are defined as class attributes on dataset classes. See Download Resources for the full resource API reference and all available resource types.

Single Files

Use FileDownloader for single file downloads:

from datamaestro.download.single import FileDownloader

@dataset(url="http://example.com")
class MyDataset(CSVData):
    DATA = FileDownloader("data.csv", "http://example.com/data.csv")

Archives

Use ZipDownloader or TarDownloader for archives:

from datamaestro.download.archive import ZipDownloader, TarDownloader

@dataset(url="http://example.com")
class ZippedDataset(MyData):
    DATA = ZipDownloader("data", "http://example.com/archive.zip")

@dataset(url="http://example.com")
class TarDataset(MyData):
    DATA = TarDownloader(
        "data", "http://example.com/archive.tar.gz",
        subpath="archive/subdir",
    )

HuggingFace Integration

from datamaestro.download.huggingface import HFDownloader

@dataset(url="https://huggingface.co/datasets/squad")
class Squad(QADataset):
    HF_DATA = HFDownloader("squad_data", "squad")

Referencing Other Datasets

Use reference to declare a dependency on another dataset class. The referenced dataset is prepared automatically when needed:

from datamaestro.download import reference

@dataset(url="http://example.com")
class MyTask(TaskData):
    DOCUMENTS = reference(DocumentsDataset)

    def config(self) -> TaskData:
        return TaskData.C(
            documents=self.DOCUMENTS.config(),
        )

Call .config() on the reference resource to obtain the referenced dataset’s configuration object (this is equivalent to .prepare()).

Note

reference() accepts the dataset class as its first positional argument. The older keyword form reference(reference=DocumentsDataset) still works but is no longer necessary.

Data Types

Data types define the structure of returned data. They inherit from datamaestro.data.Base and use experimaestro’s configuration system. See Data Types for the full reference.

Built-in Types

Custom Data Types

Create custom data types by inheriting from Base. Use Param from experimaestro to define typed parameters:

from pathlib import Path

from experimaestro import Param
from datamaestro.data import Base

class MyCustomData(Base):
    """My custom data type"""
    path: Param[Path]
    """Path to the data file"""

    num_classes: Param[int] = 10
    """Number of classes"""

    def load(self):
        """Load the data"""
        import pandas as pd
        return pd.read_csv(self.path)

Tags and Tasks

Add semantic metadata with datatags() and datatasks() decorators:

from datamaestro.definitions import dataset, datatags, datatasks

@datatags("benchmark", "classification", "vision")
@datatasks("image-classification", "digit-recognition")
@dataset(url="http://example.com")
class MNIST(ImageClassification):
    """Dataset description"""
    TRAIN_IMAGES = FileDownloader(...)
    ...

Tags and tasks are searchable via the CLI:

datamaestro search tag:benchmark
datamaestro search task:classification

Metadatasets

Use metadataset to share common metadata across related datasets:

from datamaestro.definitions import metadataset, dataset, datatags

@datatags("trec", "information-retrieval")
@metadataset(IRDataset)
class TRECBase:
    """Common base for TREC datasets"""
    pass

@dataset(TRECBase, url="https://trec.nist.gov/...")
class Robust04(IRDataset):
    ...

@dataset(TRECBase, url="https://trec.nist.gov/...")
class Robust05(IRDataset):
    ...

File Validation

Use HashCheck to validate downloaded files with checksums:

from datamaestro.download.single import FileDownloader
from datamaestro.utils import HashCheck

DATA = FileDownloader(
    "data.csv",
    "http://example.com/data.csv",
    checker=HashCheck("sha256", "abc123...")
)
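
For readers unfamiliar with checksum validation, the check itself amounts to hashing the downloaded bytes and comparing the result with the expected digest. The sketch below uses only `hashlib`; `file_digest` and `verify` are hypothetical helpers written for this page and say nothing about how HashCheck is implemented internally.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large downloads never sit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as stream:
        while chunk := stream.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str, algorithm: str = "sha256") -> None:
    """Raise if the file's digest differs from the expected hex string."""
    actual = file_digest(path, algorithm)
    if actual != expected.lower():
        raise IOError(f"{algorithm} mismatch for {path}: got {actual}")
```

Publishing the expected digest alongside the URL lets a corrupted or silently changed upstream file be detected before any processing starts.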

Testing Datasets

Test your dataset definitions:

def test_my_dataset():
    from datamaestro import prepare_dataset

    ds = prepare_dataset("com.example.mydataset")
    assert ds.train is not None
    assert ds.test is not None

Use pytest with the --datamaestro-download flag to actually download during tests (otherwise downloads are skipped).

Deprecated: Decorator-based Datasets

Deprecated: The decorator-based API still works but emits deprecation warnings. Migrate to the class-based approach described above.

The legacy approach uses function decorators to define datasets:

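# DEPRECATED: use the class-based approach instead
from datamaestro_image.data import ImageClassification, LabelledImages
from datamaestro.data.tensor import IDX
from datamaestro.download.single import filedownloader
from datamaestro.definitions import dataset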

@filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
@filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
@filedownloader("test_images.idx", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")
@filedownloader("test_labels.idx", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")
@dataset(
    ImageClassification,
    url="http://yann.lecun.com/exdb/mnist/",
)
def MNIST(train_images, train_labels, test_images, test_labels):
    """The MNIST database"""
    return {
        "train": LabelledImages(
            images=IDX(path=train_images),
            labels=IDX(path=train_labels)
        ),
        "test": LabelledImages(
            images=IDX(path=test_images),
            labels=IDX(path=test_labels)
        ),
    }

In this legacy pattern:

  • Download decorators are stacked above @dataset

  • File paths are passed as arguments to the dataset function

  • The function returns a dict or data object

See Deprecated: Download Decorators for the full list of deprecated download decorator patterns.