Dataset Definition

A dataset definition in datamaestro combines declarative metadata with imperative data processing logic. This page explains how to create your own dataset definitions.

Components of a Dataset

Every dataset definition includes:

ID: Unique identifier determined by module location and class/function name
Meta-information: Tags, tasks, URL, description
Resources: What files/data to fetch (defined as class attributes)
Data access: How to structure the data in Python

Class-based Datasets (Preferred)

The preferred way to define datasets uses class-based definitions where resources are declared as class attributes. The framework automatically detects resources and builds a dependency DAG.
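As a rough illustration of how resources declared as class attributes can be discovered and named, the sketch below relies on Python's `__set_name__` hook. The `Resource` and `DatasetSketch` classes are hypothetical stand-ins written for this example; they are not datamaestro's implementation.

```python
class Resource:
    """Hypothetical stand-in for a downloadable resource (not the datamaestro class)."""

    def __init__(self, url: str):
        self.url = url
        self.name = None  # filled in automatically by __set_name__

    def __set_name__(self, owner, attr_name):
        # Python calls this hook while the owning class is being created,
        # so the resource learns the attribute name it was bound to.
        self.name = attr_name


class DatasetSketch:
    """Hypothetical base class that collects Resource attributes."""

    @classmethod
    def resources(cls) -> dict:
        # Scan the class's own attributes and keep only Resource instances
        return {
            name: value
            for name, value in vars(cls).items()
            if isinstance(value, Resource)
        }


class Example(DatasetSketch):
    TRAIN = Resource("http://example.com/train.gz")
    TEST = Resource("http://example.com/test.gz")


detected = Example.resources()
print(sorted(detected))  # ['TEST', 'TRAIN']
```

The point of the sketch is only that no explicit registration call is needed: binding a resource to a class attribute is enough for it to be found and named.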

Basic Example

File: datamaestro_image/config/com/lecun.py

from datamaestro_image.data import ImageClassification, LabelledImages
from datamaestro.data.tensor import IDX
from datamaestro.download.single import FileDownloader
from datamaestro.definitions import Dataset, dataset

@dataset(url="http://yann.lecun.com/exdb/mnist/")
class MNIST(Dataset):
    """The MNIST database of handwritten digits."""

    TRAIN_IMAGES = FileDownloader(
        "train_images.idx",
        "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz",
    )
    TRAIN_LABELS = FileDownloader(
        "train_labels.idx",
        "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    )
    TEST_IMAGES = FileDownloader(
        "test_images.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz",
    )
    TEST_LABELS = FileDownloader(
        "test_labels.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    )

    def config(self) -> ImageClassification:
        return ImageClassification.C(
            train=LabelledImages(
                images=IDX(path=self.TRAIN_IMAGES.path),
                labels=IDX(path=self.TRAIN_LABELS.path),
            ),
            test=LabelledImages(
                images=IDX(path=self.TEST_IMAGES.path),
                labels=IDX(path=self.TEST_LABELS.path),
            ),
        )

Advantages of class-based definitions:

Explicit pipeline — dependencies between resources are visible
Transient intermediaries — intermediate files can be deleted after processing
Auto-naming — resource names are auto-detected from class attribute names
Two-path safety — incomplete downloads never appear at the final path
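The two-path idea can be sketched with the standard library alone: data is written to a temporary sibling path and only moved to the final path once the write has succeeded. The helper name `write_atomically` and the `.partial` suffix are assumptions made for this example, not names taken from datamaestro.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(final_path: Path, chunks) -> None:
    """Write chunks to a temporary sibling file, then move it into place."""
    tmp_path = final_path.with_name(final_path.name + ".partial")
    try:
        with tmp_path.open("wb") as out:
            for chunk in chunks:
                out.write(chunk)
        # os.replace is atomic when both paths are on the same filesystem
        os.replace(tmp_path, final_path)
    except BaseException:
        # Remove the partial file; the final path is never created
        tmp_path.unlink(missing_ok=True)
        raise


def failing_chunks():
    """Simulate a download that breaks after the first chunk."""
    yield b"partial"
    raise ConnectionError("simulated network failure")


workdir = Path(tempfile.mkdtemp())
ok_path = workdir / "ok.bin"
bad_path = workdir / "bad.bin"

write_atomically(ok_path, [b"hello ", b"world"])

try:
    write_atomically(bad_path, failing_chunks())
except ConnectionError:
    pass

print(ok_path.read_bytes())  # b'hello world'
print(bad_path.exists())     # False
```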

Dataset Utilities

Dataset.data_path

A class property that returns the Path where this dataset’s data is stored on disk. This is a shortcut for MyDataset.__dataset__.datapath:

from datamaestro.definitions import Dataset, dataset
from datamaestro.download import reference

@dataset(url="http://example.com")
class Documents(Dataset):
    ...

@dataset(url="http://example.com")
class MyTask(Dataset):
    DOCS = reference(Documents)

    def config(self) -> TaskData:
        # Use Documents.data_path to locate sibling data
        store_path = Documents.data_path / "docstore"
        return TaskData.C(path=store_path)

Resource Pipelines

Resources can depend on other resources, forming a processing pipeline:

@dataset(url="http://example.com")
class ProcessedDataset(MyData):
    # Raw download — deleted after processing completes
    RAW = FileDownloader(
        "raw.gz", "http://example.com/data.gz",
        transient=True,
    )
    # Processed output — kept permanently
    PROCESSED = MyProcessor.from_file(RAW)

    @classmethod
    def __create_dataset__(cls, dataset: AbstractDataset):
        return cls.C(path=cls.PROCESSED.path)

The transient=True flag tells the framework to delete intermediate data once all downstream resources are COMPLETE.
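The cleanup rule can be pictured with a small simulation: resources are processed in dependency order, and a transient resource is removed once every resource that consumes it is complete. The dictionary layout and variable names below are illustrative assumptions; the framework's own bookkeeping may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each resource maps to the resources it depends on
dependencies = {
    "RAW": set(),
    "PROCESSED": {"RAW"},
}
transient = {"RAW"}  # resources flagged with transient=True

# Reverse edges: which resources consume each resource
dependents = {name: set() for name in dependencies}
for name, deps in dependencies.items():
    for dep in deps:
        dependents[dep].add(name)

complete = set()
deleted = []

# Process resources in dependency order
for name in TopologicalSorter(dependencies).static_order():
    complete.add(name)  # stand-in for downloading or processing
    # A transient resource can be removed once every consumer is complete
    for candidate in sorted(transient - set(deleted)):
        consumers = dependents[candidate]
        if candidate in complete and consumers and consumers <= complete:
            deleted.append(candidate)

print(deleted)  # ['RAW']
```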

Dataset ID Naming

The dataset ID is derived from:

Module path: datamaestro_image.config.com.lecun → com.lecun
Class/function name: CamelCase is converted to dot-separated lowercase (e.g., MNIST → mnist, ProcessedMNIST → processed.mnist, TrainTexttripleFull → train.texttriple.full)
Final ID: com.lecun.mnist
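The derivation above can be approximated in a few lines. This is a sketch under two assumptions: the regular expression is one possible way to reproduce the CamelCase examples listed here, and the module prefix is taken to be everything after the `config` package. Neither is copied from the library's source.

```python
import re

# Splits CamelCase into words while keeping acronyms such as "MNIST" whole
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z0-9]+|[A-Z]+")


def class_name_to_id(name: str) -> str:
    """Convert a CamelCase class name to dot-separated lowercase."""
    return ".".join(word.lower() for word in _WORD.findall(name))


def module_to_prefix(module: str) -> str:
    """Keep the part of the module path after the 'config' package (assumed layout)."""
    return module.split(".config.", 1)[1]


def dataset_id(module: str, class_name: str) -> str:
    """Combine the module prefix and the converted class name."""
    return f"{module_to_prefix(module)}.{class_name_to_id(class_name)}"


print(dataset_id("datamaestro_image.config.com.lecun", "MNIST"))  # com.lecun.mnist
```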

The convention follows reversed domain names (like Java packages):

com.lecun.mnist for http://yann.lecun.com/exdb/mnist/
gov.nist.trec.robust04 for the https://trec.nist.gov/ ROBUST04 track
io.huggingface.squad for HuggingFace datasets

The `@dataset` Annotation

The @dataset decorator is the main annotation for defining datasets.

Dataset decorator

Meta-datasets are not associated with any base type.

Parameters:

base – The base type (or None if inferred from type annotation).
timestamp – If the dataset evolves, specify its timestamp.
id – Dataset ID override. Behavior depends on the format:
    - Full ID (e.g., "com.example.data"): used as-is if it has 3+ components
    - Suffix with dot prefix (e.g., ".8.topics"): appended to the module path
    - Single component (e.g., "mnist"): replaces the class name in the path
url – The URL associated with the dataset.
size – The size of the dataset (should be a parsable format).
doi – The DOI of the corresponding paper.
as_prepare – Resources are set up within the method itself.
variants – Optional Variants instance (or subclass) that declares the variant space. When provided, callers select a specific variant via a query-style suffix on the id (e.g. "pkg.id[name=x,streaming=true]"); see datamaestro.variants.
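To make the shape of a variant-qualified ID concrete, the sketch below splits a string such as "pkg.id[name=x,streaming=true]" into its base ID and its key/value selections. The function `split_variant` is hypothetical and keeps every value as a raw string; the actual parsing and validation live in datamaestro.variants and may behave differently.

```python
import re

# Base ID followed by an optional bracketed, comma-separated query
_VARIANT = re.compile(r"^(?P<base>[^\[\]]+)(?:\[(?P<query>[^\[\]]*)\])?$")


def split_variant(dataset_id: str):
    """Split 'base[key=value,...]' into the base ID and a dict of selections."""
    match = _VARIANT.match(dataset_id)
    if match is None:
        raise ValueError(f"Malformed dataset id: {dataset_id!r}")
    selections = {}
    query = match.group("query")
    if query:
        for item in query.split(","):
            key, sep, value = item.partition("=")
            if not sep:
                raise ValueError(f"Expected key=value, got {item!r}")
            # Values stay as strings here (an assumption of this sketch)
            selections[key.strip()] = value.strip()
    return match.group("base"), selections


print(split_variant("pkg.id[name=x,streaming=true]"))
# ('pkg.id', {'name': 'x', 'streaming': 'true'})
```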

Parameters

Parameter	Description
`base`	The base data type class (e.g., `ImageClassification`). Can be inferred from the class hierarchy.
`id`	Override the automatic ID. Use a `"."` prefix to append to the module path (e.g., `".8.topics"` becomes `module.path.8.topics`).
`url`	URL to the dataset’s homepage.
`doi`	DOI of the associated paper.
`timestamp`	Version timestamp for evolving datasets.
`size`	Dataset size (for documentation).
`as_prepare` (deprecated)	If True, the function receives the dataset object for manual resource handling.

ID Override Examples

# Full ID override
@dataset(MyType, id="org.example.custom")
class IgnoredName(MyType):
    ...

# Append suffix to module path (in module gov.nist.trec.adhoc)
@dataset(MyType, id=".8.topics")  # Results in gov.nist.trec.adhoc.8.topics
class Trec8Topics(MyType):
    ...

# Single component suffix (in module com.example)
@dataset(MyType, id=".v2")  # Results in com.example.v2
class Original(MyType):
    ...

# Empty string uses module path only
@dataset(MyType, id="")  # Results in com.example (no class name)
class Main(MyType):
    ...
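The override rules can be summarized as a small decision function. This is a sketch of the documented behavior rather than the library's code: `resolve_id` is a hypothetical helper, and an override with exactly two components and no leading dot is not covered by the rules, so the sketch rejects it.

```python
def resolve_id(module_path: str, class_id: str, override=None) -> str:
    """Apply the ID override rules to a module path and a default class ID."""
    if override is None:
        return f"{module_path}.{class_id}"   # default: module path + class name
    if override == "":
        return module_path                   # module path only
    if override.startswith("."):
        return module_path + override        # suffix appended to the module path
    parts = override.split(".")
    if len(parts) >= 3:
        return override                      # full ID, used as-is
    if len(parts) == 1:
        return f"{module_path}.{override}"   # replaces the class name
    # Two components without a leading dot: unspecified, rejected in this sketch
    raise ValueError(f"Unhandled override format: {override!r}")


print(resolve_id("gov.nist.trec.adhoc", "trec8.topics", ".8.topics"))
# gov.nist.trec.adhoc.8.topics
```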

Resources

Resources are defined as class attributes on dataset classes. See Download Resources for the full resource API reference and all available resource types.

Single Files

Use FileDownloader for single file downloads:

from datamaestro.download.single import FileDownloader

@dataset(url="http://example.com")
class MyDataset(CSVData):
    DATA = FileDownloader("data.csv", "http://example.com/data.csv")

Archives

Use ZipDownloader or TarDownloader for archives:

from datamaestro.download.archive import ZipDownloader, TarDownloader

@dataset(url="http://example.com")
class ZippedDataset(MyData):
    DATA = ZipDownloader("data", "http://example.com/archive.zip")

@dataset(url="http://example.com")
class TarDataset(MyData):
    DATA = TarDownloader(
        "data", "http://example.com/archive.tar.gz",
        subpath="archive/subdir",
    )

HuggingFace Integration

from datamaestro.download.huggingface import HFDownloader

@dataset(url="https://huggingface.co/datasets/squad")
class Squad(QADataset):
    HF_DATA = HFDownloader("squad_data", "squad")

Referencing Other Datasets

Use reference to declare a dependency on another dataset class. The referenced dataset is prepared automatically when needed:

from datamaestro.download import reference

@dataset(url="http://example.com")
class MyTask(TaskData):
    DOCUMENTS = reference(DocumentsDataset)

    def config(self) -> TaskData:
        return TaskData.C(
            documents=self.DOCUMENTS.config(),
        )

Call .config() on the reference resource to obtain the referenced dataset’s configuration object (this is equivalent to .prepare()).

Note

reference() accepts the dataset class as its first positional argument. The older keyword form reference(reference=DocumentsDataset) still works but is no longer necessary.

Links to Other Datasets (by ID)

Use links() to reference datasets by their string ID rather than by class:

from datamaestro.download.links import links

@dataset(url="http://example.com")
class ExtendedDataset(ExtendedData):
    BASE = links("base", "com.example.base_dataset")

Data Types

Data types define the structure of the returned data. They inherit from datamaestro.data.Base and use experimaestro’s configuration system. See Data Types for the full reference.

Built-in Types

datamaestro.data.Base - Base class for all data types
datamaestro.data.File - Single file reference
datamaestro.data.csv.Generic - CSV file
datamaestro.data.csv.Matrix - CSV with numeric data
datamaestro.data.tensor.IDX - IDX tensor format (MNIST)
datamaestro.data.ml.Supervised - Supervised learning data

Custom Data Types

Create custom data types by inheriting from Base. Use Param from experimaestro to define typed parameters:

from pathlib import Path

from experimaestro import Param
from datamaestro.data import Base

class MyCustomData(Base):
    """My custom data type"""
    path: Param[Path]
    """Path to the data file"""

    num_classes: Param[int] = 10
    """Number of classes"""

    def load(self):
        """Load the data"""
        import pandas as pd
        return pd.read_csv(self.path)

Tags and Tasks

Add semantic metadata with datatags() and datatasks() decorators:

from datamaestro.definitions import dataset, datatags, datatasks

@datatags("benchmark", "classification", "vision")
@datatasks("image-classification", "digit-recognition")
@dataset(url="http://example.com")
class MNIST(ImageClassification):
    """Dataset description"""
    TRAIN_IMAGES = FileDownloader(...)
    ...

Tags and tasks are searchable via the CLI:

datamaestro search tag:benchmark
datamaestro search task:classification

Metadatasets

Use metadataset to share common metadata across related datasets:

from datamaestro.definitions import metadataset, dataset, datatags

@datatags("trec", "information-retrieval")
@metadataset(IRDataset)
class TRECBase:
    """Common base for TREC datasets"""
    pass

@dataset(TRECBase, url="https://trec.nist.gov/...")
class Robust04(IRDataset):
    ...

@dataset(TRECBase, url="https://trec.nist.gov/...")
class Robust05(IRDataset):
    ...

File Validation

Use HashCheck to validate downloaded files with checksums:

from datamaestro.download.single import FileDownloader
from datamaestro.utils import HashCheck

DATA = FileDownloader(
    "data.csv",
    "http://example.com/data.csv",
    checker=HashCheck("sha256", "abc123...")
)
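For readers unfamiliar with checksum validation, the sketch below shows the underlying idea using only the standard library: hash the downloaded file and compare the digest with the expected value. The helpers `file_digest` and `verify` are hypothetical and are not part of datamaestro; `HashCheck` may differ in how it is configured and how it reports a mismatch.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify(path: Path, expected: str, algorithm: str = "sha256") -> None:
    """Raise ValueError when the file's digest differs from the expected one."""
    actual = file_digest(path, algorithm)
    if actual != expected.lower():
        raise ValueError(f"Checksum mismatch for {path}: {actual}")


# Demonstration on a small temporary file
sample = Path(tempfile.mkdtemp()) / "data.csv"
sample.write_bytes(b"a,b\n1,2\n")
expected = hashlib.sha256(b"a,b\n1,2\n").hexdigest()
verify(sample, expected)  # passes silently when the digests match
```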

Testing Datasets

Test your dataset definitions:

def test_my_dataset():
    from datamaestro import prepare_dataset

    ds = prepare_dataset("com.example.mydataset")
    assert ds.train is not None
    assert ds.test is not None

Use pytest with the --datamaestro-download flag to actually download during tests (otherwise downloads are skipped).

Deprecated: Decorator-based Datasets

Deprecated: The decorator-based API still works but emits deprecation warnings. Migrate to the class-based approach described above.

The legacy approach uses function decorators to define datasets:

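# DEPRECATED: use the class-based approach instead

from datamaestro_image.data import ImageClassification, LabelledImages
from datamaestro.data.tensor import IDX
from datamaestro.download.single import filedownloader
from datamaestro.definitions import dataset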

@filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
@filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
@filedownloader("test_images.idx", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")
@filedownloader("test_labels.idx", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")
@dataset(
    ImageClassification,
    url="http://yann.lecun.com/exdb/mnist/",
)
def MNIST(train_images, train_labels, test_images, test_labels):
    """The MNIST database"""
    return {
        "train": LabelledImages(
            images=IDX(path=train_images),
            labels=IDX(path=train_labels)
        ),
        "test": LabelledImages(
            images=IDX(path=test_images),
            labels=IDX(path=test_labels)
        ),
    }

In this legacy pattern:

Download decorators are stacked above @dataset
File paths are passed as arguments to the dataset function
The function returns a dict or data object

See Deprecated: Download Decorators for the full list of deprecated download decorator patterns.