Dataset Definition
A dataset definition in datamaestro combines declarative metadata with imperative data processing logic. This page explains how to create your own dataset definitions.
Components of a Dataset
Every dataset definition includes:
ID: Unique identifier determined by module location and class/function name
Meta-information: Tags, tasks, URL, description
Resources: What files/data to fetch (defined as class attributes)
Data access: How to structure the data in Python
Class-based Datasets (Preferred)
The preferred way to define datasets uses class-based definitions where resources are declared as class attributes. The framework automatically detects resources and builds a dependency DAG.
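One way to picture this detection step is the sketch below. It is not datamaestro's implementation: the Resource class, its depends_on argument, and the resource_order helper are hypothetical names used only for illustration. Class attributes receive their names through Python's __set_name__ hook and are then sorted with the standard library's graphlib so that every resource comes after the resources it depends on.

```python
from graphlib import TopologicalSorter


class Resource:
    """Hypothetical stand-in for a download or processing step."""

    def __init__(self, *depends_on):
        self.depends_on = depends_on
        self.name = None

    def __set_name__(self, owner, name):
        # Python calls this hook when the instance is bound as a class attribute
        self.name = name


def resource_order(cls):
    """Return resource names ordered so that dependencies come first."""
    resources = [v for v in vars(cls).values() if isinstance(v, Resource)]
    graph = {r.name: {d.name for d in r.depends_on} for r in resources}
    return list(TopologicalSorter(graph).static_order())


class Example:
    RAW = Resource()
    CLEANED = Resource(RAW)
    INDEX = Resource(CLEANED)


order = resource_order(Example)
print(order)
```

The same idea extends to branching pipelines: any resource whose dependencies are already complete can be processed next.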
Basic Example
# datamaestro_image/config/com/lecun.py
from datamaestro_image.data import ImageClassification, LabelledImages
from datamaestro.data.tensor import IDX
from datamaestro.download.single import FileDownloader
from datamaestro.definitions import Dataset, dataset


@dataset(url="http://yann.lecun.com/exdb/mnist/")
class MNIST(Dataset):
    """The MNIST database of handwritten digits."""

    TRAIN_IMAGES = FileDownloader(
        "train_images.idx",
        "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz",
    )
    TRAIN_LABELS = FileDownloader(
        "train_labels.idx",
        "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    )
    TEST_IMAGES = FileDownloader(
        "test_images.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz",
    )
    TEST_LABELS = FileDownloader(
        "test_labels.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    )

    def config(self) -> ImageClassification:
        return ImageClassification.C(
            train=LabelledImages(
                images=IDX(path=self.TRAIN_IMAGES.path),
                labels=IDX(path=self.TRAIN_LABELS.path),
            ),
            test=LabelledImages(
                images=IDX(path=self.TEST_IMAGES.path),
                labels=IDX(path=self.TEST_LABELS.path),
            ),
        )
Advantages of class-based definitions:
Explicit pipeline — dependencies between resources are visible
Transient intermediaries — intermediate files can be deleted after processing
Auto-naming — resource names are auto-detected from class attribute names
Two-path safety — incomplete downloads never appear at the final path
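The two-path idea can be sketched with the standard library alone. This is an illustration of the general pattern (write to a temporary path, then move into place), not datamaestro's own code; the write_atomically helper and the ".partial" suffix are assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(final_path: Path, payload: bytes) -> None:
    """Write to a temporary sibling path, then move it into place."""
    tmp_path = final_path.with_name(final_path.name + ".partial")
    with open(tmp_path, "wb") as fh:
        fh.write(payload)
    # os.replace is atomic when both paths live on the same filesystem,
    # so readers see either no file or the complete file
    os.replace(tmp_path, final_path)


workdir = Path(tempfile.mkdtemp())
target = workdir / "train_images.idx"
write_atomically(target, b"example bytes")
print(target.exists())
```

If the process is interrupted during the write, only the temporary file is left behind and the final path stays absent.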
Dataset Utilities
Dataset.data_path
A class property that returns the Path where this dataset’s data is stored on disk. This is a shortcut for MyDataset.__dataset__.datapath:
@dataset(url="http://example.com")
class Documents(Dataset):
    ...


@dataset(url="http://example.com")
class MyTask(Dataset):
    DOCS = reference(Documents)

    def config(self) -> TaskData:
        # Use Documents.data_path to locate sibling data
        store_path = Documents.data_path / "docstore"
        return TaskData.C(path=store_path)
Resource Pipelines
Resources can depend on other resources, forming a processing pipeline:
@dataset(url="http://example.com")
class ProcessedDataset(MyData):
    # Raw download: deleted after processing completes
    RAW = FileDownloader(
        "raw.gz", "http://example.com/data.gz",
        transient=True,
    )

    # Processed output: kept permanently
    PROCESSED = MyProcessor.from_file(RAW)

    @classmethod
    def __create_dataset__(cls, dataset: AbstractDataset):
        return cls.C(path=cls.PROCESSED.path)
The transient=True flag tells the framework to delete intermediate data
once all downstream resources are COMPLETE.
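The cleanup rule can be illustrated with a small standalone sketch. The data structures, the deletable helper, and the "PARTIAL" state name are hypothetical; only the COMPLETE state and the transient flag come from the text above. The sketch also assumes that a transient resource with no downstream resources is never deleted.

```python
def deletable(resources, dependents, states):
    """Return transient resources whose downstream resources are all COMPLETE.

    resources:  name -> True if the resource is transient
    dependents: name -> list of downstream resource names
    states:     name -> current state string
    """
    return [
        name
        for name, transient in resources.items()
        if transient
        and dependents.get(name)
        and all(states[d] == "COMPLETE" for d in dependents[name])
    ]


resources = {"RAW": True, "PROCESSED": False}
dependents = {"RAW": ["PROCESSED"]}

# While PROCESSED is unfinished, RAW must be kept
pending = deletable(resources, dependents, {"RAW": "COMPLETE", "PROCESSED": "PARTIAL"})
# Once PROCESSED is COMPLETE, RAW can be removed
done = deletable(resources, dependents, {"RAW": "COMPLETE", "PROCESSED": "COMPLETE"})
print(pending, done)
```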
Dataset ID Naming
The dataset ID is derived from:
Module path: datamaestro_image.config.com.lecun → com.lecun
Class/function name: CamelCase is converted to dot-separated lowercase (e.g., MNIST → mnist, ProcessedMNIST → processed.mnist, TrainTexttripleFull → train.texttriple.full)
Final ID: com.lecun.mnist
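The naming rule can be approximated with a short regular-expression sketch. It reproduces the three examples above but is not taken from datamaestro's source; class_to_id and dataset_id are hypothetical helpers, and the assumption that the module prefix is whatever follows the config package is inferred from the single example given.

```python
import re


def class_to_id(name: str) -> str:
    """Convert a CamelCase class name to dot-separated lowercase."""
    # Insert a dot before an uppercase letter that follows a lowercase letter
    # or digit, and before the last capital of an acronym when lowercase
    # letters follow it; acronyms such as MNIST stay in one piece.
    dotted = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", ".", name)
    return dotted.lower()


def dataset_id(module: str, class_name: str) -> str:
    """Join the module path after the config package with the class ID."""
    prefix = module.split(".config.", 1)[1]
    return f"{prefix}.{class_to_id(class_name)}"


print(dataset_id("datamaestro_image.config.com.lecun", "MNIST"))
```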
The convention follows reversed domain names (like Java packages):
com.lecun.mnist for http://yann.lecun.com/exdb/mnist/
org.trec.robust04 for https://trec.nist.gov/ ROBUST04 track
io.huggingface.squad for HuggingFace datasets
The @dataset Annotation
The @dataset decorator is the main annotation for defining datasets.
- class datamaestro.definitions.dataset(base=None, *, timestamp: str | None = None, id: None | str = None, url: None | str = None, size: None | int | str = None, doi: None | str = None, as_prepare: bool = False, variants: Variants | type | None = None)
Dataset decorator
Meta-datasets are not associated with any base type.
- Parameters:
base – The base type (or None if inferred from type annotation).
timestamp – If the dataset evolves, specify its timestamp.
id – Dataset ID override. Behavior depends on format:
- Full ID (e.g., “com.example.data”): used as-is if it has 3+ components
- Suffix with dot prefix (e.g., “.8.topics”): appended to module path
- Single component (e.g., “mnist”): replaces the class name in the path
url – The URL associated with the dataset.
size – The size of the dataset (should be a parsable format).
doi – The DOI of the corresponding paper.
as_prepare – Resources are set up within the method itself.
variants – Optional Variants instance (or subclass) that declares the variant space. When provided, callers select a specific variant via a query-style suffix on the id (e.g. "pkg.id[name=x,streaming=true]"); see datamaestro.variants.
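To make the query-style suffix concrete, the sketch below splits such a string into a base ID and its key/value pairs. It only illustrates the syntax shown in the example above; split_variant is a hypothetical helper, values are left as raw strings, and the real parsing in datamaestro.variants may differ.

```python
import re


def split_variant(query: str):
    """Split "pkg.id[k=v,...]" into the base ID and a dict of raw strings."""
    match = re.fullmatch(r"([^\[\]]+)(?:\[(.*)\])?", query)
    if match is None:
        raise ValueError(f"Malformed dataset ID: {query!r}")
    base, args = match.group(1), match.group(2)
    options = {}
    # Skip empty items so that "pkg.id[]" yields no options
    for item in filter(None, (args or "").split(",")):
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return base, options


print(split_variant("pkg.id[name=x,streaming=true]"))
```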
Parameters
| Parameter | Description |
|---|---|
| base | The base data type class. |
| id | Override the automatic ID (see ID Override Examples). |
| url | URL to the dataset’s homepage. |
| doi | DOI of the associated paper. |
| timestamp | Version timestamp for evolving datasets. |
| size | Dataset size (for documentation). |
| as_prepare | If True, the function receives the dataset object for manual resource handling. |
ID Override Examples
# Full ID override
@dataset(MyType, id="org.example.custom")
class IgnoredName(MyType):
    ...


# Append suffix to module path (in module gov.nist.trec.adhoc)
@dataset(MyType, id=".8.topics")  # Results in gov.nist.trec.adhoc.8.topics
class Trec8Topics(MyType):
    ...


# Single component suffix (in module com.example)
@dataset(MyType, id=".v2")  # Results in com.example.v2
class Original(MyType):
    ...


# Empty string uses module path only
@dataset(MyType, id="")  # Results in com.example (no class name)
class Main(MyType):
    ...
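The override rules can be summarized as a small decision function. This is a sketch derived from the parameter description and the examples above, not datamaestro's actual resolution code; resolve_id is a hypothetical helper, and treating a two-component override without a leading dot like a single component is an assumption, since the documentation does not cover that case.

```python
from typing import Optional


def resolve_id(module_id: str, class_id: str, override: Optional[str]) -> str:
    """Apply an id= override to the automatically derived dataset ID."""
    if override is None:
        # No override: module path plus converted class name
        return f"{module_id}.{class_id}"
    if override == "":
        # Empty string: module path only
        return module_id
    if override.startswith("."):
        # Dot prefix: suffix appended to the module path
        return module_id + override
    if override.count(".") >= 2:
        # Three or more components: full ID used as-is
        return override
    # Assumption: anything else replaces the class-name part
    return f"{module_id}.{override}"


print(resolve_id("gov.nist.trec.adhoc", "trec8.topics", ".8.topics"))
```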
Resources
Resources are defined as class attributes on dataset classes. See Download Resources for the full resource API reference and all available resource types.
Single Files
Use FileDownloader for single file downloads:
from datamaestro.download.single import FileDownloader


@dataset(url="http://example.com")
class MyDataset(CSVData):
    DATA = FileDownloader("data.csv", "http://example.com/data.csv")
Archives
Use ZipDownloader or TarDownloader for archives:
from datamaestro.download.archive import ZipDownloader, TarDownloader


@dataset(url="http://example.com")
class ZippedDataset(MyData):
    DATA = ZipDownloader("data", "http://example.com/archive.zip")


@dataset(url="http://example.com")
class TarDataset(MyData):
    DATA = TarDownloader(
        "data", "http://example.com/archive.tar.gz",
        subpath="archive/subdir",
    )
HuggingFace Integration
from datamaestro.download.huggingface import HFDownloader


@dataset(url="https://huggingface.co/datasets/squad")
class Squad(QADataset):
    HF_DATA = HFDownloader("squad_data", "squad")
Referencing Other Datasets
Use reference to declare a dependency on another dataset class, as in DOCS = reference(Documents) from the Dataset.data_path example above. The referenced dataset is prepared automatically when needed.
Call .config() on the reference resource to obtain the referenced dataset’s configuration object (this is equivalent to .prepare()).
Note
reference() accepts the dataset class as its first positional argument.
The older keyword form reference(reference=DocumentsDataset) still works
but is no longer necessary.
Links to Other Datasets (by ID)
Use links() to reference datasets by their
string ID rather than by class:
from datamaestro.download.links import links


@dataset(url="http://example.com")
class ExtendedDataset(ExtendedData):
    BASE = links("base", "com.example.base_dataset")
Data Types
Data types define the structure of returned data. They inherit from
datamaestro.data.Base and use experimaestro’s configuration system.
See Data Types for the full reference.
Built-in Types
datamaestro.data.Base - Base class for all data types
datamaestro.data.File - Single file reference
datamaestro.data.csv.Generic - CSV file
datamaestro.data.csv.Matrix - CSV with numeric data
datamaestro.data.tensor.IDX - IDX tensor format (MNIST)
datamaestro.data.ml.Supervised - Supervised learning data
Custom Data Types
Create custom data types by inheriting from Base.
Use Param from experimaestro to define typed parameters:
from pathlib import Path

from experimaestro import Param
from datamaestro.data import Base


class MyCustomData(Base):
    """My custom data type"""

    path: Param[Path]
    """Path to the data file"""

    num_classes: Param[int] = 10
    """Number of classes"""

    def load(self):
        """Load the data"""
        import pandas as pd

        return pd.read_csv(self.path)
Metadatasets
Use metadataset to share common metadata
across related datasets:
from datamaestro.definitions import datatags, metadataset, dataset


@datatags("trec", "information-retrieval")
@metadataset(IRDataset)
class TRECBase:
    """Common base for TREC datasets"""
    pass


@dataset(TRECBase, url="https://trec.nist.gov/...")
class Robust04(IRDataset):
    ...


@dataset(TRECBase, url="https://trec.nist.gov/...")
class Robust05(IRDataset):
    ...
File Validation
Use HashCheck to validate downloaded files with checksums:
from datamaestro.download.single import FileDownloader
from datamaestro.utils import HashCheck

DATA = FileDownloader(
    "data.csv",
    "http://example.com/data.csv",
    checker=HashCheck("sha256", "abc123..."),
)
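For readers unfamiliar with checksum validation, the sketch below shows what such a check amounts to, using only hashlib. It does not reproduce HashCheck's internals; file_digest and verify are hypothetical helpers, and the sample file is created just for the demonstration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify(path: Path, expected: str, algorithm: str = "sha256") -> None:
    """Raise ValueError when the file does not match the expected digest."""
    actual = file_digest(path, algorithm)
    if actual != expected.lower():
        raise ValueError(f"Checksum mismatch for {path}: got {actual}")


sample = Path(tempfile.mkdtemp()) / "data.csv"
sample.write_bytes(b"hello")
verify(sample, hashlib.sha256(b"hello").hexdigest())  # passes silently
```

A mismatch raises an error instead of returning False, so a corrupted download cannot be used by accident.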
Testing Datasets
Test your dataset definitions:
def test_my_dataset():
    from datamaestro import prepare_dataset

    ds = prepare_dataset("com.example.mydataset")
    assert ds.train is not None
    assert ds.test is not None
Use pytest with the --datamaestro-download flag to actually download
during tests (otherwise downloads are skipped).
Deprecated: Decorator-based Datasets
Deprecated: The decorator-based API still works but emits deprecation warnings. Migrate to the class-based approach described above.
The legacy approach uses function decorators to define datasets:
from datamaestro_image.data import ImageClassification, LabelledImages
from datamaestro.data.tensor import IDX
from datamaestro.download.single import filedownloader
from datamaestro.definitions import dataset


@filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
@filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
@filedownloader("test_images.idx", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")
@filedownloader("test_labels.idx", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")
@dataset(
    ImageClassification,
    url="http://yann.lecun.com/exdb/mnist/",
)
def MNIST(train_images, train_labels, test_images, test_labels):
    """The MNIST database"""
    return {
        "train": LabelledImages(
            images=IDX(path=train_images),
            labels=IDX(path=train_labels),
        ),
        "test": LabelledImages(
            images=IDX(path=test_images),
            labels=IDX(path=test_labels),
        ),
    }
In this legacy pattern:
Download decorators are stacked above @dataset
File paths are passed as arguments to the dataset function
The function returns a dict or data object
See Deprecated: Download Decorators for the full list of deprecated download decorator patterns.