API Reference

This section documents the datamaestro Python API.

Core Functions

prepare_dataset

The main entry point for using datasets in Python:

from datamaestro import prepare_dataset

# By dataset ID
ds = prepare_dataset("com.lecun.mnist")

# With custom data directory
from pathlib import Path
ds = prepare_dataset("com.lecun.mnist", context=Path("/custom/path"))

datamaestro.context.prepare_dataset(dataset_id: str | DatasetWrapper | Config, context: Context | Path | None = None, *, variant: Dict | None = None)

Find a dataset given its id and download the resources.

Variants can be selected two ways:

  • Inline in the id: prepare_dataset("pkg.id[k=v,...]").

  • As a dict kwarg: prepare_dataset("pkg.id", variant={"k": v, ...}).

Both forms route the kwargs through to the wrapper’s prepare() call (defaults are filled in by the dataset’s Variants). Passing both a selector and variant raises ValueError.
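The two selection forms and the ValueError on conflict can be illustrated with a small standalone sketch. This is not datamaestro's own parser: split_selector and resolve_variant are hypothetical helpers, and the sketch assumes that values inside the brackets are kept as raw strings (how the library converts them, and how Variants fill in defaults, is not covered here).

```python
import re
from typing import Any, Dict, Optional, Tuple

# Hypothetical pattern (not part of datamaestro): a bare id optionally
# followed by a bracketed, comma-separated list of k=v pairs.
_SELECTOR = re.compile(r"^(?P<id>[^\[\]]+)(?:\[(?P<args>[^\[\]]*)\])?$")


def split_selector(dataset_id: str) -> Tuple[str, Optional[Dict[str, str]]]:
    """Split "pkg.id[k=v,...]" into the bare id and a dict of raw strings."""
    match = _SELECTOR.match(dataset_id)
    if match is None:
        raise ValueError(f"Malformed dataset id: {dataset_id!r}")
    if match.group("args") is None:
        return match.group("id"), None  # no selector present
    kwargs: Dict[str, str] = {}
    for item in filter(None, match.group("args").split(",")):
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"Expected k=v, got {item!r}")
        kwargs[key.strip()] = value.strip()
    return match.group("id"), kwargs


def resolve_variant(
    dataset_id: str, variant: Optional[Dict[str, Any]] = None
) -> Tuple[str, Dict[str, Any]]:
    """Return (bare id, variant kwargs), rejecting the ambiguous case."""
    bare_id, inline = split_selector(dataset_id)
    if inline is not None and variant is not None:
        raise ValueError("Pass either an inline selector or variant=, not both")
    return bare_id, dict(inline if inline is not None else (variant or {}))


print(resolve_variant("pkg.id[name=x,streaming=true]"))
print(resolve_variant("pkg.id", variant={"name": "x"}))
```

Both calls yield the same bare id, which is the point of the two forms: they differ only in where the variant kwargs are written.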

When called inside an active experimaestro experiment context, the download is deferred: the returned config is an experimaestro.Prepare instance, and the framework calls .prepare() lazily as an in-memory dependency before any task that references the dataset runs. Outside an experiment context (e.g. notebook, standalone script), the download is performed eagerly to preserve the previous behavior.

find_dataset

Find a dataset without downloading:

from datamaestro import find_dataset

ds = find_dataset("com.lecun.mnist")
print(ds.url)  # Dataset URL
print(ds.description)  # Dataset description

datamaestro.context.find_dataset(dataset_id: str)

Find a dataset given its id.

Accepts a plain id or a variant-selector form ("pkg.id[k=v,...]"). The selector only affects downstream prepare(); find_dataset returns the family wrapper.

get_dataset

Get a dataset without downloading (assumes already downloaded):

from datamaestro import get_dataset

ds = get_dataset("com.lecun.mnist")

datamaestro.context.get_dataset(dataset_id: str, *, variant: Dict | None = None)

Find a dataset given its id (without downloading).

Like prepare_dataset(), accepts either "pkg.id[k=v,...]" or a variant={"k": v, ...} dict kwarg.

Context

The Context class manages global state:

from datamaestro.context import Context

ctx = Context.instance()

# Access paths
ctx.datapath      # ~/datamaestro/data
ctx.cachepath     # ~/datamaestro/cache

# Iterate over datasets
for ds in ctx.datasets():
    print(ds.id)

# Get a specific dataset
ds = ctx.dataset("com.lecun.mnist")

class datamaestro.context.Context(path: Path = None)

Represents the application context

dataset(datasetid) → AbstractDataset

Get a dataset by ID.

datasetid may be a variant-selector form ("pkg.id[k=v,...]"); the selector is ignored here (Context.dataset returns the family wrapper). Use prepare_dataset() to route variant kwargs through to prepare().

datasets()

Returns an iterator over all datasets

repositories() → Iterable[Repository]

Returns an iterator over repositories

Dataset Classes

AbstractDataset

Base class for all dataset definitions:

class datamaestro.definitions.AbstractDataset(repository: 'Repository' | None)

Specialization of AbstractData for datasets

A dataset:

  • has a unique ID (and aliases)

  • can be searched for

  • has a data storage space

  • has specific attributes:
    • timestamp: whether the dataset version depends on the time of the download

download(force=False)

Download all the necessary resources.

Uses DAG-based topological ordering and the two-path system:

  1. Acquire exclusive lock (.state.lock)

  2. Resource writes to transient_path (under .downloads/)

  3. Framework moves transient_path → path (main folder)

  4. State marked COMPLETE

  5. Transient dependencies cleaned up eagerly

  6. .downloads/ directory removed after all resources complete

  7. Release lock
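The two-path steps listed above can be sketched with the standard library alone. This is a toy illustration, not datamaestro's implementation: exclusive_lock and fetch are hypothetical helpers, the lock is a plain create-exclusive file, and state tracking and transient-dependency cleanup are left out.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def exclusive_lock(lock_path: Path):
    """Toy lock: creating the file fails if another holder already exists."""
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        lock_path.unlink()


def fetch(name: str, root: Path, write) -> Path:
    """Write into .downloads/<name>, then move the result to <root>/<name>."""
    final_path = root / name
    transient_path = root / ".downloads" / name
    transient_path.parent.mkdir(parents=True, exist_ok=True)
    write(transient_path)  # the resource only ever writes to the transient path
    shutil.move(str(transient_path), str(final_path))  # publish the finished file
    return final_path


root = Path(tempfile.mkdtemp())
with exclusive_lock(root / ".state.lock"):
    path = fetch("train.txt", root, lambda p: p.write_text("hello"))
    shutil.rmtree(root / ".downloads")  # removed once every resource is complete

print(path.read_text())
print(sorted(p.name for p in root.iterdir()))
```

Writing to a transient location first means a reader of the main folder never observes a half-written file: a path either does not exist yet or holds a complete download.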

DatasetWrapper

The standard dataset wrapper created by the @dataset decorator:

class datamaestro.definitions.DatasetWrapper(annotation: dataset, t: type)

Wraps an annotated method into a dataset

This is the standard way to define a dataset in datamaestro through annotations (otherwise, derive from AbstractDataset).

property datapath

Returns the destination path for downloads

download(force=False)

Download all the necessary resources.

Uses DAG-based topological ordering and the two-path system:

  1. Acquire exclusive lock (.state.lock)

  2. Resource writes to transient_path (under .downloads/)

  3. Framework moves transient_path → path (main folder)

  4. State marked COMPLETE

  5. Transient dependencies cleaned up eagerly

  6. .downloads/ directory removed after all resources complete

  7. Release lock
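To make "DAG-based topological ordering" concrete, here is a minimal sketch using Python's standard graphlib module. The resource names and their dependencies are made up for illustration, and nothing here implies that datamaestro uses graphlib internally.

```python
from graphlib import TopologicalSorter

# Hypothetical resource graph: each key maps to the resources it depends on.
dependencies = {
    "archive": set(),
    "extracted": {"archive"},
    "index": {"extracted"},
    "labels": set(),
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Several orders are valid for this graph (for example, "labels" can come anywhere), but "archive" always precedes "extracted", which always precedes "index".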

Decorators

@dataset

Main decorator for defining datasets (see Dataset Definition for details):

class datamaestro.definitions.dataset(base=None, *, timestamp: str | None = None, id: None | str = None, url: None | str = None, size: None | int | str = None, doi: None | str = None, as_prepare: bool = False, variants: Variants | type | None = None)

Dataset decorator

Meta-datasets are not associated with any base type.

Parameters:
  • base – The base type (or None if inferred from type annotation).

  • timestamp – If the dataset evolves, specify its timestamp.

  • id – Dataset ID override. Behavior depends on format:
    • Full ID (e.g., "com.example.data"): used as-is if it has 3+ components
    • Suffix with dot prefix (e.g., ".8.topics"): appended to module path
    • Single component (e.g., "mnist"): replaces the class name in the path

  • url – The URL associated with the dataset.

  • size – The size of the dataset (should be a parsable format).

  • doi – The DOI of the corresponding paper.

  • as_prepare – Resources are set up within the method itself.

  • variants – Optional Variants instance (or subclass) that declares the variant space. When provided, callers select a specific variant via a query-style suffix on the id (e.g. "pkg.id[name=x,streaming=true]"); see datamaestro.variants.
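The three id override formats described for the id parameter can be read as a small decision procedure. The sketch below is a hypothetical helper, not the decorator's actual code. It assumes that the default id is the module path followed by the class name, and it raises an error for two-component overrides because the documented rules do not mention them.

```python
from typing import Optional


def resolve_dataset_id(
    module_path: str, class_name: str, override: Optional[str] = None
) -> str:
    """Hypothetical helper mirroring the three documented id formats."""
    if override is None:
        return f"{module_path}.{class_name}"  # assumed default, see lead-in
    if override.startswith("."):
        return module_path + override  # suffix form: appended to module path
    parts = override.split(".")
    if len(parts) >= 3:
        return override  # full id: used as-is
    if len(parts) == 1:
        return f"{module_path}.{override}"  # replaces the class name
    raise ValueError(f"Format not covered by the documented rules: {override!r}")


print(resolve_dataset_id("com.example", "data", "com.example.data"))
print(resolve_dataset_id("org.example.track", "data", ".8.topics"))
print(resolve_dataset_id("com.example", "data", "mnist"))
```

The dot-prefix check comes first on purpose: splitting ".8.topics" on dots also gives three parts, so testing the component count first would misclassify a suffix as a full id.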

@metadataset

Decorator for abstract/shared dataset definitions:

class datamaestro.definitions.metadataset(base)

Annotation for objects/functions that are abstract dataset definitions, i.e. shared by more than one real dataset. This is useful to share tags, URLs, etc.

@datatags / @datatasks

Add semantic metadata:

from datamaestro.definitions import datatags, datatasks

@datatags("benchmark", "classification")
@datatasks("image-classification")
@dataset(MyType)
def my_dataset():
    ...

Repository

class datamaestro.context.Repository(context: Context)

(deprecated) Repository where datasets are located in __module__.config

search(name: str)

Search for a dataset in the definitions.

Accepts either a bare id ("pkg.id") or a variant-selector form ("pkg.id[k=v,...]"). The selector suffix is stripped when matching against aliases; callers that need the parsed variant kwargs should use find_dataset() / prepare_dataset() (which return the resolved config).

Search Conditions

For programmatic dataset search:

from datamaestro.context import Context
from datamaestro.search import Condition, AndCondition, TagCondition

# Parse search term
condition = Condition.parse("tag:classification")

# Build complex queries
query = AndCondition()
query.append(TagCondition("classification"))
query.append(Condition.parse("repo:image"))

# Match datasets from the shared context
context = Context.instance()
for ds in context.datasets():
    if query.match(ds):
        print(ds.id)

class datamaestro.search.Condition
class datamaestro.search.AndCondition
class datamaestro.search.TagCondition(regex)
class datamaestro.search.TaskCondition(regex)
class datamaestro.search.RepositoryCondition(regex)
class datamaestro.search.TypeCondition(typename: str)
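The condition classes above take a regular expression and are combined through AndCondition.append() and match(). The following standalone sketch shows how such a composition could work. All Toy* names are hypothetical stand-ins, and matching with re.search is an assumption, since this reference does not specify the exact matching semantics of the real classes.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyDataset:
    """Stand-in for a dataset: only an id and a list of tags."""
    id: str
    tags: List[str] = field(default_factory=list)


class ToyTagCondition:
    """Matches when any tag matches the regular expression."""

    def __init__(self, regex: str):
        self.regex = re.compile(regex)

    def match(self, dataset: ToyDataset) -> bool:
        return any(self.regex.search(tag) for tag in dataset.tags)


class ToyIdCondition:
    """Matches when the dataset id matches the regular expression."""

    def __init__(self, regex: str):
        self.regex = re.compile(regex)

    def match(self, dataset: ToyDataset) -> bool:
        return self.regex.search(dataset.id) is not None


class ToyAndCondition:
    """Matches only when every appended condition matches."""

    def __init__(self):
        self.conditions = []

    def append(self, condition) -> None:
        self.conditions.append(condition)

    def match(self, dataset: ToyDataset) -> bool:
        return all(c.match(dataset) for c in self.conditions)


datasets = [
    ToyDataset("com.example.digits", ["classification", "image"]),
    ToyDataset("com.example.reviews", ["classification", "text"]),
    ToyDataset("org.example.corpus", ["retrieval"]),
]

query = ToyAndCondition()
query.append(ToyTagCondition("classification"))
query.append(ToyIdCondition(r"^com\.example\."))

matched = [ds.id for ds in datasets if query.match(ds)]
print(matched)
```

Because every condition exposes the same match() method, the AND node can hold any mix of leaf conditions without knowing their types.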