API Reference
This section documents the datamaestro Python API. See also:
Data Types - Dataset content structures
Download Decorators - Resource fetching
Dataset Variants - Variant axes and query-style selectors
Records - Heterogeneous containers (deprecated)
Core Functions
prepare_dataset
The main entry point for using datasets in Python:
from datamaestro import prepare_dataset
# By dataset ID
ds = prepare_dataset("com.lecun.mnist")
# With custom data directory
from pathlib import Path
ds = prepare_dataset("com.lecun.mnist", context=Path("/custom/path"))
- datamaestro.context.prepare_dataset(dataset_id: str | DatasetWrapper | Config, context: Context | Path | None = None, *, variant: Dict | None = None)
Find a dataset given its id and download the resources.
Variants can be selected two ways:
Inline in the id: prepare_dataset("pkg.id[k=v,...]").
As a dict kwarg: prepare_dataset("pkg.id", variant={"k": v, ...}).
Both forms route the kwargs through to the wrapper’s prepare() call (defaults are filled in by the dataset’s Variants). Passing both a selector and variant raises ValueError.
When called inside an active experimaestro experiment context, the download is deferred: the returned config is an experimaestro.Prepare instance, and the framework calls .prepare() lazily as an in-memory dependency before any task that references the dataset runs. Outside an experiment context (e.g. notebook, standalone script), the download is performed eagerly to preserve the previous behavior.
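The two selection forms and the ValueError rule can be illustrated with a small standalone sketch. It does not use datamaestro's own parser: split_selector and resolve_variant are hypothetical helpers, values are kept as raw strings, and the grammar the library actually accepts may differ.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical pattern (not datamaestro's): a bare id, optionally followed
# by one "[...]" selector suffix.
_SELECTOR = re.compile(r"^(?P<id>[^\[\]]+)(?:\[(?P<body>[^\[\]]*)\])?$")


def split_selector(dataset_id: str) -> Tuple[str, Optional[Dict[str, str]]]:
    """Split "pkg.id[k=v,...]" into the bare id and a dict of raw strings."""
    match = _SELECTOR.match(dataset_id)
    if match is None:
        raise ValueError(f"Malformed dataset id: {dataset_id!r}")
    body = match.group("body")
    if body is None:
        return match.group("id"), None  # no selector suffix at all
    kwargs: Dict[str, str] = {}
    for item in filter(None, (part.strip() for part in body.split(","))):
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"Expected key=value, got {item!r}")
        kwargs[key.strip()] = value.strip()
    return match.group("id"), kwargs


def resolve_variant(dataset_id: str, variant: Optional[dict] = None):
    """Return (bare_id, kwargs); reject a selector combined with `variant`."""
    bare_id, inline = split_selector(dataset_id)
    if inline is not None and variant is not None:
        raise ValueError("Pass either an inline selector or variant=, not both")
    return bare_id, (inline if inline is not None else variant) or {}
```

In this sketch both forms end up as the same kind of kwargs dict, which mirrors the description above of routing them through to prepare().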
find_dataset
Find a dataset without downloading:
get_dataset
Get a dataset without downloading (assumes already downloaded):
from datamaestro import get_dataset
ds = get_dataset("com.lecun.mnist")
- datamaestro.context.get_dataset(dataset_id: str, *, variant: Dict | None = None)
Find a dataset given its id (without downloading).
Like prepare_dataset(), accepts either "pkg.id[k=v,...]" or a variant={"k": v, ...} dict kwarg.
Context
The Context class manages global state:
- class datamaestro.context.Context(path: Path = None)
Represents the application context
- dataset(datasetid) AbstractDataset
Get a dataset by ID.
datasetid may be a variant-selector form ("pkg.id[k=v,...]"); the selector is ignored here (Context.dataset returns the family wrapper). Use prepare_dataset() to route variant kwargs through to prepare().
- datasets()
Returns an iterator over all files
- repositories() Iterable[Repository]
Returns an iterator over repositories
Dataset Classes
AbstractDataset
Base class for all dataset definitions:
- class datamaestro.definitions.AbstractDataset(repository: 'Repository' | None)
Specialization of AbstractData for datasets
A dataset:
has a unique ID (and aliases)
can be searched for
has a data storage space
has specific attributes:
timestamp: whether the dataset version depends on the time of the download
- download(force=False)
Download all the necessary resources.
Uses DAG-based topological ordering and the two-path system:
1. Acquire exclusive lock (.state.lock)
2. Resource writes to transient_path (under .downloads/)
3. Framework moves transient_path → path (main folder)
4. State marked COMPLETE
5. Transient dependencies cleaned up eagerly
6. .downloads/ directory removed after all resources complete
7. Release lock
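As a rough illustration of the two-path idea (write to a transient location, then move into place), the sketch below stages a file under .downloads/ and promotes it only once it is fully written. It is not datamaestro's implementation: fetch_with_two_paths, the write callback and the marker file name are hypothetical, and locking is left out entirely.

```python
import shutil
import tempfile
from pathlib import Path


def fetch_with_two_paths(base: Path, name: str, write) -> Path:
    """Write to a transient path, then move the result to its final path.

    `write` is a hypothetical callback that receives the transient path.
    """
    downloads = base / ".downloads"
    downloads.mkdir(parents=True, exist_ok=True)
    transient = downloads / name
    final = base / name

    write(transient)  # the resource only ever writes to the transient path
    shutil.move(str(transient), str(final))  # promote after a complete write
    (base / f".{name}.complete").touch()  # stand-in for "state marked COMPLETE"

    # Remove the staging directory once nothing is left in it
    if not any(downloads.iterdir()):
        downloads.rmdir()
    return final


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    path = fetch_with_two_paths(base, "data.txt", lambda p: p.write_text("hello"))
    content = path.read_text()
    staging_removed = not (base / ".downloads").exists()
```

The point of the pattern is that an interrupted download leaves debris only in the staging directory, never a half-written file at the final path.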
DatasetWrapper
The standard dataset wrapper created by the @dataset decorator:
- class datamaestro.definitions.DatasetWrapper(annotation: dataset, t: type)
Wraps an annotated method into a dataset
This is the standard way to define a dataset in datamaestro through annotations (otherwise, derive from AbstractDataset).
- property datapath
Returns the destination path for downloads
- download(force=False)
Download all the necessary resources.
Uses DAG-based topological ordering and the two-path system:
1. Acquire exclusive lock (.state.lock)
2. Resource writes to transient_path (under .downloads/)
3. Framework moves transient_path → path (main folder)
4. State marked COMPLETE
5. Transient dependencies cleaned up eagerly
6. .downloads/ directory removed after all resources complete
7. Release lock
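The DAG-based ordering can be pictured with Python's standard graphlib module. This is only an analogy with assumed resource names (archive, extracted, index, labels); how datamaestro builds and walks its own dependency graph is not shown here.

```python
from graphlib import TopologicalSorter

# Hypothetical resources: each key depends on the names in its set.
dependencies = {
    "archive": set(),
    "extracted": {"archive"},
    "index": {"extracted"},
    "labels": set(),
}

# static_order() yields every node after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())
position = {name: i for i, name in enumerate(order)}
```

Any order produced this way processes archive before extracted and extracted before index, while labels, which has no dependencies, may appear anywhere.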
Decorators
@dataset
Main decorator for defining datasets (see Dataset Definition for details):
- class datamaestro.definitions.dataset(base=None, *, timestamp: str | None = None, id: None | str = None, url: None | str = None, size: None | int | str = None, doi: None | str = None, as_prepare: bool = False, variants: Variants | type | None = None)
Dataset decorator
Meta-datasets are not associated with any base type.
- Parameters:
base – The base type (or None if inferred from type annotation).
timestamp – If the dataset evolves, specify its timestamp.
id – Dataset ID override. Behavior depends on format:
- Full ID (e.g., "com.example.data"): used as-is if it has 3+ components
- Suffix with dot prefix (e.g., ".8.topics"): appended to module path
- Single component (e.g., "mnist"): replaces the class name in the path
url – The URL associated with the dataset.
size – The size of the dataset (should be a parsable format).
doi – The DOI of the corresponding paper.
as_prepare – Resources are set up within the method itself.
variants – Optional Variants instance (or subclass) that declares the variant space. When provided, callers select a specific variant via a query-style suffix on the id (e.g. "pkg.id[name=x,streaming=true]"); see datamaestro.variants.
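The three id formats listed above can be read as a small decision procedure. The sketch below is one possible interpretation, not the library's code: resolve_dataset_id, module_path and name are hypothetical, the default of joining module path and name is an assumption, and two-component ids are rejected only because the rules above do not cover them.

```python
from typing import Optional


def resolve_dataset_id(override: Optional[str], module_path: str, name: str) -> str:
    """Illustrative only: `module_path` and `name` are hypothetical inputs."""
    if override is None:
        # Assumed default: module path followed by the class/function name
        return f"{module_path}.{name}"
    if override.startswith("."):
        # Dot-prefixed suffix: appended to the module path
        return f"{module_path}{override}"
    components = override.split(".")
    if len(components) >= 3:
        # Full ID: used as-is
        return override
    if len(components) == 1:
        # Single component: takes the place of the class name
        return f"{module_path}.{override}"
    raise ValueError(f"Format not covered by the documented rules: {override!r}")
```

The dot-prefix test comes first because a value such as ".8.topics" would otherwise split into three components and be mistaken for a full ID.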
@metadataset
Decorator for abstract/shared dataset definitions:
- class datamaestro.definitions.metadataset(base)
Annotation for objects/functions that are abstract dataset definitions, i.e. shared by more than one real dataset. This is useful for sharing tags, URLs, etc.
Repository
- class datamaestro.context.Repository(context: Context)
(deprecated) Repository where datasets are located in __module__.config
- search(name: str)
Search for a dataset in the definitions.
Accepts either a bare id ("pkg.id") or a variant-selector form ("pkg.id[k=v,...]"). The selector suffix is stripped when matching against aliases; callers that need the parsed variant kwargs should use find_dataset() / prepare_dataset() (which return the resolved config).
Search Conditions
For programmatic dataset search:
from datamaestro.search import Condition, AndCondition, TagCondition
# Parse search term
condition = Condition.parse("tag:classification")
# Build complex queries
query = AndCondition()
query.append(TagCondition("classification"))
query.append(Condition.parse("repo:image"))
# Match datasets (`context` is assumed to be a datamaestro.context.Context instance)
for ds in context.datasets():
    if query.match(ds):
        print(ds.id)
- class datamaestro.search.Condition
- class datamaestro.search.AndCondition
- class datamaestro.search.TagCondition(regex)
- class datamaestro.search.TaskCondition(regex)
- class datamaestro.search.RepositoryCondition(regex)
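To make the composition of conditions concrete, here is a toy re-implementation of the pattern. None of these classes are datamaestro's: ToyDataset, ToyTagCondition and ToyAndCondition are hypothetical stand-ins, and matching tags with re.search is an assumption about how a regex-based condition might behave.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical dataset record with an id and a set of tags."""
    id: str
    tags: set = field(default_factory=set)


class ToyTagCondition:
    """Matches when any tag of the dataset matches the regex."""

    def __init__(self, regex: str):
        self.regex = re.compile(regex)

    def match(self, dataset: ToyDataset) -> bool:
        return any(self.regex.search(tag) for tag in dataset.tags)


class ToyAndCondition:
    """Matches only when every appended condition matches."""

    def __init__(self):
        self.conditions = []

    def append(self, condition) -> None:
        self.conditions.append(condition)

    def match(self, dataset: ToyDataset) -> bool:
        return all(c.match(dataset) for c in self.conditions)


mnist = ToyDataset("com.lecun.mnist", {"classification", "image"})
query = ToyAndCondition()
query.append(ToyTagCondition("classification"))
query.append(ToyTagCondition("image"))
matched = query.match(mnist)
```

Each condition exposes the same match() method, so a conjunction can hold any mix of them without knowing their concrete types.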