Datamaestro
Overview
Datamaestro is a Python framework for managing, organizing, and downloading datasets. It provides:
Dataset Registry: A reference for available resources with qualified names (e.g., com.lecun.mnist)
Automatic Downloads: Tools to automatically download and preprocess resources when freely available
Resource Pipelines: A DAG-based resource system with dependency tracking, transient intermediaries, and two-path download safety
Experimaestro Integration: Seamless integration with the experimaestro experiment manager
Extensible Architecture: Plugin system for domain-specific datasets
Each dataset is uniquely identified by a qualified name derived from the website’s domain (e.g., com.lecun.mnist for MNIST from yann.lecun.com).
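To make the Resource Pipelines item in the feature list above more concrete, the following is a minimal sketch of dependency-ordered processing with transient cleanup. It does not use datamaestro itself: the resource names, the `transient` set, and the `run_pipeline` helper are hypothetical and only illustrate the idea with the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each resource maps to the resources it depends on.
dependencies = {
    "archive": set(),           # a downloaded file
    "extracted": {"archive"},   # unpacked from the archive
    "index": {"extracted"},     # built from the extracted files
}
# Hypothetical flag: transient resources are only needed while dependents run.
transient = {"archive"}


def run_pipeline(dependencies, transient):
    """Process resources in dependency order and return a log of actions."""
    # Count how many dependents still need each resource.
    pending = {name: 0 for name in dependencies}
    for deps in dependencies.values():
        for dep in deps:
            pending[dep] += 1

    log = []
    for name in TopologicalSorter(dependencies).static_order():
        log.append(f"prepare {name}")
        # Once a resource is prepared, its dependencies may no longer be needed.
        for dep in sorted(dependencies[name]):
            pending[dep] -= 1
            if pending[dep] == 0 and dep in transient:
                log.append(f"cleanup {dep}")
    return log


log = run_pipeline(dependencies, transient)
print(log)
```

In this sketch the archive is removed as soon as its only dependent has been prepared, which is the behaviour the term "transient intermediaries" suggests; the real framework may order or trigger cleanup differently.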
Installation
pip install datamaestro
For domain-specific datasets, install the corresponding plugins:
# NLP and information retrieval datasets
pip install datamaestro-text
# Image datasets (MNIST, CIFAR, etc.)
pip install datamaestro-image
# Generic machine learning datasets
pip install datamaestro-ml
Quick Start
Command Line
# Search for datasets
datamaestro search mnist
# Get information about a dataset
datamaestro info com.lecun.mnist
# Download and prepare a dataset
datamaestro prepare com.lecun.mnist
Python API
Use prepare_dataset() to download and access datasets:
from datamaestro import prepare_dataset
# Download and get the dataset
ds = prepare_dataset("com.lecun.mnist")
# Access the data
print(ds.train.images.data().shape) # (60000, 28, 28)
print(ds.test.labels.data().shape) # (10000,)
Available Repositories
The main datamaestro package provides generic processing capabilities. Domain-specific datasets are provided through plugins:
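| Repository | Description | Install |
|---|---|---|
| datamaestro-text | NLP and information retrieval datasets | pip install datamaestro-text |
| datamaestro-image | Image datasets (MNIST, CIFAR, etc.) | pip install datamaestro-image |
| datamaestro-ml | Generic ML datasets | pip install datamaestro-ml |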
Detailed Example
Python Definition of Datasets
Datasets are defined as Python classes with resource attributes that describe how to download and process data:
from datamaestro_image.data import ImageClassification, LabelledImages
from datamaestro.data.tensor import IDX
from datamaestro.download.single import FileDownloader
from datamaestro.definitions import Dataset, dataset


@dataset(url="http://yann.lecun.com/exdb/mnist/")
class MNIST(Dataset):
    """The MNIST database of handwritten digits."""

    # Each class attribute declares one resource to download
    TRAIN_IMAGES = FileDownloader(
        "train_images.idx",
        "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz",
    )
    TRAIN_LABELS = FileDownloader(
        "train_labels.idx",
        "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    )
    TEST_IMAGES = FileDownloader(
        "test_images.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz",
    )
    TEST_LABELS = FileDownloader(
        "test_labels.idx",
        "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    )

    def config(self) -> ImageClassification:
        # Wire the downloaded files into a structured data object
        return ImageClassification.C(
            train=LabelledImages(
                images=IDX(path=self.TRAIN_IMAGES.path),
                labels=IDX(path=self.TRAIN_LABELS.path),
            ),
            test=LabelledImages(
                images=IDX(path=self.TEST_IMAGES.path),
                labels=IDX(path=self.TEST_LABELS.path),
            ),
        )
Resources are automatically discovered from class attributes and form a dependency graph. The framework handles:
Two-path downloads: writes to a temporary path, moves to final path on success
State tracking: resource states (NONE/PARTIAL/COMPLETE) persisted in .state.json
Transient cleanup: intermediate files deleted after all dependents complete
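The two-path and state-tracking behaviour listed above can be sketched in plain Python. This is not datamaestro's implementation: the `write_two_path` helper and the layout of the state file (a flat mapping from file name to state) are assumptions made for illustration, and a byte string stands in for a real download.

```python
import json
import os
import tempfile
from pathlib import Path


def write_two_path(final_path: Path, payload: bytes, state_file: Path) -> None:
    """Write payload to a temporary path, then move it into place on success."""
    # Assumed state layout: {"<file name>": "NONE" | "PARTIAL" | "COMPLETE"}
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    key = final_path.name
    state[key] = "PARTIAL"
    state_file.write_text(json.dumps(state))

    # Create the temporary file next to the target so the move stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
        os.replace(tmp_name, final_path)
    except BaseException:
        # On failure, remove the partial file and leave the state as PARTIAL.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise

    state[key] = "COMPLETE"
    state_file.write_text(json.dumps(state))


with tempfile.TemporaryDirectory() as root:
    root_path = Path(root)
    target = root_path / "train_images.idx"
    state_file = root_path / ".state.json"
    write_two_path(target, b"\x00\x01", state_file)
    result_state = json.loads(state_file.read_text())
    result_bytes = target.read_bytes()
    leftovers = sorted(p.name for p in root_path.iterdir())

print(result_state, leftovers)
```

Because the final path only ever appears through the move, a reader of that path never sees a half-written file; an interrupted run leaves the state at PARTIAL so the resource can be retried.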
Retrieve and Download
The command line interface downloads resources automatically:
$ datamaestro search mnist
com.lecun.mnist
$ datamaestro prepare com.lecun.mnist
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
...
The prepare command outputs JSON with file paths:
{
  "train": {
    "images": {"path": "/home/user/datamaestro/data/image/com/lecun/mnist/train-images"},
    "labels": {"path": "/home/user/datamaestro/data/image/com/lecun/mnist/train-labels"}
  },
  "test": {
    "images": {"path": "/home/user/datamaestro/data/image/com/lecun/mnist/t10k-images"},
    "labels": {"path": "/home/user/datamaestro/data/image/com/lecun/mnist/t10k-labels"}
  },
  "id": "com.lecun.mnist"
}
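If this JSON is consumed by another tool, the file paths can be pulled out with the standard `json` module. The sketch below works on a shortened copy of the sample output above rather than calling the CLI; the `collect_paths` helper is hypothetical and not part of datamaestro.

```python
import json
from pathlib import Path

# Sample output copied from the example above (test split omitted for brevity).
output = """
{
  "train": {
    "images": {"path": "/home/user/datamaestro/data/image/com/lecun/mnist/train-images"},
    "labels": {"path": "/home/user/datamaestro/data/image/com/lecun/mnist/train-labels"}
  },
  "id": "com.lecun.mnist"
}
"""


def collect_paths(node, prefix=""):
    """Recursively map dotted keys such as 'train.images' to their paths."""
    found = {}
    if isinstance(node, dict):
        if "path" in node:
            found[prefix] = Path(node["path"])
        for key, value in node.items():
            if isinstance(value, dict):
                child = f"{prefix}.{key}" if prefix else key
                found.update(collect_paths(value, child))
    return found


info = json.loads(output)
paths = collect_paths(info)
for name, path in paths.items():
    print(name, "->", path.name)

# In this sample, the components of the dataset ID appear as directories in each path.
id_as_path = "/".join(info["id"].split("."))
print(id_as_path)
```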
Using in Python
from datamaestro import prepare_dataset
ds = prepare_dataset("com.lecun.mnist")
# Access numpy arrays directly (for IDX format)
print(ds.train.images.data().dtype) # uint8
print(ds.train.images.data().shape) # (60000, 28, 28)
print(ds.train.labels.data().shape) # (60000,)
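Since `data()` returns NumPy arrays, the usual array operations apply. The sketch below shows one common preprocessing step, scaling and flattening; it uses a small random array as a stand-in for `ds.train.images.data()` so that it runs without downloading anything, and the batch size of 8 is arbitrary.

```python
import numpy as np

# Stand-in for ds.train.images.data(): a small uint8 batch with MNIST-like shape.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)

# Scale pixel values to [0, 1] and flatten each image to a 784-element vector.
features = images.astype(np.float32) / 255.0
features = features.reshape(len(images), -1)

print(features.shape, features.dtype)
```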
Key Concepts
Dataset ID: Qualified name like com.lecun.mnist derived from the source URL
Repository: A collection of related datasets (e.g., datamaestro_image) - see Repository
Resources: Steps in a download/processing pipeline (files, archives, links) - see Download Resources
Data Types: Structured representations of data (CSV, tensors, etc.) - see Base
Context: Global configuration for data paths and settings - see Context
See the documentation sections for detailed information on each concept.