Getting Started
This guide will help you get up and running with datamaestro quickly.
Installation
Install the core package:
pip install datamaestro
Install domain-specific plugins as needed:
pip install datamaestro-text # NLP datasets
pip install datamaestro-image # Image datasets
pip install datamaestro-ml # ML datasets
Basic Usage
Finding Datasets
Use the search command to find datasets:
# Search by name
datamaestro search mnist
# Search by tag
datamaestro search tag:classification
# Search by task
datamaestro search task:image-classification
# Search in a specific repository
datamaestro search repo:image mnist
# Combine search terms (AND)
datamaestro search mnist tag:benchmark
Getting Dataset Information
datamaestro info com.lecun.mnist
Output:
com.lecun.mnist
http://yann.lecun.com/exdb/mnist/
Types (ids): datamaestro_image.data.ImageClassification
Tags: benchmark, classification
Tasks: image-classification
The MNIST database of handwritten digits...
Downloading Datasets
Command Line
# Download only (no preparation)
datamaestro download com.lecun.mnist
# Download and prepare (returns JSON)
datamaestro prepare com.lecun.mnist
Python API
Use prepare_dataset() to download and access datasets:
from datamaestro import prepare_dataset
# Download and prepare the dataset
ds = prepare_dataset("com.lecun.mnist")
# Access training data
train_images = ds.train.images.data() # numpy array
train_labels = ds.train.labels.data() # numpy array
print(f"Training samples: {train_images.shape[0]}")
print(f"Image shape: {train_images.shape[1:]}")
Working with Different Data Types
CSV Data
from datamaestro import prepare_dataset
ds = prepare_dataset("some.csv.dataset")
# Get the file path
csv_path = ds.path
# Or use pandas integration if available
import pandas as pd
df = pd.read_csv(ds.path)
Tensor Data (IDX format)
The IDX type handles MNIST-style IDX files:
ds = prepare_dataset("com.lecun.mnist")
# IDX files are automatically parsed to numpy arrays
images = ds.train.images.data() # Returns numpy array
labels = ds.train.labels.data() # Returns numpy array
HuggingFace Datasets
Some datasets integrate with HuggingFace:
ds = prepare_dataset("some.huggingface.dataset")
# Access the HuggingFace dataset object
hf_dataset = ds.dataset
Using with Experimaestro
Datamaestro integrates seamlessly with experimaestro for experiment management:
from experimaestro import experiment
from datamaestro import prepare_dataset
@experiment()
def my_experiment():
# Datasets are automatically tracked
ds = prepare_dataset("com.lecun.mnist")
# Use in your experiment
train_model(ds.train)
Data Storage Location
By default, datasets are stored in ~/datamaestro/data/. You can change this:
Environment Variable
export DATAMAESTRO_DIR=/path/to/data
datamaestro prepare com.lecun.mnist
Command Line Option
datamaestro --data /path/to/data prepare com.lecun.mnist
Python API
from pathlib import Path
from datamaestro import prepare_dataset
ds = prepare_dataset("com.lecun.mnist", context=Path("/path/to/data"))
Next Steps
Dataset Definition: Learn how to define your own datasets
CLI Reference: Complete command line reference
Configuration: Advanced configuration options
API Reference: Full API documentation