Getting Started

This guide will help you get up and running with datamaestro quickly.

Installation

Install the core package:

pip install datamaestro

Install domain-specific plugins as needed:

pip install datamaestro-text   # NLP datasets
pip install datamaestro-image  # Image datasets
pip install datamaestro-ml     # ML datasets

Basic Usage

Finding Datasets

Use the search command to find datasets:

# Search by name
datamaestro search mnist

# Search by tag
datamaestro search tag:classification

# Search by task
datamaestro search task:image-classification

# Search in a specific repository
datamaestro search repo:image mnist

# Combine search terms (AND)
datamaestro search mnist tag:benchmark

Getting Dataset Information

datamaestro info com.lecun.mnist

Output:

com.lecun.mnist
http://yann.lecun.com/exdb/mnist/
Types (ids): datamaestro_image.data.ImageClassification
Tags: benchmark, classification
Tasks: image-classification

The MNIST database of handwritten digits...

Downloading Datasets

Command Line

# Download only (no preparation)
datamaestro download com.lecun.mnist

# Download and prepare (returns JSON)
datamaestro prepare com.lecun.mnist

Python API

Use prepare_dataset() to download and access datasets:

from datamaestro import prepare_dataset

# Download and prepare the dataset
ds = prepare_dataset("com.lecun.mnist")

# Access training data
train_images = ds.train.images.data()  # numpy array
train_labels = ds.train.labels.data()  # numpy array

print(f"Training samples: {train_images.shape[0]}")
print(f"Image shape: {train_images.shape[1:]}")

Working with Different Data Types

CSV Data

from datamaestro import prepare_dataset

ds = prepare_dataset("some.csv.dataset")

# Get the file path
csv_path = ds.path

# Or use pandas integration if available
import pandas as pd
df = pd.read_csv(ds.path)

Tensor Data (IDX format)

The IDX type handles MNIST-style IDX files:

ds = prepare_dataset("com.lecun.mnist")

# IDX files are automatically parsed to numpy arrays
images = ds.train.images.data()  # Returns numpy array
labels = ds.train.labels.data()  # Returns numpy array

HuggingFace Datasets

Some datasets integrate with HuggingFace:

ds = prepare_dataset("some.huggingface.dataset")

# Access the HuggingFace dataset object
hf_dataset = ds.dataset

Using with Experimaestro

Datamaestro integrates seamlessly with experimaestro for experiment management:

from experimaestro import experiment
from datamaestro import prepare_dataset

@experiment()
def my_experiment():
    # Datasets are automatically tracked
    ds = prepare_dataset("com.lecun.mnist")

    # Use in your experiment
    train_model(ds.train)

Data Storage Location

By default, datasets are stored in ~/datamaestro/data/. You can change this:

Environment Variable

export DATAMAESTRO_DIR=/path/to/data
datamaestro prepare com.lecun.mnist

Command Line Option

datamaestro --data /path/to/data prepare com.lecun.mnist

Python API

from pathlib import Path
from datamaestro import prepare_dataset

ds = prepare_dataset("com.lecun.mnist", context=Path("/path/to/data"))

Next Steps