Configuration
Datamaestro uses a hierarchical configuration system with global settings, user preferences, and per-invocation options.
Data Directory
The main data directory stores all downloaded datasets. Default location: ~/datamaestro/
Setting the Data Directory
Environment variable (recommended for system-wide configuration):
export DATAMAESTRO_DIR=/path/to/data
Command line option:
datamaestro --data /path/to/data prepare com.lecun.mnist
Python API:
from pathlib import Path
from datamaestro import prepare_dataset
ds = prepare_dataset("com.lecun.mnist", context=Path("/path/to/data"))
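The three mechanisms above can coexist, so it helps to picture how one of them wins. The sketch below is illustrative only: the precedence (command line option, then environment variable, then default) is an assumption, and `resolve_data_dir` is a hypothetical helper, not part of the datamaestro API.

```python
import os
from pathlib import Path

# Default location as stated in the text above.
DEFAULT_DATA_DIR = Path("~/datamaestro")


def resolve_data_dir(cli_value=None, environ=None):
    """Hypothetical helper: pick the data directory from the most specific source.

    Assumed order: per-invocation option, then DATAMAESTRO_DIR, then the default.
    """
    environ = os.environ if environ is None else environ
    if cli_value:  # e.g. the value passed to --data
        return Path(cli_value).expanduser()
    env_value = environ.get("DATAMAESTRO_DIR")
    if env_value:
        return Path(env_value).expanduser()
    return DEFAULT_DATA_DIR.expanduser()
```

Passing `environ` explicitly keeps the function easy to test without touching the real environment.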
Directory Structure
~/datamaestro/
├── data/                 # Downloaded datasets
│   ├── image/            # Image repository datasets
│   │   └── com/lecun/mnist/
│   ├── text/             # Text repository datasets
│   └── ml/               # ML repository datasets
├── cache/                # Temporary download cache
└── settings.json         # Global settings
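The layout suggests that a dataset identifier maps to a nested path under its repository, with dots becoming directory separators. The following sketch reproduces that mapping; `dataset_dir` is a hypothetical helper inferred from the tree shown here, and the actual library may compute paths differently.

```python
from pathlib import Path


def dataset_dir(base, repository, dataset_id):
    """Hypothetical helper mirroring the layout: <base>/data/<repository>/<id as path>."""
    # "com.lecun.mnist" becomes com/lecun/mnist
    return Path(base) / "data" / repository / Path(*dataset_id.split("."))


print(dataset_dir("~/datamaestro", "image", "com.lecun.mnist"))
```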
Settings Files
Global Settings
Located at $DATAMAESTRO_DIR/settings.json:
{
  "datafolders": {
    "large_data": "/mnt/storage/datasets",
    "local_cache": "/tmp/datamaestro"
  }
}
User Settings
Located at ~/.config/datamaestro/user.json:
{
  "default_repository": "text"
}
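Both settings files are plain JSON, so they can be inspected with the standard library. The sketch below loads the two files and merges them; the rule that user settings override global settings is an assumption made for illustration, and `load_json` and `load_settings` are hypothetical helpers rather than datamaestro functions.

```python
import json
from pathlib import Path


def load_json(path):
    """Return the parsed JSON object at `path`, or {} if the file does not exist."""
    path = Path(path)
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def load_settings(global_path, user_path):
    """Hypothetical shallow merge: keys in the user file replace global keys."""
    merged = dict(load_json(global_path))
    merged.update(load_json(user_path))
    return merged
```

A call such as `load_settings(Path("~/datamaestro/settings.json").expanduser(), Path("~/.config/datamaestro/user.json").expanduser())` would read the two locations named above.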
Data Folders
Data folders allow datasets to reference pre-existing data locations without copying files.
Configuring Data Folders
Command line:
# Set a data folder
datamaestro datafolders set my_data /path/to/existing/data
# List configured folders
datamaestro datafolders list
In settings.json:
{
  "datafolders": {
    "my_data": "/path/to/existing/data"
  }
}
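Conceptually, a data folder is a named base path that relative locations are joined onto. The sketch below shows that lookup against a settings dictionary shaped like the JSON above; `resolve_in_datafolder` is a hypothetical helper written for illustration and is not the library's own resolution mechanism.

```python
from pathlib import Path


def resolve_in_datafolder(settings, folder_key, relative_path):
    """Hypothetical lookup: join a relative path onto a configured data folder."""
    folders = settings.get("datafolders", {})
    if folder_key not in folders:
        raise KeyError(f"data folder {folder_key!r} is not configured")
    return Path(folders[folder_key]) / relative_path


settings = {"datafolders": {"my_data": "/path/to/existing/data"}}
print(resolve_in_datafolder(settings, "my_data", "corpus/train.txt"))
```

Failing loudly on an unknown key makes a missing configuration obvious instead of silently producing a wrong path.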
Using Data Folders in Dataset Definitions
Use DatafolderPath to reference configured folders in a dataset definition.
Context API
The Context class manages all configuration state:
from datamaestro.context import Context
# Get the singleton context instance
ctx = Context.instance()
# Access paths
print(ctx.datapath) # ~/datamaestro/data
print(ctx.cachepath) # ~/datamaestro/cache
# Access settings
print(ctx.settings.datafolders)
# Iterate over repositories
for repo in ctx.repositories():
    print(repo.id, repo.name)
# Find a dataset
ds = ctx.dataset("com.lecun.mnist")
Caching
Downloaded files are cached in $DATAMAESTRO_DIR/cache/ to avoid re-downloading.
Cache Behavior
Files are identified by URL hash
Cache is checked before downloading
Use --keep-downloads to preserve archive files after extraction
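The cache behavior listed above can be sketched as a small check-then-download routine. This is a minimal illustration under stated assumptions: SHA-256 stands in for whatever hash datamaestro actually uses, and `cache_path`, `fetch`, and the `download` callback are hypothetical names.

```python
import hashlib
from pathlib import Path


def cache_path(cache_dir, url):
    """Hypothetical cache location: hex SHA-256 of the URL (the real hash may differ)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / digest


def fetch(cache_dir, url, download):
    """Return the cached file for `url`, calling `download(url, target)` only on a miss."""
    target = cache_path(cache_dir, url)
    if not target.exists():  # the cache is checked before downloading
        target.parent.mkdir(parents=True, exist_ok=True)
        download(url, target)
    return target
```

Keying on the URL means two different URLs that serve identical content are cached separately, which is simple but can store duplicates.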
Clearing the Cache
rm -rf "${DATAMAESTRO_DIR:-$HOME/datamaestro}"/cache/*
Remote Execution
Datamaestro supports remote execution via experimaestro’s rpyc integration:
datamaestro --host remote-server --pythonpath /usr/bin/python3 prepare com.lecun.mnist
This connects to the remote host and executes datamaestro there, useful for:
Downloading data directly to a compute cluster
Accessing datasets on remote storage
Repository Registration
Repositories are registered via Python entry points in pyproject.toml:
[project.entry-points."datamaestro.repositories"]
myrepo = "mypackage:MyRepository"
The repository class must inherit from Repository:
from datamaestro.context import Repository
class MyRepository(Repository):
    NAMESPACE = "myrepo"
    DESCRIPTION = "My custom datasets"