Development Guide
This guide covers how to contribute to datamaestro or develop dataset plugins.
Setting Up the Development Environment
Core Library
# Clone the repository
git clone https://github.com/experimaestro/datamaestro.git
cd datamaestro
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies and set up development environment
uv sync --group dev
# Install pre-commit hooks
uv run pre-commit install
Plugin Development
To develop a plugin (e.g., datamaestro_text):
# Clone the plugin repository
git clone https://github.com/experimaestro/datamaestro_text.git
cd datamaestro_text
# Install with uv (will also install datamaestro as dependency)
uv sync --group dev
Code Quality
Formatting and Linting
Code is formatted and linted with ruff with a maximum line length of 88 characters:
# Check for issues
uv run ruff check src/
# Fix auto-fixable issues
uv run ruff check --fix src/
# Format code
uv run ruff format src/
# Check formatting without changing files
uv run ruff format --check src/
Ruff rules enabled:
E, W - pycodestyle errors and warnings
F - Pyflakes
T20 - flake8-print (warns about print statements)
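Because T20 flags print calls, messages are usually routed through the standard logging module instead. The following is a minimal sketch of that pattern; the function name report_progress and the message format are illustrative assumptions, not part of datamaestro.

```python
import logging

logger = logging.getLogger(__name__)


def report_progress(name: str, done: int, total: int) -> str:
    """Log a progress message instead of printing it (hypothetical helper)."""
    message = f"{name}: {done}/{total} files processed"
    # logger.info(...) passes the T20 rules; print(message) would be reported as T201
    logger.info(message)
    return message
```

The message only becomes visible once the application configures logging, for example with logging.basicConfig(level=logging.INFO).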
Pre-commit Hooks
To run the pre-commit hooks manually:
# Run all hooks on all files
pre-commit run --all-files
# Run hooks on staged files only
pre-commit run
Commit Messages
We use conventional commits:
type(scope): description
[optional body]
[optional footer]
Types:
feat: New feature
fix: Bug fix
docs: Documentation only
style: Formatting, missing semicolons, etc.
refactor: Code change that neither fixes a bug nor adds a feature
test: Adding missing tests
chore: Maintenance tasks
Examples:
feat(download): add support for HuggingFace datasets
fix(archive): handle corrupted zip files gracefully
docs(readme): update installation instructions
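The header format above can be checked mechanically. The sketch below parses a commit header with a regular expression built from the types listed in this guide; the pattern and the function name parse_header are assumptions made for illustration, not a tool shipped with the project.

```python
import re

# Commit types listed in this guide
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# type, optional (scope), then ": " and a non-empty description
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(\((?P<scope>[^()\s]+)\))?"
    r": (?P<description>\S.*)$"
)


def parse_header(header: str):
    """Return (type, scope, description), or None if the header is malformed."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    # scope is None when the header has no "(scope)" part
    return match["type"], match["scope"], match["description"]
```

For example, parse_header("fix(archive): handle corrupted zip files gracefully") yields the type "fix" and the scope "archive", while a header without a recognized type returns None.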
Running Tests
# Run all tests
uv run pytest
# Run a specific test file
uv run pytest src/datamaestro/test/test_record.py
# Run a specific test
uv run pytest src/datamaestro/test/test_record.py -k test_name
# Run with coverage
uv run pytest --cov=datamaestro
# Run with verbose output
uv run pytest -v
Test Fixtures
The conftest.py provides useful fixtures:
def test_with_context(context):
    """Test using a temporary datamaestro directory"""
    # context is a Context instance backed by a temporary directory
    ds = context.dataset("com.example.test")
    ...
Testing Downloads
By default, tests skip actual downloads. To test with real downloads:
pytest --datamaestro-download
Project Structure
datamaestro/
├── src/datamaestro/
│   ├── __init__.py          # Package init, version
│   ├── __main__.py          # CLI entry point
│   ├── context.py           # Context and Repository classes
│   ├── definitions.py       # Dataset decorators and models
│   ├── record.py            # Record system (deprecated)
│   ├── search.py            # Search conditions
│   ├── settings.py          # Settings management
│   ├── utils.py             # Utility functions
│   ├── sphinx.py            # Sphinx documentation extension
│   ├── data/                # Data type definitions
│   │   ├── __init__.py      # Base data types
│   │   ├── csv.py           # CSV data types
│   │   ├── tensor.py        # Tensor data types (IDX)
│   │   ├── ml.py            # ML data types
│   │   └── huggingface.py   # HuggingFace integration
│   ├── download/            # Download handlers
│   │   ├── __init__.py      # Base Download class
│   │   ├── single.py        # Single file downloads
│   │   ├── archive.py       # Archive extraction
│   │   ├── multiple.py      # Multiple file downloads
│   │   ├── links.py         # Dataset links
│   │   ├── huggingface.py   # HuggingFace downloads
│   │   └── wayback.py       # Internet Archive
│   ├── test/                # Tests
│   └── templates/           # Dataset templates
├── docs/                    # Sphinx documentation
└── pyproject.toml           # Project configuration
Creating a New Repository Plugin
1. Project Structure
datamaestro_myplugin/
├── src/datamaestro_myplugin/
│   ├── __init__.py          # Repository class
│   ├── config/              # Dataset definitions
│   │   └── com/
│   │       └── example.py   # com.example.* datasets
│   └── data/                # Data type definitions
│       └── __init__.py
├── pyproject.toml
└── README.md
2. Repository Class
# src/datamaestro_myplugin/__init__.py
from datamaestro.context import Repository

class MyPluginRepository(Repository):
    NAMESPACE = "myplugin"
    DESCRIPTION = "My custom datasets"
3. Entry Point Registration
# pyproject.toml
[project.entry-points."datamaestro.repositories"]
myplugin = "datamaestro_myplugin:MyPluginRepository"
4. Dataset Definition
# src/datamaestro_myplugin/config/com/example.py
from datamaestro.definitions import dataset
from datamaestro.download.single import filedownloader
from datamaestro.data import Base

@filedownloader("data.csv", "http://example.com/data.csv")
@dataset(Base, url="http://example.com")
def my_dataset(data):
    """My example dataset

    Description of the dataset.
    """
    return Base(path=data)
Adding a New Download Handler
# src/datamaestro/download/myhandler.py
from datamaestro.download import Download
from datamaestro.definitions import DatasetAnnotation

class MyDownload(Download):
    def __init__(self, varname: str, url: str):
        super().__init__(varname)
        self.url = url

    def prepare(self):
        """Prepare the resource and return the path/data"""
        # Download logic here
        path = self.download_file(self.url)
        return path

    def download(self, force=False):
        """Download the resource"""
        # Actual download implementation
        ...


def mydownloader(varname: str, url: str):
    """Decorator for my custom download handler"""
    def decorator(dataset):
        download = MyDownload(varname, url)
        download.register(dataset)
        return dataset
    return decorator
Documentation
Building Documentation
cd docs
make html
Output is in docs/build/html/.
Sphinx Extensions
Datamaestro provides a custom Sphinx extension for documenting datasets:
.. dm:repository:: text
.. dm:datasets::
This automatically generates documentation from registered datasets.
Release Process
1. Update version in pyproject.toml
2. Update CHANGELOG.md
3. Create a git tag:
   git tag v1.2.3
4. Push with tags:
   git push --tags
5. GitHub Actions will build and publish to PyPI
Troubleshooting
Import Errors
If you get import errors after installing in development mode:
uv sync --reinstall
Pre-commit Hook Failures
If pre-commit hooks fail:
# Update hooks
uv run pre-commit autoupdate
# Clear cache
uv run pre-commit clean
Test Discovery Issues
Ensure test files are named test_*.py and are in the src/datamaestro/test/ directory.
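To spot files that would be missed under this naming rule, a small check along the following lines can help. It is a sketch only: the function name misnamed_test_files is hypothetical, and files it reports may be legitimate helper modules rather than tests.

```python
from pathlib import Path


def misnamed_test_files(test_dir: Path) -> list[str]:
    """Return Python files under test_dir whose names do not match test_*.py."""
    # Support files that are not expected to follow the test_*.py pattern
    ignored = {"__init__.py", "conftest.py"}
    return sorted(
        path.name
        for path in test_dir.rglob("*.py")
        if path.name not in ignored and not path.name.startswith("test_")
    )
```

Calling misnamed_test_files(Path("src/datamaestro/test")) from the repository root lists the candidates to rename or to confirm as helpers.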