# Development Guide

This guide covers how to contribute to datamaestro or develop dataset plugins.

## Setting Up the Development Environment

### Core Library

```bash
# Clone the repository
git clone https://github.com/experimaestro/datamaestro.git
cd datamaestro

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies and set up development environment
uv sync --group dev

# Install pre-commit hooks
uv run pre-commit install
```

### Plugin Development

To develop a plugin (e.g., `datamaestro_text`):

```bash
# Clone the plugin repository
git clone https://github.com/experimaestro/datamaestro_text.git
cd datamaestro_text

# Install with uv (will also install datamaestro as dependency)
uv sync --group dev
```

## Code Quality

### Formatting and Linting

Code is formatted and linted with [ruff](https://docs.astral.sh/ruff/), using a maximum line length of 88 characters:

```bash
# Check for issues
uv run ruff check src/

# Fix auto-fixable issues
uv run ruff check --fix src/

# Format code
uv run ruff format src/

# Check formatting without changing files
uv run ruff format --check src/
```

Ruff rules enabled:
- `E`, `W` - pycodestyle errors and warnings
- `F` - Pyflakes
- `T20` - flake8-print (warns about print statements)
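
As an illustration of the `T20` rule, the sketch below reports a value through the standard `logging` module rather than `print`. The function name `summarize` and the logger setup are hypothetical and not taken from the datamaestro code base.

```python
import logging

# Module-level logger; handlers and levels are left to the application.
logger = logging.getLogger(__name__)


def summarize(sizes: list[int]) -> int:
    """Return the total size, logging it instead of printing it."""
    total = sum(sizes)
    # A call such as print(total) here would be reported by ruff as T201.
    logger.info("total size: %d bytes", total)
    return total
```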

### Pre-commit Hooks

Once installed, the hooks run automatically on every commit. To run them manually:

```bash
# Run all hooks on all files
uv run pre-commit run --all-files

# Run hooks on staged files only
uv run pre-commit run
```

### Commit Messages

We use [conventional commits](https://www.conventionalcommits.org/):

```
type(scope): description

[optional body]

[optional footer]
```

Types:
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation only
- `style`: Formatting, missing semicolons, etc.
- `refactor`: Code change that neither fixes a bug nor adds a feature
- `test`: Adding missing tests
- `chore`: Maintenance tasks

Examples:
```
feat(download): add support for HuggingFace datasets
fix(archive): handle corrupted zip files gracefully
docs(readme): update installation instructions
```
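
The header format above can be checked mechanically. The following is a minimal sketch, assuming the scope is optional (as in the Conventional Commits specification) and that only the types listed in this guide are accepted. The names `COMMIT_TYPES`, `HEADER_RE`, and `is_valid_header` are hypothetical and are not part of datamaestro.

```python
import re

# Types listed in this guide.
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# type, optional "(scope)", then ": " and a non-empty description.
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(\((?P<scope>[^()\s]+)\))?"
    r": (?P<description>\S.*)$"
)


def is_valid_header(header: str) -> bool:
    """Return True if the first line of a commit message matches the format."""
    return HEADER_RE.match(header) is not None
```

Such a check could be wired into a `commit-msg` hook, although this guide does not state whether the project does so.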

## Running Tests

```bash
# Run all tests
uv run pytest

# Run a specific test file
uv run pytest src/datamaestro/test/test_record.py

# Run a specific test
uv run pytest src/datamaestro/test/test_record.py -k test_name

# Run with coverage
uv run pytest --cov=datamaestro

# Run with verbose output
uv run pytest -v
```

### Test Fixtures

The `conftest.py` provides fixtures such as `context`, a `Context` instance backed by a temporary datamaestro directory:

```python
def test_with_context(context):
    """Test using temporary datamaestro directory"""
    # context is a Context instance with temp directory
    ds = context.dataset("com.example.test")
    ...
```

### Testing Downloads

By default, tests skip actual downloads. To test with real downloads:

```bash
uv run pytest --datamaestro-download
```

## Project Structure

```
datamaestro/
├── src/datamaestro/
│   ├── __init__.py          # Package init, version
│   ├── __main__.py          # CLI entry point
│   ├── context.py           # Context and Repository classes
│   ├── definitions.py       # Dataset decorators and models
│   ├── record.py            # Record system (deprecated)
│   ├── search.py            # Search conditions
│   ├── settings.py          # Settings management
│   ├── utils.py             # Utility functions
│   ├── sphinx.py            # Sphinx documentation extension
│   ├── data/                # Data type definitions
│   │   ├── __init__.py      # Base data types
│   │   ├── csv.py           # CSV data types
│   │   ├── tensor.py        # Tensor data types (IDX)
│   │   ├── ml.py            # ML data types
│   │   └── huggingface.py   # HuggingFace integration
│   ├── download/            # Download handlers
│   │   ├── __init__.py      # Base Download class
│   │   ├── single.py        # Single file downloads
│   │   ├── archive.py       # Archive extraction
│   │   ├── multiple.py      # Multiple file downloads
│   │   ├── links.py         # Dataset links
│   │   ├── huggingface.py   # HuggingFace downloads
│   │   └── wayback.py       # Internet Archive
│   ├── test/                # Tests
│   └── templates/           # Dataset templates
├── docs/                    # Sphinx documentation
└── pyproject.toml           # Project configuration
```

## Creating a New Repository Plugin

### 1. Project Structure

```
datamaestro_myplugin/
├── src/datamaestro_myplugin/
│   ├── __init__.py          # Repository class
│   ├── config/              # Dataset definitions
│   │   └── com/
│   │       └── example.py   # com.example.* datasets
│   └── data/                # Data type definitions
│       └── __init__.py
├── pyproject.toml
└── README.md
```

### 2. Repository Class

```python
# src/datamaestro_myplugin/__init__.py
from datamaestro.context import Repository

class MyPluginRepository(Repository):
    NAMESPACE = "myplugin"
    DESCRIPTION = "My custom datasets"
```

### 3. Entry Point Registration

```toml
# pyproject.toml
[project.entry-points."datamaestro.repositories"]
myplugin = "datamaestro_myplugin:MyPluginRepository"
```
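
To confirm that the entry point is visible after installing the plugin, the group can be inspected with the standard library. This is a sketch for checking the registration only; it does not claim to reproduce how datamaestro itself loads repositories, and `list_repositories` is a hypothetical helper name.

```python
from importlib.metadata import entry_points


def list_repositories(group: str = "datamaestro.repositories") -> dict[str, str]:
    """Map each entry point name in `group` to its "module:attribute" target."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points can be filtered with select()
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}
```

With the plugin above installed in the active environment, the returned mapping would be expected to contain a `myplugin` key pointing at `datamaestro_myplugin:MyPluginRepository`.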

### 4. Dataset Definition

```python
# src/datamaestro_myplugin/config/com/example.py
from datamaestro.definitions import dataset
from datamaestro.download.single import filedownloader
from datamaestro.data import Base

@filedownloader("data.csv", "http://example.com/data.csv")
@dataset(Base, url="http://example.com")
def my_dataset(data):
    """My example dataset

    Description of the dataset.
    """
    return Base(path=data)
```

## Adding a New Download Handler

```python
# src/datamaestro/download/myhandler.py
from datamaestro.download import Download

class MyDownload(Download):
    def __init__(self, varname: str, url: str):
        super().__init__(varname)
        self.url = url

    def prepare(self):
        """Prepare the resource and return the path/data"""
        # Download logic here
        path = self.download_file(self.url)
        return path

    def download(self, force=False):
        """Download the resource"""
        # Actual download implementation
        ...

def mydownloader(varname: str, url: str):
    """Decorator for my custom download handler"""
    def decorator(dataset):
        download = MyDownload(varname, url)
        download.register(dataset)
        return dataset
    return decorator
```

## Documentation

### Building Documentation

```bash
cd docs
make html
```

Output is in `docs/build/html/`.

### Sphinx Extensions

Datamaestro provides a custom Sphinx extension for documenting datasets:

```rst
.. dm:repository:: text

.. dm:datasets::
```

This automatically generates documentation from registered datasets.

## Release Process

1. Update version in `pyproject.toml`
2. Update CHANGELOG.md
3. Create a git tag: `git tag v1.2.3`
4. Push with tags: `git push --tags`
5. GitHub Actions will build and publish to PyPI

## Troubleshooting

### Import Errors

If you get import errors after installing in development mode:

```bash
uv sync --reinstall
```

### Pre-commit Hook Failures

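If pre-commit hooks fail for reasons unrelated to your changes (for example, a stale hook environment), update the hooks and clear the cache. Note that `autoupdate` rewrites the hook versions in `.pre-commit-config.yaml`: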

```bash
# Update hooks
uv run pre-commit autoupdate

# Clear cache
uv run pre-commit clean
```

### Test Discovery Issues

Ensure test files are named `test_*.py` and are in the `src/datamaestro/test/` directory.
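
A quick way to spot files that break this naming convention is sketched below. The helper `misnamed_test_files` and the `IGNORED` set are hypothetical. The check follows the `test_*.py` convention stated in this guide, so it will also list legitimate helper modules, which then need to be reviewed by hand.

```python
from fnmatch import fnmatch
from pathlib import Path

# Support files that are not expected to follow the test_*.py pattern.
IGNORED = {"__init__.py", "conftest.py"}


def misnamed_test_files(test_dir: Path) -> list[Path]:
    """Return .py files under test_dir whose names do not match test_*.py."""
    return sorted(
        path
        for path in test_dir.rglob("*.py")
        if path.name not in IGNORED and not fnmatch(path.name, "test_*.py")
    )
```

For this repository it would be called as `misnamed_test_files(Path("src/datamaestro/test"))`.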