# Configuration

Datamaestro uses a hierarchical configuration system with global settings, user preferences, and per-invocation options.

## Data Directory

The main data directory holds the downloaded datasets, the download cache, and the global settings file. It defaults to `~/datamaestro/`.

### Setting the Data Directory

**Environment variable (recommended for system-wide configuration):**

```bash
export DATAMAESTRO_DIR=/path/to/data
```

**Command line option:**

```bash
datamaestro --data /path/to/data prepare com.lecun.mnist
```

**Python API:**

```python
from pathlib import Path
from datamaestro import prepare_dataset

ds = prepare_dataset("com.lecun.mnist", context=Path("/path/to/data"))
```
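
The three options above can be pictured as a single lookup. The sketch below is not datamaestro's own code: `resolve_data_dir` is a hypothetical helper, and the precedence it applies (explicit path first, then `DATAMAESTRO_DIR`, then `~/datamaestro`) is an assumption that this page does not confirm.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(
    explicit: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Path:
    """Hypothetical helper: choose the data directory from the sources above.

    Assumed precedence (not confirmed by datamaestro): explicit path,
    then DATAMAESTRO_DIR, then the documented default ~/datamaestro.
    """
    env = os.environ if environ is None else environ
    if explicit:
        return Path(explicit).expanduser()
    if env.get("DATAMAESTRO_DIR"):
        return Path(env["DATAMAESTRO_DIR"]).expanduser()
    return Path.home() / "datamaestro"


print(resolve_data_dir("/path/to/data"))
print(resolve_data_dir(environ={"DATAMAESTRO_DIR": "/srv/data"}))
print(resolve_data_dir(environ={}))
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.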

## Directory Structure

```text
~/datamaestro/
├── data/                    # Downloaded datasets
│   ├── image/              # Image repository datasets
│   │   └── com/lecun/mnist/
│   ├── text/               # Text repository datasets
│   └── ml/                 # ML repository datasets
├── cache/                   # Temporary download cache
└── settings.json           # Global settings
```
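
Read from the tree above, a dataset's directory appears to be built from the repository name followed by the dot-separated parts of the dataset ID. The following sketch encodes that reading; `dataset_dir` is a hypothetical helper, and the mapping is inferred from the single `com/lecun/mnist` example rather than taken from datamaestro's implementation.

```python
from pathlib import PurePosixPath


def dataset_dir(base: str, repository: str, dataset_id: str) -> PurePosixPath:
    """Hypothetical helper mirroring the tree above.

    Assumes each dot-separated part of the dataset ID becomes one
    directory level under data/<repository>/.
    """
    return PurePosixPath(base, "data", repository, *dataset_id.split("."))


print(dataset_dir("~/datamaestro", "image", "com.lecun.mnist"))
# → ~/datamaestro/data/image/com/lecun/mnist
```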

## Settings Files

### Global Settings

Located at `$DATAMAESTRO_DIR/settings.json`:

```json
{
  "datafolders": {
    "large_data": "/mnt/storage/datasets",
    "local_cache": "/tmp/datamaestro"
  }
}
```

### User Settings

Located at `~/.config/datamaestro/user.json`:

```json
{
  "default_repository": "text"
}
```
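
Both files are plain JSON, so they can be inspected with the standard library alone. The sketch below assumes that a missing file should be treated as empty settings; `load_json_settings` is a hypothetical helper, not part of datamaestro, and the demonstration writes to a temporary directory rather than to the real locations.

```python
import json
import tempfile
from pathlib import Path


def load_json_settings(path: Path) -> dict:
    """Return the parsed JSON object, or {} when the file does not exist."""
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


# Demonstration against a throwaway directory instead of real settings.
with tempfile.TemporaryDirectory() as tmp:
    settings_file = Path(tmp) / "settings.json"
    settings_file.write_text(
        json.dumps({"datafolders": {"large_data": "/mnt/storage/datasets"}}),
        encoding="utf-8",
    )
    global_settings = load_json_settings(settings_file)
    user_settings = load_json_settings(Path(tmp) / "user.json")  # missing file

print(global_settings.get("datafolders", {}))
print(user_settings.get("default_repository"))
```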

## Data Folders

Data folders allow datasets to reference pre-existing data locations without copying files.

### Configuring Data Folders

**Command line:**

```bash
# Set a data folder
datamaestro datafolders set my_data /path/to/existing/data

# List configured folders
datamaestro datafolders list
```

**In `settings.json`:**

```json
{
  "datafolders": {
    "my_data": "/path/to/existing/data"
  }
}
```

### Using Data Folders in Dataset Definitions

Use {py:class}`~datamaestro.context.DatafolderPath` to reference configured folders:

```python
from datamaestro.context import DatafolderPath
from datamaestro.definitions import dataset

@dataset(MyDataType)
def my_dataset():
    # Reference a file in a configured data folder
    path = DatafolderPath("my_data", "subdir/file.csv")
    return MyDataType(path=path)
```
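
Conceptually, a reference of this kind pairs a folder key with a relative path, and the key is looked up in the `datafolders` mapping. The sketch below illustrates that idea with the standard library only; `resolve_datafolder` is a hypothetical stand-in, and the real `DatafolderPath` may resolve paths differently.

```python
from pathlib import PurePosixPath
from typing import Mapping


def resolve_datafolder(
    datafolders: Mapping[str, str], key: str, relative: str
) -> PurePosixPath:
    """Hypothetical stand-in for resolving a (folder key, relative path) pair."""
    if key not in datafolders:
        raise KeyError(f"Data folder {key!r} is not configured")
    return PurePosixPath(datafolders[key]) / relative


datafolders = {"my_data": "/path/to/existing/data"}
print(resolve_datafolder(datafolders, "my_data", "subdir/file.csv"))
# → /path/to/existing/data/subdir/file.csv
```

Failing early on an unknown key makes a missing `datafolders` entry visible before any file access is attempted.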

## Context API

The {py:class}`~datamaestro.context.Context` class manages all configuration state:

```python
from datamaestro.context import Context

# Get the singleton context instance
ctx = Context.instance()

# Access paths
print(ctx.datapath)    # ~/datamaestro/data
print(ctx.cachepath)   # ~/datamaestro/cache

# Access settings
print(ctx.settings.datafolders)

# Iterate over repositories
for repo in ctx.repositories():
    print(repo.id, repo.name)

# Find a dataset
ds = ctx.dataset("com.lecun.mnist")
```

## Caching

Downloaded files are cached in `$DATAMAESTRO_DIR/cache/` to avoid re-downloading.

### Cache Behavior

- Files are identified by URL hash
- Cache is checked before downloading
- Use `--keep-downloads` to preserve archive files after extraction
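
The first two points describe a check-then-download pattern keyed on a hash of the URL. The sketch below shows that pattern in isolation. The choice of SHA-256, the file naming, and the `fetch` and `cache_path` helpers are all assumptions made for illustration, not datamaestro's actual scheme.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a cache file name (SHA-256 is an assumption)."""
    return cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()


def fetch(cache_dir: Path, url: str, download: Callable[[str], bytes]) -> Path:
    """Return the cached file for url, calling download only on a miss."""
    target = cache_path(cache_dir, url)
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(download(url))
    return target


calls = []


def fake_download(url: str) -> bytes:
    """Stand-in for a real HTTP download; records each call."""
    calls.append(url)
    return b"payload"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch(Path(tmp), "https://example.org/data.tar.gz", fake_download)
    second = fetch(Path(tmp), "https://example.org/data.tar.gz", fake_download)

print(first == second, len(calls))
# → True 1
```

The second call finds the file already in place, so the download function runs only once.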

### Clearing the Cache

```bash
# Honors DATAMAESTRO_DIR when set, otherwise uses the default location
rm -rf "${DATAMAESTRO_DIR:-$HOME/datamaestro}"/cache/*
```
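
Where a shell is not available, the same cleanup can be done from Python. This is a minimal sketch using only the standard library; `clear_cache` is a hypothetical helper, and the demonstration runs on a temporary directory so that no real cache is touched.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache(cache_dir: Path) -> int:
    """Delete every entry inside cache_dir and return how many were removed."""
    removed = 0
    if not cache_dir.is_dir():
        return removed
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


# Demonstration on a temporary directory, not on a real cache.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    (cache / "partial").mkdir(parents=True)
    (cache / "archive.tar.gz").write_bytes(b"")
    removed = clear_cache(cache)
    remaining = list(cache.iterdir())

print(removed, remaining)
# → 2 []
```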

## Remote Execution

Datamaestro supports remote execution via experimaestro's rpyc integration:

```bash
datamaestro --host remote-server --pythonpath /usr/bin/python3 prepare com.lecun.mnist
```

This connects to the remote host and runs datamaestro there, which is useful for:

- Downloading data directly to a compute cluster
- Accessing datasets on remote storage

## Repository Registration

Repositories are registered via Python entry points in `pyproject.toml`:

```toml
[project.entry-points."datamaestro.repositories"]
myrepo = "mypackage:MyRepository"
```

The repository class must inherit from `Repository`:

```python
from datamaestro.context import Repository

class MyRepository(Repository):
    NAMESPACE = "myrepo"
    DESCRIPTION = "My custom datasets"
```