# Getting Started

This guide covers installing datamaestro, finding and downloading datasets, and accessing them from Python.

## Installation

Install the core package:

```bash
pip install datamaestro
```

Install domain-specific plugins as needed:

```bash
pip install datamaestro-text   # NLP datasets
pip install datamaestro-image  # Image datasets
pip install datamaestro-ml     # ML datasets
```

## Basic Usage

### Finding Datasets

Use the `search` command to find datasets:

```bash
# Search by name
datamaestro search mnist

# Search by tag
datamaestro search tag:classification

# Search by task
datamaestro search task:image-classification

# Search in a specific repository
datamaestro search repo:image mnist

# Combine search terms (AND)
datamaestro search mnist tag:benchmark
```
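
To make the AND semantics concrete, the sketch below filters a small in-memory catalog the way the combined query above reads. It is an illustration only, not datamaestro's search implementation: `DatasetMeta`, `matches`, `search`, and the `org.example.reviews` entry are hypothetical, and plain substring matching on the dataset ID is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMeta:
    """Hypothetical stand-in for a dataset's searchable metadata."""
    id: str
    repo: str
    tags: set = field(default_factory=set)
    tasks: set = field(default_factory=set)


def matches(meta, term):
    """Check a single search term against one dataset."""
    prefix, sep, value = term.partition(":")
    if sep and prefix == "tag":
        return value in meta.tags
    if sep and prefix == "task":
        return value in meta.tasks
    if sep and prefix == "repo":
        return value == meta.repo
    # No recognized prefix: treat the term as a substring of the dataset ID
    return term in meta.id


def search(datasets, terms):
    """Keep the datasets that satisfy every term (AND semantics)."""
    return [d for d in datasets if all(matches(d, t) for t in terms)]


catalog = [
    DatasetMeta("com.lecun.mnist", "image",
                {"benchmark", "classification"}, {"image-classification"}),
    DatasetMeta("org.example.reviews", "text",
                {"classification"}, {"sentiment-analysis"}),
]

hits = search(catalog, ["mnist", "tag:benchmark"])
print([d.id for d in hits])  # ['com.lecun.mnist']
```

Each extra term can only narrow the result set, which is why `tag:classification` alone matches both entries while adding `mnist` leaves one.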

### Getting Dataset Information

```bash
datamaestro info com.lecun.mnist
```

Example output:

```
com.lecun.mnist
http://yann.lecun.com/exdb/mnist/
Types (ids): datamaestro_image.data.ImageClassification
Tags: benchmark, classification
Tasks: image-classification

The MNIST database of handwritten digits...
```

### Downloading Datasets

#### Command Line

```bash
# Download only (no preparation)
datamaestro download com.lecun.mnist

# Download and prepare (returns JSON)
datamaestro prepare com.lecun.mnist
```
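
Because `prepare` returns JSON, its output can be consumed from a script. The following is a minimal sketch, assuming that standard output contains exactly one JSON document; the structure of that document is not specified here, so the function returns whatever was decoded. `prepare_via_cli` is a hypothetical helper, not part of datamaestro.

```python
import json
import subprocess


def prepare_via_cli(dataset_id, run=subprocess.run):
    """Run `datamaestro prepare <dataset_id>` and decode its JSON output.

    Assumes stdout holds exactly one JSON document. `run` is injectable
    so the function can be exercised without the CLI installed.
    """
    completed = run(
        ["datamaestro", "prepare", dataset_id],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Calling `prepare_via_cli("com.lecun.mnist")` requires the `datamaestro` executable to be on the `PATH`.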

#### Python API

Use {py:func}`~datamaestro.context.prepare_dataset` to download and access datasets:

```python
from datamaestro import prepare_dataset

# Download and prepare the dataset
ds = prepare_dataset("com.lecun.mnist")

# Access training data
train_images = ds.train.images.data()  # numpy array
train_labels = ds.train.labels.data()  # numpy array

print(f"Training samples: {train_images.shape[0]}")
print(f"Image shape: {train_images.shape[1:]}")
```

## Working with Different Data Types

### CSV Data

```python
import pandas as pd
from datamaestro import prepare_dataset

ds = prepare_dataset("some.csv.dataset")  # placeholder dataset ID

# Get the path of the downloaded CSV file
csv_path = ds.path

# Load it with pandas (requires pandas to be installed)
df = pd.read_csv(csv_path)
```
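
If pandas is not installed, the same path can be read with the standard library. The sketch below uses a temporary file as a stand-in so it runs without downloading anything; `read_rows` is a hypothetical helper, and the sample columns are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(csv_path):
    """Load a CSV file into a list of dicts keyed by the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Stand-in file; with datamaestro, `ds.path` would be passed instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("label,value\ncat,1\ndog,2\n", encoding="utf-8")
    rows = read_rows(sample)

print(rows[0]["label"], len(rows))  # cat 2
```

Note that `csv.DictReader` yields every value as a string, so numeric columns need explicit conversion.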

### Tensor Data (IDX format)

The {py:class}`~datamaestro.data.tensor.IDX` type handles MNIST-style IDX files:

```python
from datamaestro import prepare_dataset

ds = prepare_dataset("com.lecun.mnist")

# IDX files are automatically parsed to numpy arrays
images = ds.train.images.data()  # Returns numpy array
labels = ds.train.labels.data()  # Returns numpy array
```
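
To show what parsing an IDX file involves, here is a minimal standard-library sketch of the format: a 4-byte header (two zero bytes, a type code, the number of dimensions), one big-endian 32-bit size per dimension, then the values. This is not datamaestro's implementation; `parse_idx` is a hypothetical helper, it returns a flat list rather than a numpy array, and it assumes the bytes are already decompressed.

```python
import struct

# Type codes defined by the IDX format, mapped to struct format characters
IDX_TYPES = {
    0x08: "B",  # unsigned byte
    0x09: "b",  # signed byte
    0x0B: "h",  # 16-bit integer
    0x0C: "i",  # 32-bit integer
    0x0D: "f",  # 32-bit float
    0x0E: "d",  # 64-bit float
}


def parse_idx(raw):
    """Return (shape, flat_values) for an uncompressed IDX payload."""
    zero, type_code, ndim = struct.unpack_from(">HBB", raw, 0)
    if zero != 0 or type_code not in IDX_TYPES:
        raise ValueError("not a valid IDX header")
    # Dimension sizes follow the header as big-endian unsigned 32-bit ints
    shape = struct.unpack_from(f">{ndim}I", raw, 4)
    count = 1
    for dim in shape:
        count *= dim
    # The values start right after the dimension sizes, also big-endian
    offset = 4 + 4 * ndim
    values = struct.unpack_from(f">{count}{IDX_TYPES[type_code]}", raw, offset)
    return shape, list(values)


# A tiny 2x3 unsigned-byte array built by hand
raw = bytes([0, 0, 0x08, 2]) + struct.pack(">II", 2, 3) + bytes(range(6))
shape, values = parse_idx(raw)
print(shape, values)  # (2, 3) [0, 1, 2, 3, 4, 5]
```

Files distributed with a `.gz` suffix would need to be decompressed first, for example with `gzip.decompress`.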

### HuggingFace Datasets

Some datasets integrate with HuggingFace:

```python
from datamaestro import prepare_dataset

ds = prepare_dataset("some.huggingface.dataset")  # placeholder dataset ID

# Access the HuggingFace dataset object
hf_dataset = ds.dataset
```

## Using with Experimaestro

Datamaestro integrates with [experimaestro](http://experimaestro.github.io/experimaestro-python/) for experiment management:

```python
from experimaestro import experiment
from datamaestro import prepare_dataset

@experiment()
def my_experiment():
    # Datasets are automatically tracked
    ds = prepare_dataset("com.lecun.mnist")

    # Use in your experiment (train_model is a placeholder for your own code)
    train_model(ds.train)
```

## Data Storage Location

By default, datasets are stored in `~/datamaestro/data/`. You can change this location in any of the following ways.

### Environment Variable

```bash
export DATAMAESTRO_DIR=/path/to/data
datamaestro prepare com.lecun.mnist
```
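
When the CLI is launched from a Python script, the variable can be set for that one process instead of being exported globally. This is a minimal sketch using only the standard library; `datamaestro_env` and `prepare_in` are hypothetical helpers, and the only datamaestro-specific assumption is the `DATAMAESTRO_DIR` variable shown above.

```python
import os
import subprocess


def datamaestro_env(data_dir, base_env=None):
    """Return a copy of the environment with DATAMAESTRO_DIR set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DATAMAESTRO_DIR"] = str(data_dir)
    return env


def prepare_in(data_dir, dataset_id):
    """Run `datamaestro prepare` with the storage location overridden."""
    return subprocess.run(
        ["datamaestro", "prepare", dataset_id],
        env=datamaestro_env(data_dir),
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
```

Copying the environment rather than mutating `os.environ` keeps the override local to the child process.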

### Command Line Option

```bash
datamaestro --data /path/to/data prepare com.lecun.mnist
```

### Python API

```python
from pathlib import Path
from datamaestro import prepare_dataset

ds = prepare_dataset("com.lecun.mnist", context=Path("/path/to/data"))
```

## Next Steps

- [Dataset Definition](datasets.rst): Learn how to define your own datasets
- [CLI Reference](cli.md): Complete command line reference
- [Configuration](configuration.md): Advanced configuration options
- [API Reference](api/index.md): Full API documentation