Data Types

Data types define the structure of dataset contents. They inherit from datamaestro.data.Base and use experimaestro’s configuration system for type-safe parameter handling.

Base Types

Base

The root class for all data types:

from datamaestro.data import Base

class MyData(Base):
    """Custom data type"""
    pass

XPM Config datamaestro.data.Base(*, id)

Base object for all data types

id: str

The unique (sub-)dataset ID

Generic

Generic placeholder data:

XPM Config datamaestro.data.Generic(*, id)

Generic dataset

This allows any value to be set, but should only be used as a placeholder.

id: str

The unique (sub-)dataset ID

File

Single file reference:

from datamaestro.data import File

# In dataset definition
return File(path=downloaded_path)

# Usage
print(ds.path)  # Path to the file

XPM Config datamaestro.data.File(*, id, path)

A data file

id: str

The unique (sub-)dataset ID

path: path

The path of the file

CSV Data

Package: datamaestro.data.csv

Generic CSV

from datamaestro.data.csv import Generic

return Generic(
    path=csv_path,
    delimiter=",",
    names_row=0,  # Header row index
)

XPM Config datamaestro.data.csv.Generic(*, id, path, delimiter, ignore, names_row)

A generic CSV file

id: str

The unique (sub-)dataset ID

path: path

The path of the file

delimiter: str = ,
ignore: int = 0
names_row: int = -1
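
The three CSV parameters above can be illustrated with a small standard-library sketch. It does not use datamaestro, and the semantics are assumptions rather than documented behavior: `ignore` is taken to be the number of leading lines to skip, and `names_row` the index of the header row after skipping (with -1 meaning "no header"). The helper `read_rows` is hypothetical.

```python
import csv
import io


def read_rows(stream, delimiter=",", ignore=0, names_row=-1):
    """Return (column names or None, data rows) from a CSV stream.

    Assumed semantics (not confirmed by the library):
    - ignore: number of leading lines to skip before anything else
    - names_row: index of the header row after skipping, -1 for no header
    """
    reader = csv.reader(stream, delimiter=delimiter)
    rows = list(reader)[ignore:]
    names = None
    if names_row >= 0:
        names = rows[names_row]
        # Keep only the rows that follow the header
        rows = rows[names_row + 1:]
    return names, rows


# One comment line to skip, then a header, then two data rows
text = "# comment line\nx;y\n1;2\n3;4\n"
names, rows = read_rows(io.StringIO(text), delimiter=";", ignore=1, names_row=0)
print(names, rows)
```

With the defaults shown in the signature (`ignore=0`, `names_row=-1`), this sketch treats every line as data and returns no column names.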

Matrix CSV

For numeric CSV data:

from datamaestro.data.csv import Matrix

return Matrix(
    path=csv_path,
    delimiter=",",
    target=-1,  # Target column index (-1 for last)
)

XPM Config datamaestro.data.csv.Matrix(*, id, path, delimiter, ignore, names_row, size_row, target)

A numerical dataset

id: str

The unique (sub-)dataset ID

path: path

The path of the file

delimiter: str = ,
ignore: int = 0
names_row: int = -1
size_row: int = -1
target: str

Tensor Data

Package: datamaestro.data.tensor

IDX Format

The IDX format is used by MNIST and similar datasets:

from datamaestro.data.tensor import IDX

idx_data = IDX(path=idx_file_path)

# Load as numpy array
array = idx_data.data()
print(array.shape)  # e.g., (60000, 28, 28)
print(array.dtype)  # e.g., uint8

XPM Config datamaestro.data.tensor.IDX(*, id, path)

IDX File format

The IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.

The basic format is:

    magic number
    size in dimension 0
    size in dimension 1
    size in dimension 2
    .....
    size in dimension N
    data

The magic number is an integer (MSB first). The first 2 bytes are always 0.

The third byte codes the type of the data:

    0x08: unsigned byte
    0x09: signed byte
    0x0B: short (2 bytes)
    0x0C: int (4 bytes)
    0x0D: float (4 bytes)
    0x0E: double (8 bytes)

The fourth byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices, and so on.

The sizes in each dimension are 4-byte integers (MSB first, i.e. big-endian, as in most non-Intel processors).

The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.

id: str

The unique (sub-)dataset ID

path: path

The path of the file
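
The header layout described above can be decoded with the standard `struct` module. The following is a minimal sketch, not the implementation behind `IDX.data()`: the function name `parse_idx` is hypothetical, dimension sizes are read as unsigned integers, and multi-byte data values are assumed to be stored MSB first like the header.

```python
import struct

# Third byte of the magic number -> struct format character
IDX_TYPES = {
    0x08: "B",  # unsigned byte
    0x09: "b",  # signed byte
    0x0B: "h",  # short (2 bytes)
    0x0C: "i",  # int (4 bytes)
    0x0D: "f",  # float (4 bytes)
    0x0E: "d",  # double (8 bytes)
}


def parse_idx(buffer: bytes):
    """Parse an IDX buffer into (shape, flat list of values)."""
    zero1, zero2, type_code, ndim = struct.unpack_from(">BBBB", buffer, 0)
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX buffer: the first two bytes must be 0")
    fmt = IDX_TYPES[type_code]
    # One 4-byte big-endian size per dimension (read as unsigned: an assumption)
    shape = struct.unpack_from(f">{ndim}I", buffer, 4)
    count = 1
    for dim in shape:
        count *= dim
    # Values follow the header in C order (last index changes fastest)
    values = struct.unpack_from(f">{count}{fmt}", buffer, 4 + 4 * ndim)
    return shape, list(values)


# Build a tiny 2x3 unsigned-byte matrix in memory and parse it back
raw = bytes([0, 0, 0x08, 2]) + struct.pack(">II", 2, 3) + bytes(range(6))
shape, values = parse_idx(raw)
print(shape, values)  # (2, 3) [0, 1, 2, 3, 4, 5]
```

The values come back as a flat list; reshaping them according to `shape` (for instance with NumPy) yields the multidimensional array.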

Machine Learning

Package: datamaestro.data.ml

Supervised Learning

For supervised learning datasets with train/test splits:

from datamaestro.data.ml import Supervised

return Supervised(
    train=train_data,
    test=test_data,
    validation=validation_data,  # Optional
)

XPM Config datamaestro.data.ml.Supervised(*, id, train, validation, test)

id: str

The unique (sub-)dataset ID

train: datamaestro.data.Base

The training dataset

validation: datamaestro.data.Base

The validation dataset (optional)

test: datamaestro.data.Base

The test dataset (optional)

HuggingFace Integration

Package: datamaestro.data.huggingface

For datasets from the HuggingFace Hub:

from datamaestro.data.huggingface import DatasetDict

return DatasetDict(
    dataset_id="squad",
    config=None,  # Optional config name
)

Creating Custom Data Types

Basic Custom Type

Create custom data types by inheriting from Base. Use Param from experimaestro to define typed parameters:

from pathlib import Path
from experimaestro import Param
from datamaestro.data import Base

class TextCorpus(Base):
    """A text corpus with documents"""

    path: Param[Path]
    """Path to the corpus directory"""

    encoding: Param[str] = "utf-8"
    """Text encoding"""

    def documents(self):
        """Iterate over documents"""
        for file in self.path.glob("*.txt"):
            yield file.read_text(encoding=self.encoding)

    def __len__(self):
        return len(list(self.path.glob("*.txt")))

Nested Data Types

from experimaestro import Param
from datamaestro.data import Base

class LabelledData(Base):
    """Data with labels"""

    data: Param[Base]
    """The actual data"""

    labels: Param[Base]
    """The labels"""

class ImageClassification(Base):
    """Image classification dataset"""

    train: Param[LabelledData]
    """Training split"""

    test: Param[LabelledData]
    """Test split"""

    num_classes: Param[int]
    """Number of classes"""

With Data Loading Methods

from pathlib import Path
from experimaestro import Param
from datamaestro.data import Base

class JSONLData(Base):
    """JSON Lines format data"""

    path: Param[Path]

    def __iter__(self):
        """Iterate over records"""
        import json
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)

    def to_pandas(self):
        """Load as pandas DataFrame"""
        import pandas as pd
        return pd.read_json(self.path, lines=True)

    def to_list(self):
        """Load all records into a list"""
        return list(self)
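
One detail of the iteration above is that `json.loads` raises an error on an empty line, so a file containing blank lines would fail part-way through. The sketch below shows the same line-by-line pattern as a plain function that skips blank lines. It uses only the standard library; the function name `iter_jsonl` and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def iter_jsonl(path):
    """Yield one decoded object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between or after records
                yield json.loads(line)


# Write a small file (with a blank line in the middle) and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "records.jsonl"
    path.write_text(
        '{"id": 1, "text": "a"}\n\n{"id": 2, "text": "b"}\n',
        encoding="utf-8",
    )
    records = list(iter_jsonl(path))

print(records)
```

Because the function is a generator, records are decoded lazily; wrapping it in `list(...)`, as `to_list` does above, loads everything into memory at once.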

Inheriting from Existing Types

from experimaestro import Param
from datamaestro.data.csv import Matrix

class ClassificationMatrix(Matrix):
    """CSV matrix for classification tasks"""

    num_classes: Param[int]
    """Number of target classes"""

    class_names: Param[list] = None
    """Optional class names"""

    def get_class_name(self, index: int) -> str:
        if self.class_names:
            return self.class_names[index]
        return str(index)

Type Annotations with Experimaestro

Data types use experimaestro’s annotation system (Param, Option, Meta):

from pathlib import Path
from experimaestro import Param, Option, Meta
from datamaestro.data import Base

class MyData(Base):
    # Required parameter
    path: Param[Path]

    # Optional parameter with default
    encoding: Param[str] = "utf-8"

    # Option (not serialized, for runtime configuration)
    cache_size: Option[int] = 1000

    # Metadata (not part of configuration identity)
    description: Meta[str] = ""

See the experimaestro documentation for more details on the configuration system.