Data Types
Data types define the structure of dataset contents. They inherit from datamaestro.data.Base
and use experimaestro’s configuration system for type-safe parameter handling.
Base Types
Base
The root class for all data types:
Generic
Generic data with a path:
File
Single file reference:
CSV Data
Package: datamaestro.data.csv
Generic CSV
Matrix CSV
For numeric CSV data:
Tensor Data
Package: datamaestro.data.tensor
IDX Format
The IDX format is used by MNIST and similar datasets:
- XPM Config datamaestro.data.tensor.IDX(*, id, path)
IDX File format
The IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.
The basic format is:
    magic number
    size in dimension 0
    size in dimension 1
    size in dimension 2
    .....
    size in dimension N
    data
The magic number is an integer (MSB first). The first 2 bytes are always 0.
The third byte codes the type of the data:
- 0x08: unsigned byte
- 0x09: signed byte
- 0x0B: short (2 bytes)
- 0x0C: int (4 bytes)
- 0x0D: float (4 bytes)
- 0x0E: double (8 bytes)
The fourth byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices, and so on.
The sizes in each dimension are 4-byte integers (MSB first, i.e. big-endian, as on most non-Intel processors).
The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.
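The layout described above can be decoded with the standard library alone. The following is a minimal sketch, not the datamaestro implementation: the names `IDX_TYPES` and `parse_idx` are hypothetical, and it assumes that multi-byte data values are stored big-endian like the header.

```python
import struct
from math import prod

# Type code (third byte of the magic number) -> (struct format char, size in bytes)
IDX_TYPES = {
    0x08: ("B", 1),  # unsigned byte
    0x09: ("b", 1),  # signed byte
    0x0B: ("h", 2),  # short
    0x0C: ("i", 4),  # int
    0x0D: ("f", 4),  # float
    0x0E: ("d", 8),  # double
}


def parse_idx(buffer: bytes):
    """Return (shape, flat_values) decoded from an in-memory IDX buffer."""
    # Magic number: two zero bytes, the type code, then the number of dimensions
    zero1, zero2, type_code, ndim = struct.unpack_from(">BBBB", buffer, 0)
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX buffer: the first two bytes must be 0")
    if type_code not in IDX_TYPES:
        raise ValueError(f"unknown IDX type code: {type_code:#04x}")
    fmt, size = IDX_TYPES[type_code]

    # One 4-byte big-endian integer per dimension
    shape = struct.unpack_from(f">{ndim}I", buffer, 4)
    offset = 4 + 4 * ndim
    count = prod(shape)
    if len(buffer) < offset + count * size:
        raise ValueError("IDX buffer is shorter than its header announces")

    # Assumption: data values are big-endian, like the header fields
    values = struct.unpack_from(f">{count}{fmt}", buffer, offset)
    return shape, list(values)


# Build a tiny 2x3 unsigned-byte matrix in memory and decode it
header = bytes([0, 0, 0x08, 2]) + struct.pack(">II", 2, 3)
shape, values = parse_idx(header + bytes(range(6)))
```

Because the last dimension changes fastest, element (i, j) of a two-dimensional array sits at `values[i * shape[1] + j]` in the flat list.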
Machine Learning
Package: datamaestro.data.ml
Supervised Learning
For supervised learning datasets with train/test splits:
from datamaestro.data.ml import Supervised

# Inside a dataset definition function:
return Supervised(
    train=train_data,
    test=test_data,
    validation=validation_data,  # Optional
)
- XPM Config datamaestro.data.ml.Supervised(*, id, train, validation, test)
  - train: datamaestro.data.Base
    The training dataset
  - validation: datamaestro.data.Base
    The validation dataset (optional)
  - test: datamaestro.data.Base
    The test dataset (optional)
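Code that consumes such a dataset usually needs to cope with splits that were not provided. The sketch below illustrates one way to do this. It uses a plain dataclass, `Splits`, as a hypothetical stand-in for the real class so that it runs without datamaestro, and it assumes that a missing split is represented by `None`; `available_splits` is likewise an illustrative helper, not part of the library.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Splits:
    """Hypothetical stand-in mirroring the train/validation/test fields."""

    train: Any
    validation: Optional[Any] = None
    test: Optional[Any] = None


def available_splits(dataset: Any) -> Dict[str, Any]:
    """Map split names to their data, skipping splits that are None or absent."""
    names = ("train", "validation", "test")
    return {
        name: getattr(dataset, name)
        for name in names
        if getattr(dataset, name, None) is not None
    }


# Only train and test are provided here, so validation is left out of the result
splits = available_splits(Splits(train=[1, 2], test=[3]))
```

The helper only relies on attribute names, so under the same assumption it should work on any object exposing `train`, `validation`, and `test`.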
HuggingFace Integration
Package: datamaestro.data.huggingface
For datasets from the HuggingFace Hub:
from datamaestro.data.huggingface import DatasetDict

return DatasetDict(
    dataset_id="squad",
    config=None,  # Optional config name
)
Creating Custom Data Types
Basic Custom Type
Create custom data types by inheriting from Base.
Use Param from experimaestro to define typed parameters:
from pathlib import Path

from experimaestro import Param
from datamaestro.data import Base


class TextCorpus(Base):
    """A text corpus with documents"""

    path: Param[Path]
    """Path to the corpus directory"""

    encoding: Param[str] = "utf-8"
    """Text encoding"""

    def documents(self):
        """Iterate over documents"""
        for file in self.path.glob("*.txt"):
            yield file.read_text(encoding=self.encoding)

    def __len__(self):
        return len(list(self.path.glob("*.txt")))
Nested Data Types
from experimaestro import Param
from datamaestro.data import Base


class LabelledData(Base):
    """Data with labels"""

    data: Param[Base]
    """The actual data"""

    labels: Param[Base]
    """The labels"""


class ImageClassification(Base):
    """Image classification dataset"""

    train: Param[LabelledData]
    """Training split"""

    test: Param[LabelledData]
    """Test split"""

    num_classes: Param[int]
    """Number of classes"""
With Data Loading Methods
from pathlib import Path

from experimaestro import Param
from datamaestro.data import Base


class JSONLData(Base):
    """JSON Lines format data"""

    path: Param[Path]

    def __iter__(self):
        """Iterate over records"""
        import json

        with open(self.path) as f:
            for line in f:
                yield json.loads(line)

    def to_pandas(self):
        """Load as pandas DataFrame"""
        import pandas as pd

        return pd.read_json(self.path, lines=True)

    def to_list(self):
        """Load all records into a list"""
        return list(self)
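The iteration logic can be tried out on its own, without experimaestro. The sketch below is a standalone variant under stated assumptions: `iter_jsonl` is a hypothetical helper, the file is assumed to be UTF-8, and, unlike the method above, it skips blank lines, since `json.loads` raises an error on an empty line.

```python
import json
import tempfile
from pathlib import Path


def iter_jsonl(path: Path):
    """Yield one decoded object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between or after records
                yield json.loads(line)


# Write a small sample file (with a blank line in the middle) and read it back
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "records.jsonl"
    sample.write_text(
        '{"id": 1, "text": "a"}\n\n{"id": 2, "text": "b"}\n',
        encoding="utf-8",
    )
    records = list(iter_jsonl(sample))
```

Records are decoded lazily, one line at a time, so memory use stays bounded until the caller materializes them with `list(...)`.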
Inheriting from Existing Types
from experimaestro import Param
from datamaestro.data.csv import Matrix


class ClassificationMatrix(Matrix):
    """CSV matrix for classification tasks"""

    num_classes: Param[int]
    """Number of target classes"""

    class_names: Param[list] = None
    """Optional class names"""

    def get_class_name(self, index: int) -> str:
        # Fall back to the numeric index when no class names were given
        if self.class_names:
            return self.class_names[index]
        return str(index)
Type Annotations with Experimaestro
Data types use experimaestro’s annotation system (Param,
Option, Meta):
from pathlib import Path

from experimaestro import Param, Option, Meta
from datamaestro.data import Base


class MyData(Base):
    # Required parameter
    path: Param[Path]

    # Optional parameter with default
    encoding: Param[str] = "utf-8"

    # Option (not serialized, for runtime configuration)
    cache_size: Option[int] = 1000

    # Metadata (not part of configuration identity)
    description: Meta[str] = ""
See the experimaestro documentation for more details on the configuration system.