Dataset Variants

A dataset family can expose a set of variant axes that callers select at prepare time. The selected combination is addressed with a query-style suffix on the dataset id:

ai.lighton.embeddings_pre_training[name=agnews,streaming=true]

Unspecified axes fall back to declared defaults. Every resolved axis — defaults included — participates in the experimaestro identity hash, so caches stay disjoint per variant.

Why variants?

Without variants, every concrete parameter combination needs its own registered id. A family with 73 configs × two loading modes × four filter flags blows up into hundreds of ids. Variants let you register one family with declared axes and resolve specific combinations on demand — datamaestro prepare understands the bracketed suffix out of the box, and datamaestro search can match on axis values.

The Variants contract

Variants is an abstract description of a dataset family’s variant space. A concrete subclass implements four methods:

  • resolve(**kwargs) — validate incoming kwargs and fill in defaults.

  • parse_selector(selector) — parse "[k=v,k=v,…]" into kwargs.

  • format_selector(kwargs) — render the canonical "[k=v,…]" form. Includes every resolved value (defaults inclusive) for reproducibility.

  • enumerate() — yield the finite variants for listing/expansion.

The default matches() uses enumerate() to support substring search over enumerable axis values.

Register alternative schemes (named presets, rule-based combinations, …) by subclassing Variants directly.

class datamaestro.variants.Variants

Abstract description of a dataset family’s variant space.

Subclasses define how a selector string translates to concrete kwargs passed into a dataset factory. The most common case is AxesVariants (cartesian product of independent axes); other schemes (named presets, rule-based combinations) can implement this contract directly.

document() → str

Return a Markdown description of this variant space.

Base behaviour is to emit the subclass’s own docstring (if any). Specialized Variants subclasses should override this and may call super().document() to chain onto the docstring.

The returned string is rendered by documentation consumers (e.g. the Sphinx dm:datasets directive) through a Markdown parser; keep the output framework-agnostic.

abstractmethod enumerate() → Iterator[Dict[str, Any]]

Yield the finite variants (if any) for listing/expansion.

abstractmethod format_selector(kwargs: Dict[str, Any]) → str

Render a canonical "[k=v,...]" string for kwargs.

Implementations MUST include every resolved value (defaults inclusive) so two equivalent selectors format identically and round-trip through the id.

matches(query: str) → bool

True if query is a substring of any enumerable axis value. Used by datamaestro search; subclasses can override.

abstractmethod parse_selector(selector: str) → Dict[str, Any]

Parse a "[k=v,k=v]" selector into a kwargs dict.

Values are coerced using the axis’ declared type. An empty selector ("" or "[]") returns {}.

abstractmethod resolve(**kwargs: Any) → Dict[str, Any]

Validate kwargs and return the full canonical variant.

The returned dict MUST include every axis (with defaults filled in) so that it participates in the downstream identity hash. Raise ValueError for unknown axes or missing required axes.

AxesVariants — cartesian product of axes

AxesVariants is the most common case: one independent axis per dimension. It supports two construction styles, both accepted wherever a Variants is expected:

Declarative — axes as class attributes:

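from datamaestro.variants import AxesVariants, Axis

# CONFIGS is assumed to be a list of config names defined elsewhere (for example, one containing "agnews").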

class EmbeddingsVariants(AxesVariants):
    name             = Axis(CONFIGS)                       # required
    streaming        = Axis([False, True], default=True, type=bool)
    filter_drop      = Axis([True, False], default=True, type=bool)
    min_similarity   = Axis(type=float, default=None)      # open axis

Imperative — axes passed to the constructor:

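from datamaestro.variants import AxesVariants, Axis

# CONFIGS is assumed to be a list of config names defined elsewhere (for example, one containing "agnews").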

variants = AxesVariants(
    name=Axis(CONFIGS),
    streaming=Axis([False, True], default=True, type=bool),
    min_similarity=Axis(type=float, default=None),
)

class datamaestro.variants.AxesVariants(**axes: Axis)

Cartesian product of independent axes.

There are two construction styles, both accepted wherever a Variants is expected:

Declarative (recommended for readability):

class MyVariants(AxesVariants):
    name = Axis(["small", "large"])
    streaming = Axis([False, True], default=True, type=bool)

Imperative:

v = AxesVariants(
    name=Axis(["small", "large"]),
    streaming=Axis([False, True], default=True, type=bool),
)

property axes: Mapping[str, Axis]

Read-only view of the axes declared on this variant space.

enumerate() → Iterator[Dict[str, Any]]

Cartesian product over enumerable axes.

Open axes (domain=None) with a default contribute that default; open axes without a default are omitted (they cannot be expanded).
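The expansion rule above can be sketched without the library. This is a minimal illustration, not the AxesVariants implementation: each axis is modelled as a hypothetical (domain, default) tuple, _MISSING is a local stand-in for "no default declared", and the seed axis is invented for the example.

```python
from itertools import product
from typing import Any, Dict, Iterator, List, Optional, Tuple

# Local stand-in for "no default declared" (hypothetical, not the library's marker).
_MISSING = object()

AxisSpec = Tuple[Optional[List[Any]], Any]  # (domain or None for an open axis, default)


def enumerate_variants(axes: Dict[str, AxisSpec]) -> Iterator[Dict[str, Any]]:
    """Yield the cartesian product over enumerable axes."""
    # Axes with a finite domain take part in the product.
    enumerable = {name: domain for name, (domain, _) in axes.items() if domain is not None}
    # Open axes contribute their default when they have one; otherwise they are skipped.
    open_defaults = {
        name: default
        for name, (domain, default) in axes.items()
        if domain is None and default is not _MISSING
    }
    names = list(enumerable)
    for combo in product(*(enumerable[name] for name in names)):
        yield {**dict(zip(names, combo)), **open_defaults}


axes = {
    "name": (["small", "large"], _MISSING),
    "streaming": ([False, True], True),
    "min_similarity": (None, None),  # open axis with a default
    "seed": (None, _MISSING),        # open axis without a default: omitted
}
variants = list(enumerate_variants(axes))
print(len(variants))  # 4 combinations: 2 names x 2 streaming values
```

Every yielded dict carries min_similarity=None and none of them carries seed, since an open axis without a default cannot be expanded.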

format_selector(kwargs: Dict[str, Any]) → str

Canonical string form of kwargs.

Keys are emitted alphabetically (deterministic). Axes declared with in_id=False are always omitted. Axes declared with elide_default=True are omitted when their resolved value equals the default. If every axis ends up omitted, returns "" so the dataset id suffix disappears entirely.
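As a rough illustration of these omission rules, the sketch below formats a resolved variant by hand. It does not use the library: AxisFlags is a hypothetical stand-in that carries only the three fields the rules mention, the top_k and workers axes are invented, and the lowercase true/false/null rendering is inferred from the canonical ids shown in this document.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class AxisFlags:
    """Hypothetical stand-in for the Axis fields used by the formatting rules."""
    default: Any = None
    elide_default: bool = False
    in_id: bool = True


def render(value: Any) -> str:
    # Assumed rendering: None -> null, booleans in lowercase, everything else via str().
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def format_selector(axes: Dict[str, AxisFlags], kwargs: Dict[str, Any]) -> str:
    parts = []
    for name in sorted(kwargs):  # alphabetical order keeps the output deterministic
        axis = axes[name]
        if not axis.in_id:
            continue  # never part of the id
        if axis.elide_default and kwargs[name] == axis.default:
            continue  # default value elided on request
        parts.append(f"{name}={render(kwargs[name])}")
    # When every axis is omitted, the suffix disappears entirely.
    return "[" + ",".join(parts) + "]" if parts else ""


axes = {
    "name": AxisFlags(),
    "streaming": AxisFlags(default=True),
    "top_k": AxisFlags(default=None, elide_default=True),
    "workers": AxisFlags(default=4, in_id=False),
}
resolved = {"streaming": True, "name": "agnews", "top_k": None, "workers": 8}
print(format_selector(axes, resolved))  # [name=agnews,streaming=true]
```

Setting top_k to a non-default value would bring it back into the suffix, while workers stays out regardless of its value.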

matches(query: str) → bool

True if query is a substring of any enumerable axis value. Used by datamaestro search; subclasses can override.
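The substring rule can be pictured with a few lines of plain Python. This is only a sketch under assumptions: domains is a hypothetical mapping from axis name to its list of values (None for an open axis), values are compared through str(), and the match is case-sensitive; the library may differ on the last two points.

```python
from typing import Any, Dict, List, Optional


def matches(query: str, domains: Dict[str, Optional[List[Any]]]) -> bool:
    """True if query is a substring of any value of any enumerable axis."""
    return any(
        query in str(value)
        for domain in domains.values()
        if domain is not None  # open axes have nothing to search
        for value in domain
    )


domains = {"name": ["agnews", "other"], "streaming": [False, True], "min_similarity": None}
print(matches("agn", domains))
print(matches("zzz", domains))
```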

parse_selector(selector: str) → Dict[str, Any]

Parse a "[k=v,k=v]" selector into a kwargs dict.

Values are coerced using the axis’ declared type. An empty selector ("" or "[]") returns {}.

resolve(**kwargs: Any) → Dict[str, Any]

Validate kwargs and return the full canonical variant.

The returned dict MUST include every axis (with defaults filled in) so that it participates in the downstream identity hash. Raise ValueError for unknown axes or missing required axes.
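The resolve contract can be sketched in a self-contained way. The code below is an illustration rather than the library's implementation: AXES is a hypothetical table of (domain, default) tuples, _MISSING is a local stand-in for "no default declared", and the value "other" is an invented config name.

```python
from typing import Any, Dict

# Local stand-in for "no default declared" (hypothetical, not the library's marker).
_MISSING = object()

# axis name -> (allowed values, or None for an open axis; default, or _MISSING if required)
AXES = {
    "name": (["agnews", "other"], _MISSING),  # required
    "streaming": ([False, True], True),
    "min_similarity": (None, None),           # open axis with a default
}


def resolve(**kwargs: Any) -> Dict[str, Any]:
    unknown = sorted(set(kwargs) - set(AXES))
    if unknown:
        raise ValueError(f"unknown axes: {unknown}")
    resolved: Dict[str, Any] = {}
    for axis, (domain, default) in AXES.items():
        if axis in kwargs:
            value = kwargs[axis]
        elif default is not _MISSING:
            value = default  # fill in the declared default
        else:
            raise ValueError(f"missing required axis: {axis}")
        # Enumerable axes only accept values from their domain; open axes accept anything.
        if domain is not None and value not in domain:
            raise ValueError(f"invalid value {value!r} for axis {axis}")
        resolved[axis] = value
    return resolved


print(resolve(name="agnews"))
# {'name': 'agnews', 'streaming': True, 'min_similarity': None}
```

Because the returned dict always lists every axis, two calls that differ only in whether a default was spelled out produce the same dict, and therefore the same downstream identity.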

Axis — one dimension

Axis declares a single variant dimension.

  • Axis([v1, v2, ...]) — discrete enumerable domain; type is inferred from the values if they share one.

  • Axis(type=T) — open axis; any value coercible to T is accepted. T may be Optional[T_inner]; then the literal null/none in a selector maps to None.

  • default=v — value when the axis is omitted from a selector. If you don’t set a default, the axis is required.

Supported coercions for selector strings:

  • "true" / "True" → True

  • "false" / "False" → False

  • "42" → int

  • "3.14" → float

  • "null" → None

Supported axis types are bool, int, float, str, and Optional[T]. Optional[T] maps "null" to None and defers to T for every other literal.

class datamaestro.variants.Axis(domain: List[Any] | None = None, *, type: Any | None = None, default: Any = MISSING, description: str = '', elide_default: bool = False, in_id: bool = True)

One dimension of a variant space.

Parameters:
  • domain – Either a concrete list of allowed values (discrete, enumerable axis) or None (open axis — any value coerced via type is accepted).

  • type – Explicit Python type used to coerce selector strings. If omitted, inferred from the domain when homogeneous. Supports str, int, float, bool, and Optional[T]. For open axes without type, values are kept as strings.

  • default – Value returned when the axis is not set in a selector. Omit to mark the axis as required.

  • description – Optional human-readable label (used by search UI).

  • elide_default – When True, omit this axis from the formatted selector (and therefore from the dataset id suffix) whenever its resolved value equals default. Lets a family grow new axes without changing the id of already-prepared variants. Requires default to be set.

  • in_id – When False, this axis is always excluded from the formatted selector (and thus from the dataset id). The axis still participates in AxesVariants.resolve() and in the cache key, so its value still reaches Dataset.config(). Use for download-time flags (or similar) whose value doesn’t change the output config. Subsumes elide_default — when in_id=False the axis is omitted regardless of the value.

coerce(raw: Any) → Any

Coerce a value (typically a string from query syntax) using the axis’ declared type.

Returns the raw value unchanged if no type is known and no special marker is detected.
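A standalone sketch of type-driven coercion, written against the standard library only, may help picture this. It is not the Axis.coerce implementation: the function signature is hypothetical, it recognises only the exact literals null and none for Optional[T], and it applies no marker detection when no type is given.

```python
from typing import Any, Optional, Union, get_args, get_origin


def coerce(raw: Any, type_: Any = None) -> Any:
    """Coerce a selector string to type_; non-string values pass through."""
    if not isinstance(raw, str):
        return raw  # already a Python value
    # Optional[T]: the null/none literal maps to None, anything else defers to T.
    if get_origin(type_) is Union and type(None) in get_args(type_):
        if raw in ("null", "none"):
            return None
        inner = [arg for arg in get_args(type_) if arg is not type(None)]
        return coerce(raw, inner[0])
    if type_ is bool:
        if raw in ("true", "True"):
            return True
        if raw in ("false", "False"):
            return False
        raise ValueError(f"not a boolean literal: {raw!r}")
    if type_ in (int, float, str):
        return type_(raw)  # int("42"), float("3.14"), str(...)
    return raw  # no known type: keep the raw string


print(coerce("true", bool))             # True
print(coerce("42", int))                # 42
print(coerce("null", Optional[float]))  # None
print(coerce("0.5", Optional[float]))   # 0.5
```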

property enumerable: bool

True iff the axis has a finite enumerated domain.

validate(value: Any) → None

Raise ValueError if value is outside an enumerable domain. Open axes accept any value.

Query-syntax reference

<dataset-id>[key1=value1,key2=value2,…]

  • The brackets are literal; fragments are separated by ,.

  • Whitespace around = and , is trimmed.

  • Unknown keys raise ValueError.

  • Absent axes fall back to declared defaults.

  • An empty selector ([] or [  ]) resolves to all-defaults.
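The purely syntactic rules above (literal brackets, comma-separated fragments, trimmed whitespace, empty selector) can be sketched as a small parser. This is an assumption-laden illustration, not the library's parser: split_id is a hypothetical helper, values stay as strings because type coercion and unknown-key checks need the axis declarations, and values are assumed to contain no commas or brackets.

```python
from typing import Dict, Tuple


def split_id(dataset_id: str) -> Tuple[str, str]:
    """Split '<dataset-id>[...]' into the base id and the bracketed selector."""
    base, bracket, rest = dataset_id.partition("[")
    return base, bracket + rest


def parse_selector(selector: str) -> Dict[str, str]:
    text = selector.strip()
    if not text:
        return {}
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"selector must be bracketed: {selector!r}")
    body = text[1:-1].strip()
    if not body:
        return {}  # "[]" or "[  ]": caller falls back to all defaults
    result: Dict[str, str] = {}
    for fragment in body.split(","):
        key, sep, value = fragment.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed fragment: {fragment!r}")
        result[key.strip()] = value.strip()  # whitespace around = and , is trimmed
    return result


base, selector = split_id("ai.lighton.embeddings_pre_training[name=agnews, streaming = true]")
print(base)                      # ai.lighton.embeddings_pre_training
print(parse_selector(selector))  # {'name': 'agnews', 'streaming': 'true'}
```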

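The canonical form emitted by format_selector lists every axis (defaults inclusive), sorted by axis name, so two equivalent selectors render identically and round-trip through the id. The exceptions are axes declared with in_id=False, which are always omitted, and axes declared with elide_default=True, which are omitted when their resolved value equals the default.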

Canonical id includes defaults

The user-facing form produced by format_selector includes every axis, defaults included, unless an axis opts out through in_id=False or elide_default=True. A user who types

ai.lighton.embeddings_pre_training[name=agnews]

sees back

ai.lighton.embeddings_pre_training[filter_drop=true,filter_duplicate=true,min_similarity=null,name=agnews,streaming=true,top_percentile=null]

as the identity-bearing canonical form. The short form is accepted as shorthand; the canonical form is what drives the identity hash.