Dataset Variants
A dataset family can expose a set of variant axes that callers select at prepare time. The selected combination is addressed with a query-style suffix on the dataset id:
ai.lighton.embeddings_pre_training[name=agnews,streaming=true]
Unspecified axes fall back to declared defaults. Every resolved axis — defaults included — participates in the experimaestro identity hash, so caches stay disjoint per variant.
Why variants?
Without variants, every concrete parameter combination needs its own
registered id. A family with 73 configs × two loading modes × four filter
flags blows up into hundreds of ids. Variants let you register one
family with declared axes and resolve specific combinations on demand —
datamaestro prepare understands the bracketed suffix out of the box, and
datamaestro search can match on axis values.
The Variants contract
Variants is an abstract description of a
dataset family’s variant space. A concrete subclass implements four
methods:
- resolve(**kwargs): validate incoming kwargs and fill in defaults.
- parse_selector(selector): parse "[k=v,k=v,…]" into kwargs.
- format_selector(kwargs): render the canonical "[k=v,…]" form. Includes every resolved value (defaults inclusive) for reproducibility.
- enumerate(): yield the finite variants for listing/expansion.
The default matches() uses
enumerate() to support substring search over enumerable axis values.
Register alternative schemes (named presets, rule-based combinations, …)
by subclassing Variants directly.
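As a rough illustration of the contract, the sketch below implements the four methods for a named-preset scheme. It is self-contained on purpose: `VariantsSketch` and `PresetVariants` are hypothetical stand-ins written for this example, not datamaestro classes, and a real subclass would derive from `datamaestro.variants.Variants` instead.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator


class VariantsSketch(ABC):
    """Stand-in for the four-method contract; not datamaestro's own class."""

    @abstractmethod
    def resolve(self, **kwargs: Any) -> Dict[str, Any]: ...

    @abstractmethod
    def parse_selector(self, selector: str) -> Dict[str, Any]: ...

    @abstractmethod
    def format_selector(self, kwargs: Dict[str, Any]) -> str: ...

    @abstractmethod
    def enumerate(self) -> Iterator[Dict[str, Any]]: ...

    def matches(self, query: str) -> bool:
        # Substring search over every value of every enumerated variant.
        return any(
            query in str(value)
            for variant in self.enumerate()
            for value in variant.values()
        )


class PresetVariants(VariantsSketch):
    """Hypothetical scheme: a single `preset` key picks a named preset."""

    def __init__(self, names, default):
        self.names = list(names)
        self.default = default

    def resolve(self, **kwargs):
        unknown = sorted(set(kwargs) - {"preset"})
        if unknown:
            raise ValueError(f"Unknown keys: {unknown}")
        preset = kwargs.get("preset", self.default)
        if preset not in self.names:
            raise ValueError(f"Unknown preset: {preset!r}")
        return {"preset": preset}

    def parse_selector(self, selector):
        inner = selector.strip()[1:-1]  # drop the literal brackets
        fragments = [f.strip() for f in inner.split(",") if f.strip()]
        pairs = (fragment.partition("=") for fragment in fragments)
        return {key.strip(): value.strip() for key, _, value in pairs}

    def format_selector(self, kwargs):
        return "[" + ",".join(f"{k}={kwargs[k]}" for k in sorted(kwargs)) + "]"

    def enumerate(self):
        for name in self.names:
            yield {"preset": name}


variants = PresetVariants(["tiny", "full"], default="tiny")
resolved = variants.resolve(**variants.parse_selector("[ preset = full ]"))
print(variants.format_selector(resolved))            # [preset=full]
print(variants.format_selector(variants.resolve()))  # [preset=tiny]
```

The default always appears in the formatted selector here, which mirrors the "defaults inclusive" requirement on format_selector.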
- class datamaestro.variants.Variants
Abstract description of a dataset family’s variant space.
Subclasses define how a selector string translates to concrete kwargs passed into a dataset factory. The most common case is AxesVariants (cartesian product of independent axes); other schemes (named presets, rule-based combinations) can implement this contract directly.
- document() str
Return a Markdown description of this variant space.
Base behaviour is to emit the subclass’s own docstring (if any). Specialized Variants subclasses should override this and may call super().document() to chain onto the docstring.
The returned string is rendered by documentation consumers (e.g. the Sphinx dm:datasets directive) through a Markdown parser; keep the output framework-agnostic.
- abstractmethod enumerate() Iterator[Dict[str, Any]]
Yield the finite variants (if any) for listing/expansion.
- abstractmethod format_selector(kwargs: Dict[str, Any]) str
Render a canonical "[k=v,...]" string for kwargs.
Implementations MUST include every resolved value (defaults inclusive) so two equivalent selectors format identically and round-trip through the id.
- matches(query: str) bool
True if query is a substring of any enumerable axis value. Used by datamaestro search; subclasses can override.
AxesVariants — cartesian product of axes
AxesVariants is the most common case:
one independent axis per dimension. Two construction styles, both
accepted anywhere Variants is:
Declarative — axes as class attributes:
from datamaestro.variants import AxesVariants, Axis

# CONFIGS is the list of allowed config names, defined elsewhere.
class EmbeddingsVariants(AxesVariants):
    name = Axis(CONFIGS)  # required
    streaming = Axis([False, True], default=True, type=bool)
    filter_drop = Axis([True, False], default=True, type=bool)
    min_similarity = Axis(type=float, default=None)  # open axis
Imperative — axes passed to the constructor:
from datamaestro.variants import AxesVariants, Axis

variants = AxesVariants(
    name=Axis(CONFIGS),
    streaming=Axis([False, True], default=True, type=bool),
    min_similarity=Axis(type=float, default=None),
)
- class datamaestro.variants.AxesVariants(**axes: Axis)
Cartesian product of independent axes.
Two construction styles, both accepted anywhere Variants is:
Declarative (recommended for readability):
class MyVariants(AxesVariants):
    name = Axis(["small", "large"])
    streaming = Axis([False, True], default=True, type=bool)
Imperative:
v = AxesVariants(
    name=Axis(["small", "large"]),
    streaming=Axis([False, True], default=True, type=bool),
)
- enumerate() Iterator[Dict[str, Any]]
Cartesian product over enumerable axes.
Open axes (domain=None) with a default contribute that default; open axes without a default are omitted (they cannot be expanded).
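The expansion rule can be sketched with itertools.product. This is a minimal illustration under stated assumptions, not the library's implementation: `AxisSketch`, `MISSING` and `enumerate_variants` are hypothetical names introduced here, and the `seed` axis exists only to show an open axis without a default.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Dict, Iterator, List, Optional

MISSING = object()  # sentinel meaning "no default declared"


@dataclass
class AxisSketch:
    domain: Optional[List[Any]] = None  # None marks an open axis
    default: Any = MISSING


def enumerate_variants(axes: Dict[str, AxisSketch]) -> Iterator[Dict[str, Any]]:
    names, choices = [], []
    for name, axis in axes.items():
        if axis.domain is not None:
            # Enumerable axis: every declared value takes part in the product.
            names.append(name)
            choices.append(list(axis.domain))
        elif axis.default is not MISSING:
            # Open axis with a default: contributes that single value.
            names.append(name)
            choices.append([axis.default])
        # Open axis without a default: left out, it cannot be expanded.
    for combination in product(*choices):
        yield dict(zip(names, combination))


axes = {
    "name": AxisSketch(["small", "large"]),
    "streaming": AxisSketch([False, True], default=True),
    "min_similarity": AxisSketch(default=None),  # open, default None
    "seed": AxisSketch(),                        # open, no default
}
variants = list(enumerate_variants(axes))
print(len(variants))  # 4
print(variants[0])    # {'name': 'small', 'streaming': False, 'min_similarity': None}
```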
- format_selector(kwargs: Dict[str, Any]) str
Canonical string form of kwargs.
Keys are emitted alphabetically (deterministic). Axes declared with in_id=False are always omitted. Axes declared with elide_default=True are omitted when their resolved value equals the default. If every axis ends up omitted, returns "" so the dataset id suffix disappears entirely.
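A minimal sketch of these omission rules follows. `AxisSketch`, `render` and the axis names `revision` and `num_workers` are hypothetical, chosen for illustration; rendering True as true and None as null follows the canonical id shown elsewhere on this page, and the rest is an assumption rather than the library's code.

```python
from dataclasses import dataclass
from typing import Any, Dict

MISSING = object()  # sentinel meaning "no default declared"


@dataclass
class AxisSketch:
    default: Any = MISSING
    elide_default: bool = False
    in_id: bool = True


def render(value: Any) -> str:
    # Assumed spelling of values inside a selector.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def format_selector(axes: Dict[str, AxisSketch], kwargs: Dict[str, Any]) -> str:
    parts = []
    for key in sorted(kwargs):  # alphabetical, hence deterministic
        axis = axes[key]
        if not axis.in_id:
            continue  # never part of the id
        if axis.elide_default and kwargs[key] == axis.default:
            continue  # default value, elided on request
        parts.append(f"{key}={render(kwargs[key])}")
    return "[" + ",".join(parts) + "]" if parts else ""


axes = {
    "name": AxisSketch(),
    "streaming": AxisSketch(default=True),
    "revision": AxisSketch(default="v1", elide_default=True),
    "num_workers": AxisSketch(default=4, in_id=False),
}
base = {"name": "agnews", "streaming": True, "num_workers": 8}
print(format_selector(axes, {**base, "revision": "v1"}))
# [name=agnews,streaming=true]
print(format_selector(axes, {**base, "revision": "v2"}))
# [name=agnews,revision=v2,streaming=true]
print(repr(format_selector(axes, {"revision": "v1", "num_workers": 8})))
# ''
```

Because `revision` is elided at its default, adding that axis later would leave the id of an already-prepared `[name=agnews,streaming=true]` variant unchanged.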
- matches(query: str) bool
True if query is a substring of any enumerable axis value. Used by datamaestro search; subclasses can override.
Axis — one dimension
Axis declares a single variant
dimension.
- Axis([v1, v2, ...]): discrete enumerable domain; type is inferred from the values if they share one.
- Axis(type=T): open axis; any value coercible to T is accepted. T may be Optional[T_inner]; then the literal null/none in a selector maps to None.
- default=v: value when the axis is omitted from a selector. If you don’t set a default, the axis is required.
Supported coercions for selector strings:
| Type |  |  |  |  |  |
|---|---|---|---|---|---|
|  | ✔ | ✔ |  |  |  |
|  | ✔ |  |  |  |  |
|  | ✔ | ✔ |  |  |  |
|  |  |  |  |  |  |
|  | defers to | defers to | defers to | defers to | ✔ |
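To make the coercion step concrete, here is a small sketch for the supported types (str, int, float, bool and Optional[T]). The function `coerce` is hypothetical, and the exact literals it accepts are assumptions: only null/none for Optional types is stated on this page, while the true/false spelling for booleans and the case-insensitive matching are guesses for illustration.

```python
from typing import Any, Optional, Union, get_args, get_origin


def coerce(text: str, target: Any) -> Any:
    """Turn a selector string into a value of type `target` (sketch)."""
    if get_origin(target) is Union:
        # Optional[T] is Union[T, None]: handle the null literal, then defer to T.
        if text.lower() in ("null", "none"):
            return None
        inner = [arg for arg in get_args(target) if arg is not type(None)]
        return coerce(text, inner[0])
    if target is bool:
        lowered = text.lower()  # assumed: case-insensitive true/false
        if lowered == "true":
            return True
        if lowered == "false":
            return False
        raise ValueError(f"Not a boolean literal: {text!r}")
    if target in (int, float):
        return target(text)  # int("x") and float("x") raise ValueError
    return text  # str, or no declared type: keep the string as-is


print(coerce("true", bool))              # True
print(coerce("42", int))                 # 42
print(coerce("0.5", Optional[float]))    # 0.5
print(coerce("null", Optional[float]))   # None
```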
- class datamaestro.variants.Axis(domain: List[Any] | None = None, *, type: Any | None = None, default: Any = MISSING, description: str = '', elide_default: bool = False, in_id: bool = True)
One dimension of a variant space.
- Parameters:
domain – Either a concrete list of allowed values (discrete, enumerable axis) or None (open axis: any value coerced via type is accepted).
type – Explicit Python type used to coerce selector strings. If omitted, inferred from the domain when homogeneous. Supports str, int, float, bool, and Optional[T]. For open axes without type, values are kept as strings.
default – Value returned when the axis is not set in a selector. Omit to mark the axis as required.
description – Optional human-readable label (used by search UI).
elide_default – When True, omit this axis from the formatted selector (and therefore from the dataset id suffix) whenever its resolved value equals default. Lets a family grow new axes without changing the id of already-prepared variants. Requires default to be set.
in_id – When False, this axis is always excluded from the formatted selector (and thus from the dataset id). The axis still participates in AxesVariants.resolve() and in the cache key, so its value still reaches Dataset.config(). Use for download-time flags (or similar) whose value doesn’t change the output config. Subsumes elide_default: when in_id=False the axis is omitted regardless of the value.
Query-syntax reference
<dataset-id>[key1=value1,key2=value2,…]
- The brackets are literal; fragments are separated by ",".
- Whitespace around = and , is trimmed.
- Unknown keys raise ValueError.
- Absent axes fall back to declared defaults.
- An empty selector ([] or [ ]) resolves to all-defaults.
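The syntax rules above can be sketched as a small parser. The helpers `split_dataset_id`, `parse_selector` and `resolve` are hypothetical names for this example, values are left as strings (no type coercion), and raising ValueError for a missing required axis is an assumption; the page only states that unknown keys raise ValueError.

```python
from typing import Dict, Iterable, Tuple


def split_dataset_id(dataset_id: str) -> Tuple[str, str]:
    """Split 'base[k=v]' into ('base', '[k=v]'); selector is '' if absent."""
    start = dataset_id.find("[")
    if start == -1:
        return dataset_id, ""
    if not dataset_id.endswith("]"):
        raise ValueError(f"Unterminated selector in {dataset_id!r}")
    return dataset_id[:start], dataset_id[start:]


def parse_selector(selector: str) -> Dict[str, str]:
    text = selector.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"Selector must be bracketed: {selector!r}")
    inner = text[1:-1].strip()
    if not inner:
        return {}  # "[]" and "[ ]" carry no overrides
    kwargs = {}
    for fragment in inner.split(","):
        key, sep, value = fragment.partition("=")
        if not sep:
            raise ValueError(f"Expected key=value, got {fragment.strip()!r}")
        kwargs[key.strip()] = value.strip()  # trim around "=" and ","
    return kwargs


def resolve(selector: str, defaults: Dict[str, str],
            required: Iterable[str] = ()) -> Dict[str, str]:
    given = parse_selector(selector)
    required = set(required)
    unknown = sorted(set(given) - set(defaults) - required)
    if unknown:
        raise ValueError(f"Unknown keys: {unknown}")
    missing = sorted(required - set(given))
    if missing:
        raise ValueError(f"Missing required axes: {missing}")  # assumed behaviour
    return {**defaults, **given}  # absent axes fall back to defaults


defaults = {"streaming": "true", "filter_drop": "true"}
base, selector = split_dataset_id(
    "ai.lighton.embeddings_pre_training[ name = agnews , streaming=false ]"
)
print(base)  # ai.lighton.embeddings_pre_training
print(resolve(selector, defaults, required=["name"]))
# {'streaming': 'false', 'filter_drop': 'true', 'name': 'agnews'}
print(resolve("[ ]", defaults))
# {'streaming': 'true', 'filter_drop': 'true'}
```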
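The canonical form emitted by format_selector lists every axis that participates in the id (defaults inclusive), sorted by axis name, so two equivalent selectors render identically and round-trip through the id. For AxesVariants, the exceptions are axes declared with in_id=False, which are always omitted, and axes declared with elide_default=True, which are omitted while they hold their default.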
Canonical id includes defaults
The user-facing form produced by format_selector includes every axis.
A user who types
ai.lighton.embeddings_pre_training[name=agnews]
sees back
ai.lighton.embeddings_pre_training[filter_drop=true,filter_duplicate=true,min_similarity=null,name=agnews,streaming=true,top_percentile=null]
as the identity-bearing canonical form. The short form is accepted as shorthand; the canonical form is what drives the identity hash.
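The short-form/canonical-form equivalence can be checked with a few lines. In this sketch the defaults are read off the canonical id above, `canonical_id` and `render` are hypothetical helpers, and a SHA-256 digest of the canonical string stands in for the experimaestro identity hash, which is computed differently in practice.

```python
import hashlib
from typing import Any

# Defaults as they appear in the canonical example; `name` has no default.
DEFAULTS = {
    "filter_drop": True,
    "filter_duplicate": True,
    "min_similarity": None,
    "streaming": True,
    "top_percentile": None,
}


def render(value: Any) -> str:
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def canonical_id(base: str, **given: Any) -> str:
    resolved = {**DEFAULTS, **given}
    body = ",".join(f"{key}={render(resolved[key])}" for key in sorted(resolved))
    return f"{base}[{body}]"


def digest(text: str) -> str:
    # Stand-in for the identity hash, used only to compare ids here.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


BASE = "ai.lighton.embeddings_pre_training"
short = canonical_id(BASE, name="agnews")
spelled_out = canonical_id(BASE, streaming=True, name="agnews")
other = canonical_id(BASE, name="agnews", streaming=False)

print(short)
print(digest(short) == digest(spelled_out))  # True
print(digest(short) == digest(other))        # False
```

Two selectors that resolve to the same values share one canonical string and therefore one digest, while changing any axis yields a different one, which is the property that keeps caches disjoint per variant.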