Download Resources
==================

Resources represent steps in a dataset preparation pipeline. They form a
directed acyclic graph (DAG) where each resource can depend on other
resources.

Key concepts:

- **Two-path system**: resources write to ``transient_path`` during download,
  then the framework moves data to ``path`` and marks the resource as
  COMPLETE.
- **Three states**: NONE, PARTIAL, COMPLETE (persisted in ``.state.json``).
- **Transient resources**: intermediate resources that can be deleted after
  all dependents are COMPLETE (eager cleanup).

Resource Hierarchy
------------------

.. code-block:: text

    Resource (ABC)
    ├── FileResource: produces a single file
    ├── FolderResource: produces a directory
    │   └── FilesCopy: copies files from another resource
    ├── ValueResource: produces an in-memory value (no files)
    ├── reference: references another dataset
    └── Download: deprecated alias for Resource

Resource Base Class
-------------------

.. autoclass:: datamaestro.download.Resource
   :members: name, dataset, transient, can_recover, dependencies, dependents, path, transient_path, state, download, prepare, cleanup, has_files, bind, apply

ResourceState
~~~~~~~~~~~~~

.. autoclass:: datamaestro.download.ResourceState
   :members:

FileResource
~~~~~~~~~~~~

.. autoclass:: datamaestro.download.FileResource
   :members: path, transient_path, prepare, stream, download

FolderResource
~~~~~~~~~~~~~~

.. autoclass:: datamaestro.download.FolderResource
   :members: path, transient_path, prepare, download

ValueResource
~~~~~~~~~~~~~

.. autoclass:: datamaestro.download.ValueResource
   :members: has_files, prepare

Defining Resources (Modern API)
===============================

Resources are defined as class attributes on dataset classes. The framework
automatically detects them and builds the dependency graph.

Single Files
------------

Package: ``datamaestro.download.single``

Use :py:class:`~datamaestro.download.single.FileDownloader` for single-file
downloads:

.. code-block:: python

    from datamaestro.download.single import FileDownloader
    from datamaestro.definitions import AbstractDataset, dataset

    # CSVData stands for the data type produced by this dataset
    @dataset(url="http://example.com")
    class MyDataset(CSVData):
        DATA = FileDownloader("data.csv", "http://example.com/data.csv")

        @classmethod
        def __create_dataset__(cls, dataset: AbstractDataset):
            return cls.C(path=cls.DATA.path)

.. autoclass:: datamaestro.download.single.FileDownloader

**Automatic Decompression:**

Files with ``.gz`` or ``.bz2`` extensions are automatically decompressed:

.. code-block:: python

    # Downloads and decompresses to data.txt
    DATA = FileDownloader(
        "data.txt",
        "http://example.com/data.txt.gz"
    )

ConcatDownloader
~~~~~~~~~~~~~~~~

Downloads multiple files and concatenates them:

.. autoclass:: datamaestro.download.single.ConcatDownloader

.. code-block:: python

    from datamaestro.download.single import ConcatDownloader

    COMBINED = ConcatDownloader(
        "combined.txt",
        "http://example.com/part1.txt",
        "http://example.com/part2.txt",
        "http://example.com/part3.txt",
    )

Archives
--------

Package: ``datamaestro.download.archive``

Archive resources extract archives and produce a directory.

ZipDownloader
~~~~~~~~~~~~~

Downloads and extracts ZIP archives:

.. autoclass:: datamaestro.download.archive.ZipDownloader

.. code-block:: python

    from datamaestro.download.archive import ZipDownloader
    from datamaestro.definitions import AbstractDataset, dataset

    @dataset(url="http://example.com")
    class MyDataset(MyData):
        DATA = ZipDownloader("data", "http://example.com/archive.zip")

        @classmethod
        def __create_dataset__(cls, dataset: AbstractDataset):
            return cls.C(path=cls.DATA.path / "file.csv")

**Parameters:**

- ``varname``: Resource name
- ``url``: URL to the ZIP file
- ``files``: Optional list of files to extract (default: all)
- ``subpath``: Extract only a subdirectory

TarDownloader
~~~~~~~~~~~~~

Downloads and extracts TAR archives (including ``.tar.gz`` and ``.tar.bz2``):

.. autoclass:: datamaestro.download.archive.TarDownloader

.. code-block:: python

    from datamaestro.download.archive import TarDownloader
    from datamaestro.definitions import dataset

    @dataset(url="http://example.com")
    class MyDataset(MyData):
        DATA = TarDownloader(
            "data",
            "http://example.com/archive.tar.gz"
        )

HuggingFace Integration
-----------------------

Package: ``datamaestro.download.huggingface``

For datasets hosted on HuggingFace Hub:

.. autoclass:: datamaestro.download.huggingface.HFDownloader

.. code-block:: python

    from datamaestro.download.huggingface import HFDownloader
    from datamaestro.definitions import dataset

    @dataset(url="https://huggingface.co/datasets/squad")
    class Squad(QADataset):
        HF_DATA = HFDownloader("squad_data", "squad")

Links
-----

Package: ``datamaestro.download.links``

Links reference other datasets or external data folders.

.. code-block:: python

    from datamaestro.download.links import links
    from datamaestro.definitions import AbstractDataset, dataset

    @dataset(url="http://example.com")
    class ExtendedDataset(ExtendedData):
        BASE = links("base", "com.example.base_dataset")

        @classmethod
        def __create_dataset__(cls, dataset: AbstractDataset):
            return cls.C(base=cls.BASE.prepare())

linkfolder
~~~~~~~~~~

Link to a configured data folder:

.. code-block:: python

    from datamaestro.download.links import linkfolder
    from datamaestro.definitions import dataset

    @dataset(url="http://example.com")
    class ExternalDataset(MyData):
        DATA = linkfolder("data", "mydata")

linkfile
~~~~~~~~

Link to a specific file in a data folder:

.. code-block:: python

    from datamaestro.download.links import linkfile
    from datamaestro.definitions import dataset

    @dataset(url="http://example.com")
    class SpecificFile(MyData):
        CSV = linkfile("csvfile", "mydata", "subdir/data.csv")

Internet Archive (Wayback Machine)
-----------------------------------

Package: ``datamaestro.download.wayback``

For datasets that are no longer available at their original URLs:

.. code-block:: python

    from datamaestro.download.wayback import wayback_documents
    from datamaestro.definitions import dataset

    @dataset(url="http://example.com")
    class ArchivedDataset(MyData):
        DATA = wayback_documents(
            "data",
            "http://defunct-website.com/data.csv",
            timestamp="20200101"
        )

Custom Downloads
----------------

Package: ``datamaestro.download.custom``

For complex download logic that doesn't fit standard patterns:

.. autoclass:: datamaestro.download.custom.Downloader

.. autofunction:: datamaestro.download.custom.custom_download

reference
---------

References another dataset instead of downloading:

.. autoclass:: datamaestro.download.reference

.. code-block:: python

    from datamaestro.download import reference
    from datamaestro.definitions import dataset

    # other_dataset is a dataset defined elsewhere
    @dataset(url="http://example.com")
    class DerivedDataset(MyData):
        BASE = reference("base", reference=other_dataset)

FilesCopy
---------

Package: ``datamaestro.download``

Copies specific files from a source resource (typically a transient archive)
into a persistent folder. This is useful when you only need a few files from
a large archive and want the archive to be cleaned up after extraction.

.. autoclass:: datamaestro.download.FilesCopy

.. code-block:: python

    from datamaestro.download.archive import ZipDownloader
    from datamaestro.download import FilesCopy
    from datamaestro.definitions import Dataset, dataset

    @dataset(url="http://example.com")
    class MyDataset(Dataset):
        # Archive is transient: deleted after FilesCopy completes
        ARCHIVE = ZipDownloader("data", "http://example.com/data.zip", transient=True)

        # Copy only the files we need
        FILES = FilesCopy(ARCHIVE, {
            "queries.jsonl": "queries.jsonl",
            "train.tsv": "qrels/train.tsv",
            "test.tsv": "qrels/test.tsv",
        })

        def config(self):
            return MyType.C(
                queries=self.FILES.path / "queries.jsonl",
                train_qrels=self.FILES.path / "train.tsv",
            )

**Parameters:**

- ``source``: The source resource whose ``path`` contains the files to copy.
- ``files``: A mapping of ``{dest_filename: relative_src_path}``.
  Each entry copies ``source.path / relative_src_path`` to
  ``self.path / dest_filename``.

``FilesCopy`` automatically declares a dependency on ``source``, so the
source is always downloaded and completed before the copy runs.

Transient Resources & Pipelines
================================

Resources can be marked as ``transient``, meaning their data can be deleted
after all downstream dependents reach the COMPLETE state. This is useful for
intermediate files in processing pipelines.

.. code-block:: python

    from datamaestro.download.single import FileDownloader
    from datamaestro.definitions import AbstractDataset, dataset

    @dataset(url="http://example.com")
    class ProcessedDataset(MyData):
        # Raw download: deleted after processing completes
        RAW = FileDownloader(
            "raw.gz",
            "http://example.com/data.gz",
            transient=True,
        )

        # Processed output: kept permanently
        # (MyProcessor is the custom handler defined in the next section)
        PROCESSED = MyProcessor.from_source(RAW)

        @classmethod
        def __create_dataset__(cls, dataset: AbstractDataset):
            return cls.C(path=cls.PROCESSED.path)

Creating Custom Resource Handlers
==================================

Extend the download system by subclassing ``FileResource``,
``FolderResource``, or ``ValueResource``:

.. code-block:: python

    from datamaestro.download import FileResource

    class MyProcessor(FileResource):
        """Process a source file into a numpy array."""

        @property
        def can_recover(self) -> bool:
            return False

        def __init__(self, filename, source, **kw):
            super().__init__(filename, **kw)
            self._dependencies = [source]

        def _download(self, destination):
            # load, process and save are placeholders for user-defined helpers
            source_path = self.dependencies[0].path
            data = load(source_path)
            save(process(data), destination)

        @classmethod
        def from_source(cls, source):
            return cls("processed.npy", source)

    # Factory alias
    my_processor = MyProcessor.from_source

The ``_download(destination)`` method receives ``self.transient_path`` as
``destination``. After it returns, the framework moves data from
``transient_path`` to ``path`` and marks the resource as COMPLETE.
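
The hand-off between ``transient_path`` and ``path`` can be illustrated with
a small standalone sketch. It uses only the Python standard library and does
not call datamaestro; ``write_payload``, ``run_two_path`` and the one-key
state file are hypothetical names chosen for illustration, and the failure
branch assumes a resource that cannot recover.

.. code-block:: python

    import json
    import shutil
    import tempfile
    from pathlib import Path


    def write_payload(destination: Path) -> None:
        """Stand-in for a resource's ``_download``: writes only to ``destination``."""
        destination.write_text("col_a,col_b\n1,2\n")


    def run_two_path(download, transient_path: Path, path: Path, state_file: Path) -> str:
        """Run ``download`` on the transient path, then promote the result."""
        transient_path.parent.mkdir(parents=True, exist_ok=True)
        try:
            download(transient_path)
        except Exception:
            # Assumed non-recoverable: discard partial data and report "none".
            if transient_path.exists():
                transient_path.unlink()
            state = "none"
        else:
            # Success: move the finished file to its final location.
            path.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(transient_path), str(path))
            state = "complete"
        state_file.write_text(json.dumps({"state": state}))
        return state


    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        final_path = root / "data" / "data.csv"
        staging_path = root / "transient" / "data.csv"
        result = run_two_path(write_payload, staging_path, final_path, root / "state.json")
        final_exists = final_path.exists()
        staging_exists = staging_path.exists()

Because the download function only ever sees the staging location, a crash
midway never leaves a half-written file at the final location.
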
File Validation
===============

Package: ``datamaestro.utils``

Validate downloaded files with checksums:

.. autoclass:: datamaestro.utils.FileChecker

.. autoclass:: datamaestro.utils.HashCheck
   :members: __init__

.. code-block:: python

    from datamaestro.download.single import FileDownloader
    from datamaestro.utils import HashCheck

    DATA = FileDownloader(
        "data.csv",
        "http://example.com/data.csv",
        checker=HashCheck("sha256", "abc123def456...")
    )

**Supported hash algorithms:** ``md5``, ``sha1``, ``sha256``, ``sha512``

Two-Path Download Flow
======================

The framework orchestrates the download process for each resource:

1. **COMPLETE and not force**: skip (no-op)
2. **PARTIAL and not can_recover**: delete ``transient_path``, set NONE
3. **PARTIAL and can_recover**: leave ``transient_path`` in place for
   resumption
4. Call ``resource.download(force)``: the resource writes to
   ``transient_path``
5. **On success**: move ``transient_path`` → ``path``, set COMPLETE
6. **On failure**: if ``can_recover``, set PARTIAL; otherwise delete and
   set NONE
7. **Eager cleanup**: for each transient dependency where all dependents are
   COMPLETE, call ``cleanup()``

State Metadata File
===================

Resource states are persisted in ``/.downloads/.state.json``:

.. code-block:: json

    {
      "version": 1,
      "resources": {
        "TRAIN_IMAGES": {"state": "complete"},
        "TRAIN_LABELS": {"state": "partial"}
      }
    }

.. _deprecated-download-decorators:

Deprecated: Download Decorators
===============================

.. deprecated::
   The decorator-based API still works but emits deprecation warnings.
   Migrate to the class-attribute approach described above.

Download decorators are applied above the ``@dataset`` decorator and pass
downloaded file paths as arguments to the dataset function.

.. code-block:: python

    from datamaestro.download.single import filedownloader
    from datamaestro.definitions import dataset

    # DEPRECATED: use class-attribute approach instead
    @filedownloader("data", "http://example.com/data.csv")
    @dataset(MyData)
    def my_dataset(data):
        # 'data' receives the downloaded Path
        return MyData(path=data)

filedownloader (decorator)
--------------------------

.. code-block:: python

    # DEPRECATED
    @filedownloader("data.csv", "http://example.com/data.csv")
    @dataset(MyData)
    def compressed_dataset(data):
        return MyData(path=data)

concatdownload (decorator)
--------------------------

.. code-block:: python

    # DEPRECATED
    @concatdownload(
        "combined",
        "http://example.com/part1.txt",
        "http://example.com/part2.txt",
    )
    @dataset(MyData)
    def concatenated_dataset(combined):
        return MyData(path=combined)

zipdownloader / tardownloader (decorator)
-----------------------------------------

.. code-block:: python

    from datamaestro.download.archive import zipdownloader, tardownloader

    # DEPRECATED
    @zipdownloader("data", "http://example.com/archive.zip")
    @dataset(MyData)
    def zipped_dataset(data):
        return MyData(path=data / "file.csv")

    # DEPRECATED
    @tardownloader("data", "http://example.com/archive.tar.gz")
    @dataset(MyData)
    def tar_dataset(data):
        return MyData(path=data / "file.csv")

Multiple decorators (deprecated)
---------------------------------

.. code-block:: python

    # DEPRECATED
    @filedownloader("train", "http://example.com/train.csv")
    @filedownloader("test", "http://example.com/test.csv")
    @dataset(MyData)
    def multi_resource_dataset(train, test):
        return MyData(train_path=train, test_path=test)

Custom handler (deprecated)
----------------------------

.. code-block:: python

    from datamaestro.download import Download

    # DEPRECATED: use FileResource / FolderResource instead
    class MyDownload(Download):
        def __init__(self, varname, custom_param):
            super().__init__(varname)
            self.custom_param = custom_param

        def prepare(self):
            return self._download_and_process()

        def download(self, force=False):
            if force or not self._is_cached():
                self._do_download()

        def hasfiles(self) -> bool:
            return True

Deprecated Names
----------------

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Deprecated
     - Replacement
   * - ``Download`` (base class)
     - ``Resource``
   * - ``hasfiles()``
     - ``has_files()``
   * - ``Resource.definition``
     - ``Resource.dataset``
   * - ``Resource.varname``
     - ``Resource.name``
   * - ``@filedownloader(...)`` (decorator)
     - ``FileDownloader(...)`` (class attribute)
   * - ``SingleDownload``
     - ``FileDownloader``
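
When migrating away from the deprecated names in the table, it can help to
make leftover uses visible. The sketch below relies only on the standard
``warnings`` module. ``LegacyFriendlyResource`` is a hypothetical stand-in,
not the datamaestro ``Resource`` class, and the use of the
``DeprecationWarning`` category is an assumption about how such aliases are
typically reported.

.. code-block:: python

    import warnings


    class LegacyFriendlyResource:
        """Hypothetical stand-in class; not the datamaestro ``Resource``."""

        def has_files(self) -> bool:
            return True

        def hasfiles(self) -> bool:
            """Deprecated spelling that forwards to ``has_files``."""
            warnings.warn(
                "hasfiles() is deprecated, use has_files()",
                DeprecationWarning,
                stacklevel=2,
            )
            return self.has_files()


    resource = LegacyFriendlyResource()

    # Record warnings to list the deprecated calls exercised by the code.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        old_result = resource.hasfiles()

    deprecated_calls = [
        w for w in caught if issubclass(w.category, DeprecationWarning)
    ]

    # Turning the warning into an error makes leftover uses fail fast.
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        try:
            resource.hasfiles()
            raised = False
        except DeprecationWarning:
            raised = True

Running a test suite with the ``error`` filter enabled is one way to confirm
that no deprecated call sites remain before the old names are removed.
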