Catalogs

Tobias Kölling

2024-10-20

Catalogs, why?

for users

simplify access to data

import os
import urllib.request
import xarray as xr
import shutil

if not os.path.exists("some_data"):
    urllib.request.urlretrieve("https://example.org/some_data.zip", "some_data.zip")
    shutil.unpack_archive("some_data.zip", "some_data")

ds = xr.open_mfdataset("some_data/*.nc")

vs

import intake
cat = intake.open_catalog("https://example.org/catalog.yaml")
ds = cat["some_data"].to_dask()

make datasets findable

e.g. STAC datasets

Metadata in catalogs can be accessed faster
than when buried inside datasets.

This enables quick browsing, search, and quicklook tools.

for dataset providers

simplify data movement

Once the data has been moved, just update the catalog and users seamlessly access it from the new location.

simplify encoding changes

  • The catalog describes how to open the data
  • The data encoding can be changed (zipped CSV -> HDF5 -> Zarr)
  • Users automatically use the new encoding after a catalog update
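As a sketch (entry name and URLs are made up, and the zarr driver assumes the intake-xarray plugin), an Intake YAML entry might change like this while keeping its name:

```yaml
sources:
  some_data:
    # was: driver: csv  with  urlpath: "https://example.org/some_data/*.csv.gz"
    driver: zarr        # new encoding, same entry name
    args:
      urlpath: "https://example.org/some_data.zarr"
```

User code such as cat["some_data"].to_dask() keeps working unchanged.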

aid with distributed access

  • Returned catalog entries may depend on the user’s location
  • Users can run the same code everywhere, yet be directed to a copy in the local data center
  • This may even mean HDF5 on Lustre in one data center and Zarr on S3 in another
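One way this can work is for the catalog (or a thin client around it) to resolve an entry against the user's site. A minimal sketch, where the site names, paths, URLs, and the HACKATHON_SITE environment variable are all made up for illustration:

```python
import os

# Hypothetical copies of the same dataset at different data centers.
LOCATIONS = {
    "site-a": "/work/some_data.nc",                        # HDF5 on Lustre
    "site-b": "https://s3.site-b.example/some_data.zarr",  # Zarr on S3
}
DEFAULT = "https://example.org/some_data.zarr"             # fallback for everyone else


def resolve(locations, default, site=None):
    """Return the dataset location closest to the user."""
    site = site or os.environ.get("HACKATHON_SITE")        # assumed site marker
    return locations.get(site, default)
```

The same user code then opens whichever copy `resolve` picks, without local changes.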

hack around broken datasets

😬

  • Complex catalog entries can be used to concatenate, mix, slice, etc. a collection of poorly prepared datasets.

  • May be better than nothing, but usually comes with a serious performance penalty.

catalog basics

catalog

A list / tree / collection of catalog entries.

May be static, dynamic, searchable, etc.

catalog entry

  • has an identity
  • can be retrieved
  • locates (or identifies) a dataset
  • instructs how to open a dataset
  • may carry additional metadata
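In Python terms, such an entry can be sketched as a small record; the field names here are illustrative, not tied to any particular catalog format:

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    id: str        # identity: the key under which the entry is retrieved
    urlpath: str   # locates (or identifies) the dataset
    driver: str    # instructs how to open it, e.g. "zarr"
    metadata: dict = field(default_factory=dict)  # optional extra metadata


# a toy catalog: a mapping from identity to entry (URL is made up)
catalog = {
    "some_data": CatalogEntry(
        id="some_data",
        urlpath="https://example.org/some_data.zarr",
        driver="zarr",
        metadata={"project": "demo"},
    ),
}
```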

implementations

filesystem directories

  • ✅ can be a simple option
  • ✅ support symlinks
  • ❌ not really a catalog (doesn’t aggregate metadata)
  • ❌ only shows what’s on the filesystem

Intake yaml

  • ✅ easy to create
  • ✅ compatible with any kind of data
  • ❌ limited to Python
  • ❌ unstable format (Intake 2 broke a lot of things)
  • 🤔 has room for creative hacks

SpatioTemporal Asset Catalogs (STAC)

  • ✅ stable format
  • ✅ integrations for many languages
  • ✅ can be used with Intake
  • ❌ more complicated to create (but tools exist)
  • ❌ can only be used for spatio-temporal datasets
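The spatio-temporal restriction follows from the format itself: every STAC item must carry a geometry/bbox and a datetime. A minimal item as a plain dict, where the id, coordinates, and asset URL are made up:

```python
# Minimal STAC item following the STAC 1.0.0 item spec;
# id, coordinates, and asset href are made up for illustration.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "some_scene",
    "geometry": {"type": "Point", "coordinates": [10.0, 53.5]},
    "bbox": [10.0, 53.5, 10.0, 53.5],
    "properties": {"datetime": "2024-10-20T00:00:00Z"},  # required temporal info
    "assets": {
        "data": {
            "href": "https://example.org/some_scene.tif",
            "type": "image/tiff; application=geotiff",
        }
    },
    "links": [],
}
```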

Intake ESM

  • made for CMIP6
  • aims at assembling big datasets out of many individual ones, which I wouldn’t recommend

THREDDS Dataset Inventory Catalogs

  • catalog specific to the THREDDS data server (e.g. OPeNDAP)
  • exposes what’s available on that specific server

for the hackathon

ease of access

There are many computing facilities.
We want to work together.

cat = get_hackathon_catalog()
ds = cat.get_dataset("some_model_run_output_id")

  • concise
  • fast
  • across data centers
  • no local code changes
  • supports different storage methods and formats