Storing data

Florian Ziemen, Tobi Kölling, Lukas Kluft

Datasets

Without subset access and hierarchies, analysis scripts are forced to load far more data than they need.

Chunk along all dimensions: a compromise that works reasonably well for every access pattern.
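To see why chunking along all dimensions balances access patterns, one can count how many chunks a read has to touch. The sketch below is illustrative only: the array shape `(1000, 100, 100)` for (time, y, x) and the three chunk shapes are hypothetical, and `chunks_touched` is a helper written for this example, not part of any library.

```python
def chunks_touched(array_shape, chunk_shape, selection):
    """Count the chunks overlapping a selection.

    `selection` holds one (start, stop) pair per axis, stop exclusive.
    """
    count = 1
    for (start, stop), chunk, size in zip(selection, chunk_shape, array_shape):
        stop = min(stop, size)
        first = start // chunk        # index of the first chunk hit on this axis
        last = (stop - 1) // chunk    # index of the last chunk hit on this axis
        count *= last - first + 1
    return count


# Hypothetical array: 1000 time steps on a 100 x 100 grid.
shape = (1000, 100, 100)
time_series = ((0, 1000), (5, 6), (5, 6))      # all times at one grid point
spatial_slice = ((0, 1), (0, 100), (0, 100))   # one time step, whole grid

for chunks in [(1000, 1, 1), (1, 100, 100), (10, 10, 10)]:
    print(
        chunks,
        chunks_touched(shape, chunks, time_series),
        chunks_touched(shape, chunks, spatial_slice),
    )
```

With these made-up numbers, the two layouts that favor a single axis need 1 chunk for one access pattern and 10000 or 1000 chunks for the other, while the `(10, 10, 10)` layout needs 100 chunks for either.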

Any access requires decompressing entire chunks.
Keep them small to reduce the amount of data that is decompressed but not used.
Keep them big enough for the compressor to do its job.
Chunks of roughly a megabyte are usually a good compromise.
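As a quick sanity check on chunk sizes, the uncompressed size of a chunk is the product of its shape times the size of one value. The sketch below assumes 4-byte values (for example 32-bit floats); that is an assumption of this example, and `chunk_nbytes` is a hypothetical helper.

```python
import math


def chunk_nbytes(chunk_shape, itemsize=4):
    """Uncompressed size of one chunk in bytes (itemsize=4 is an assumption)."""
    return math.prod(chunk_shape) * itemsize


print(chunk_nbytes((92, 9, 11)))    # 36432 bytes, roughly 36 KB
print(chunk_nbytes((46, 6, 8)))     # 8832 bytes
print(chunk_nbytes((64, 64, 64)))   # 1048576 bytes, i.e. 1 MiB
```

The first two shapes are the small chunk shapes from the benchmark table at the end of this section. The third is an arbitrary shape that lands at about one megabyte under the same assumption.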

As chunks are stored separately, this scales to datasets of any size. For example, we work with a 500 TB dataset in nextGEMS.
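Storing chunks separately means that locating the data for an element is plain integer arithmetic, independent of the dataset size. The sketch below follows the dot-separated chunk naming used by default in Zarr version 2 stores. The helper `chunk_key` and the example index are hypothetical, and other store layouts may name their chunks differently.

```python
def chunk_key(index, chunk_shape, separator="."):
    """Name of the chunk holding the element at `index` (Zarr v2 style)."""
    return separator.join(str(i // c) for i, c in zip(index, chunk_shape))


# The element at (time=100, y=20, x=30) with 46 x 6 x 8 chunks:
print(chunk_key((100, 20, 30), (46, 6, 8)))  # 2.3.3
```

A reader only has to fetch and decompress the object with that key, no matter how many other chunks exist.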

We can index a multi-file HDF5 dataset with kerchunk and then expose it as a pseudo-filesystem Zarr store with fsspec in Python.
This allows a set of netCDF4 files to be treated as one Zarr dataset.
Direct access works only from Python.
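The indexing step might look roughly like the sketch below. It is a minimal outline and not the workflow actually used for the datasets mentioned here: the function name `build_reference_mapper`, the default concatenation dimension `"time"`, and the file paths are assumptions, and the exact behavior depends on the installed kerchunk, fsspec, and zarr versions.

```python
def build_reference_mapper(paths, concat_dim="time"):
    """Index HDF5/netCDF4 files and return a Zarr-like key-value mapper."""
    if not paths:
        raise ValueError("need at least one input file")

    # Imported lazily so the function can be defined without the optional
    # third-party packages being installed.
    import fsspec
    from kerchunk.combine import MultiZarrToZarr
    from kerchunk.hdf import SingleHdf5ToZarr

    # One reference set per file: byte ranges of every chunk, no data copied.
    single_refs = []
    for path in paths:
        with fsspec.open(path, "rb") as f:
            single_refs.append(SingleHdf5ToZarr(f, path).translate())

    # Merge the per-file references along the assumed concatenation dimension.
    combined = MultiZarrToZarr(single_refs, concat_dims=[concat_dim]).translate()

    # The "reference" filesystem resolves Zarr keys to byte ranges in the files.
    fs = fsspec.filesystem("reference", fo=combined)
    return fs.get_mapper("")
```

The returned mapper can then be handed to a Zarr reader, for instance xarray's `open_dataset` with `engine="zarr"` and consolidated metadata switched off. Whether that call accepts the mapper unchanged depends on the library versions in use.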
A simple Python web server can present it as Zarr via HTTPS for other languages.
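Such a server can be very small, because a Zarr store is just a mapping from keys to bytes. The sketch below uses only the standard library and serves any dict-like `store` over plain HTTP. It is an illustration under assumptions: `make_handler`, `serve`, and the toy store contents are hypothetical, TLS is left out (it would typically be added by a reverse proxy), and a real deployment would pass in the reference mapper instead of a dict.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(store):
    """Build a request handler that serves the values of `store` by key."""

    class StoreHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            key = self.path.lstrip("/")
            try:
                body = store[key]
            except KeyError:
                self.send_error(404, "no such key")
                return
            self.send_response(200)
            self.send_header("Content-Type", "application/octet-stream")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the example quiet

    return StoreHandler


def serve(store, host="127.0.0.1", port=0):
    """Start serving `store` in a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer((host, port), make_handler(store))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Toy store standing in for the reference mapper (contents are made up).
toy_store = {".zgroup": b'{"zarr_format": 2}', "t/0.0": b"\x00" * 8}
server = serve(toy_store)
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/.zgroup") as response:
    fetched = response.read()
server.shutdown()
server.server_close()
```

Any client that can issue HTTP GET requests and read Zarr can then access the data, regardless of language.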

Storage layout, chunk shapes	Read time series (sec)	Read spatial slice (sec)	Performance bias (slowest / fastest)
Contiguous favoring time range	0.013	180	14000
Contiguous favoring spatial slice	200	0.012	17000
Default (all axes equal) chunks, 4673 x 12 x 16	1.4	34	24
36 KB chunks, 92 x 9 x 11	2.4	1.7	1.4
8 KB chunks, 46 x 6 x 8	1.4	1.1	1.2