Technical details for host teams
Requirements (rough):
- Combined data and compute to support at least 100 users analyzing at least 100 TB datasets.
- 1 PB data resource (100 TB minimum)
- Large (on the order of 1000 cores) multi-core analysis platform with fast access to the data resource (could be separate from the meeting place) and Jupyter support.
- One DYAMOND-Annual simulation on HEALPix (other data and other formats can be hosted as desired, but each participant must host at least one standardized dataset).
Providing the data
For providing the data at hackathons, a hierarchy of resolutions in space and time has proven very useful: data that is available at a fine spatial or temporal resolution is also made available on equivalent grids at all coarser resolution levels. See the nextGEMS blog or easy.gems for more details.
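As an illustration, on a HEALPix grid in nest ordering such a hierarchy can be built by averaging each group of four child cells into their parent cell. A minimal numpy sketch (the function name and zoom level are ours, not part of any official tooling):

```python
import numpy as np

def coarsen_one_level(field):
    """Average a nest-ordered HEALPix field to the next coarser zoom level.

    In nest ordering, the four children of parent cell i are cells
    4*i .. 4*i + 3, so coarsening is a simple reshape-and-mean.
    """
    return field.reshape(-1, 4).mean(axis=1)

# Illustrative field at zoom level 3 (12 * 4**3 = 768 cells):
fine = np.random.rand(12 * 4**3)
hierarchy = [fine]
while hierarchy[-1].size > 12:  # stop at the 12 base cells (zoom 0)
    hierarchy.append(coarsen_one_level(hierarchy[-1]))
```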
The data request
The data request is under development, and we are working on a Python script to verify that a dataset matches the request.
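Until that script is available, the sketch below shows what such a check could look like with xarray; the expected variable names and dimensions are placeholders, not the actual data request:

```python
import xarray as xr

# Placeholder expectations; the real data request is still being defined.
EXPECTED = {
    "tas": ("time", "cell"),
    "pr": ("time", "cell"),
}

def check_dataset(path):
    """Return a list of mismatches between a dataset and the (placeholder) request."""
    ds = xr.open_dataset(path)
    problems = []
    for name, dims in EXPECTED.items():
        if name not in ds:
            problems.append(f"missing variable: {name}")
        elif tuple(ds[name].dims) != dims:
            problems.append(f"{name}: dims {tuple(ds[name].dims)} != {dims}")
    return problems
```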
Grids
The HEALPix grid (Górski et al., 2005) has proven very useful for providing the data: it features equal-area cells on iso-latitude rings and lends itself naturally to a hierarchy of resolutions. It also offers a cell ordering (nest) that reflects this hierarchy, so regions that are close in index space are usually also close in geographical space. This makes it efficient to read only a region of a dataset from disk. See easy.gems for more info.
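For example, selecting all cells within a few degrees of a point can be done with healpy; in nest ordering the returned indices cluster into a few contiguous runs, so only a small part of the file has to be read. A sketch (the nside, coordinates, and dataset layout are illustrative):

```python
import numpy as np
import healpy as hp

nside = 2**7                                   # illustrative zoom level
center = hp.ang2vec(10.0, 50.0, lonlat=True)   # lon, lat in degrees
cells = hp.query_disc(nside, center, radius=np.deg2rad(5.0), nest=True)

# For a dataset whose "cell" dimension is in nest ordering (e.g. opened
# with xarray), the regional subset is then simply: ds.isel(cell=cells)
```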
Catalogs
Grouping the datasets in catalogs abstracts away file-system paths and eases later dataset updates and migrations. Intake has proven useful here; mind that there are versions one and two, and that the two versions are not necessarily compatible.
See easy.gems for examples of the use of intake in the context of previous hackathons.
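A minimal sketch of opening such a catalog with intake version one (the URL and dataset key are placeholders, and it assumes an intake-xarray data source; see easy.gems for the real catalogs):

```python
import intake  # intake version one; version-two catalogs use a different API

# Placeholder catalog URL and dataset key, for illustration only.
cat = intake.open_catalog("https://example.org/hackathon/catalog.yaml")
ds = cat["dyamond_annual_healpix"].to_dask()  # opens the dataset lazily
```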