Analysis of HDF5 Access Patterns

In high-performance computing (HPC), applications usually distribute their data across several compute nodes so that the computation can run in parallel. Applications such as Enzo often store their data in so-called self-describing data formats, in this case HDF5 (Hierarchical Data Format). Such formats allow storing additional metadata (attributes) that describe the variables. This way, files can be exchanged easily between research sites without separately transferring extensive information about their structure, because that information is encoded in the file itself. Given a multidimensional array, the attributes describe the data stored in each dimension, e.g., that the first one is the time and the second the air temperature at an altitude of two meters.

In climate research, data sizes quickly accumulate to several petabytes a year, which need to be stored for decades. Due to acquisition and maintenance costs, tape archives are still the most feasible solution. However, the latency they introduce is enormous, which is why selecting exactly the data that has to be retrieved is especially important.

An example: If one is interested in the extreme temperatures that have been measured over the last 50 years in Europe, it is helpful to know which tape actually contains the relevant data. If the data is simply stored inside the HDF5 file, then the whole file, easily several terabytes, has to be loaded from tape to be analysed. As part of our current research, we dissect such formats and store the actual data in an object store and the attributes in a database. This way the metadata can be queried directly to ease post-processing. For our example, we can query the database and check whether the temperature of a specific region was higher or lower than a threshold, narrowing down which data needs to be read. Furthermore, frequently computed results such as mean values can also be stored in the database directly, reducing the number of tape accesses even more.
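The idea of querying metadata before touching the data can be illustrated with a minimal sketch. It uses SQLite from the Python standard library as a stand-in for the database; the table layout, the column names, the object IDs and all numeric values are assumptions made up for illustration and do not reflect the actual schema or data used in this research.

```python
import sqlite3

# Hypothetical metadata table: one row per data object kept in the object store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE variable_stats (
        object_id  TEXT PRIMARY KEY,  -- key of the data object (assumed naming)
        variable   TEXT,
        region     TEXT,
        year       INTEGER,
        min_value  REAL,
        max_value  REAL,
        mean_value REAL               -- precomputed result stored with the metadata
    )
    """
)

# Made-up example rows; the values are illustrative, not measurements.
rows = [
    ("obj-001", "air_temperature_2m", "Europe", 1990, -31.0, 39.5, 9.8),
    ("obj-002", "air_temperature_2m", "Europe", 2003, -28.4, 47.3, 10.6),
    ("obj-003", "air_temperature_2m", "Asia", 2003, -52.1, 49.0, 8.1),
]
conn.executemany("INSERT INTO variable_stats VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def objects_exceeding(conn, variable, region, threshold):
    """Return the IDs of objects whose stored maximum exceeds the threshold."""
    cursor = conn.execute(
        "SELECT object_id FROM variable_stats "
        "WHERE variable = ? AND region = ? AND max_value > ? "
        "ORDER BY object_id",
        (variable, region, threshold),
    )
    return [row[0] for row in cursor]


# Only the objects returned here would have to be fetched from tape.
candidates = objects_exceeding(conn, "air_temperature_2m", "Europe", 45.0)
print(candidates)  # ['obj-002']
```

In this sketch the query is answered entirely from the database, so only the single matching object would have to be read from the archive instead of the complete file.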

The topic of this bachelor thesis is to explore the access patterns of an HDF5 application such as Enzo by logging the I/O calls in JULEA. These logs can then be used to examine which database concept and which specific implementation are most suitable for such access patterns.

Contact: Kira Duwe

Last Modification: 17.09.2020 - Contact Person: Webmaster