Interactive Storage Layout Visualization of HDF5 Object Placement

Scientists around the world rely on self-describing data formats such as HDF5 and NetCDF to store scientific data and exchange results. As data generation capabilities grow, driven by high-resolution scientific instruments (such as telescopes, sensors in particle accelerators, and remote-sensing satellites) as well as supercomputer simulations, scientific discovery is increasingly limited by the ability to store and access large volumes of data. Data access performance depends heavily on how these self-describing formats structure and organize their contents. A key challenge in achieving good storage performance is therefore to map and align these data structures so that they work well with the underlying technical systems. Because scientific users and supporting software libraries apply a wide range of optimization strategies, the resulting mappings from logical subsets of data to a concrete on-storage serialization are often hard to picture. Visualizations from an I/O perspective can help to show in detail how data is laid out.

HDF5 provides rich metadata about the in-file position and size of internal objects such as groups, datasets, and chunks. While this metadata is sometimes consulted when optimizing I/O performance, an interactive exploration tool that is easily accessible to users and developers, for example through Jupyter notebooks, is still missing.
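To give an impression of the kind of information meant here, the following minimal sketch collects byte ranges of dataset storage with the low-level h5py API. It is an illustration under assumptions, not a proposed design: the names Extent, collect_extents, and find_gaps are hypothetical, only dataset storage is covered (group and other metadata placement is not), and the per-chunk queries assume h5py 2.10 or later built against HDF5 1.10.5 or later.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    """One contiguous byte range in the file (hypothetical record type)."""
    name: str    # HDF5 object path, with chunk coordinates appended for chunks
    offset: int  # byte offset as reported by the HDF5 library
    size: int    # length in bytes


def collect_extents(path):
    """Gather byte ranges of dataset storage from an HDF5 file."""
    import h5py  # imported lazily so the helpers below need only the stdlib

    extents = []

    def visit(name, obj):
        if not isinstance(obj, h5py.Dataset):
            return
        if obj.chunks is None:
            # Contiguous layout: one range. get_offset() returns None for
            # compact datasets or when no space has been allocated yet.
            offset = obj.id.get_offset()
            if offset is not None:
                extents.append(Extent(name, offset, obj.id.get_storage_size()))
        else:
            # Chunked layout: one range per allocated chunk.
            for i in range(obj.id.get_num_chunks()):
                info = obj.id.get_chunk_info(i)
                label = f"{name}{info.chunk_offset}"
                extents.append(Extent(label, info.byte_offset, info.size))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return sorted(extents, key=lambda e: e.offset)


def find_gaps(extents, file_size):
    """Return (offset, size) ranges not covered by any collected extent."""
    gaps, cursor = [], 0
    for e in sorted(extents, key=lambda e: e.offset):
        if e.offset > cursor:
            gaps.append((cursor, e.offset - cursor))
        cursor = max(cursor, e.offset + e.size)
    if cursor < file_size:
        gaps.append((cursor, file_size - cursor))
    return gaps
```

Calling collect_extents on a file and passing the result, together with the file size, to find_gaps would show which regions hold raw dataset bytes and which regions hold something else, such as file metadata or unused space.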

As part of this thesis, you will develop a specification and implement a Python module to conveniently access, expose for reuse, and visualize this information. Besides static visualizations using matplotlib, interactive visualization components that integrate with Jupyter notebooks should also be considered. This topic allows you to apply and advance your Python and JavaScript (for cross-platform visualization) programming skills. You will learn about the low-level details of scientific data formats relied on by supercomputing centers around the world as well as by major scientific institutions such as CERN and NASA. You will also learn advanced techniques for presenting results with Jupyter notebooks and popular JavaScript visualization libraries such as D3.js or Three.js.
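One conceivable way to expose layout information for reuse is a plain JSON document that both a Python plotting routine and a JavaScript frontend could consume. The sketch below only illustrates the idea; the function name layout_to_json and the record fields are assumptions made for this example, and defining the actual specification is part of the thesis.

```python
import json


def layout_to_json(extents, file_size):
    """Serialize (name, offset, size) tuples into a JSON string.

    The schema used here is hypothetical and chosen for illustration only.
    """
    records = [
        {"name": name, "offset": offset, "size": size, "end": offset + size}
        for name, offset, size in sorted(extents, key=lambda e: e[1])
    ]
    return json.dumps({"file_size": file_size, "extents": records}, indent=2)


# Example with made-up byte ranges for two objects in a 1000-byte file.
example = layout_to_json([("/b", 500, 250), ("/a", 0, 100)], 1000)
```

Keeping the exchange format independent of any plotting library would let static and interactive visualizations share the same input.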

Contact: Michael Kuhn and Jakob Lüttgau

Last Modification: 11.01.2021 - Contact Person: Webmaster