Debugging Solution for SPMD Applications
Parallel applications running on supercomputers most often follow the SPMD (Single Program, Multiple Data) model. Instead of running multiple threads on a single shared-memory system, they consist of many processes that run the same program on multiple compute nodes of a cluster. Communication between the processes is handled by message-passing libraries such as MPI (Message Passing Interface).
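To make the SPMD idea concrete, here is a minimal sketch in Python of the usual pattern: every process executes the same code and uses its rank to select its own share of the data. The helper names `block_range` and `local_sum` are hypothetical and chosen for illustration only. In a real MPI program the rank and the process count would come from `MPI_Comm_rank` and `MPI_Comm_size`, and the partial results would be combined with a reduction; here a plain loop stands in for the separate processes.

```python
def block_range(rank, size, n):
    """Return the half-open index range [start, stop) owned by `rank`
    when n items are split as evenly as possible across `size` processes."""
    base, extra = divmod(n, size)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def local_sum(rank, size, data):
    """The code every process runs: same logic, different slice of the data."""
    start, stop = block_range(rank, size, len(data))
    return sum(data[start:stop])


if __name__ == "__main__":
    data = list(range(10))
    size = 4  # assumption: stand-in for the value MPI_Comm_size would report
    # Sequential stand-in for `size` processes each calling local_sum.
    partial = [local_sum(rank, size, data) for rank in range(size)]
    print(partial, sum(partial))
```

A debugger for such an application has to deal with exactly this situation: identical code, but a different state in every process.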
Developing such parallel applications is complex and error-prone. Parallelism introduces new classes of errors, such as data races and deadlocks, and it makes bugs much harder for developers to locate.
Good debugging tools therefore become essential when working on large-scale programs. Debugging solutions for MPI applications exist, but most of them are proprietary and thus in practice out of reach for many developers.
As part of this thesis, you will explore possible approaches to developing an interactive debugger that can monitor the multiple processes of an SPMD application. An important first step will be to research suitable libraries and approaches for implementing such a tool. Working on this topic will most likely require you to work with low-level C libraries such as the GDB backend.
- https://sourceware.org/gdb/onlinedocs/gdb/
- https://sourceware.org/gdb/papers/libgdb2/libgdb_toc.html
- https://sourceware.org/gdb/onlinedocs/gdb/GDB_002fMI.html
- http://davis.lbl.gov/Manuals/GDB/gdb_17.html#SEC137
- https://www.embecosm.com/appnotes/ean4/embecosm-howto-rsp-server-ean4-issue-2.html
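One conceivable starting point, given here only as a sketch and not as the intended design of the thesis, is to attach one GDB instance per process in GDB/MI mode and aggregate their output. GDB/MI emits line-oriented records whose first character identifies the record type (`^` for results, `*`, `+` and `=` for asynchronous records, `~`, `@` and `&` for stream output). The Python code below parses such lines and keeps a per-rank run state. The function names, the shortened kind labels, and the state dictionary are assumptions made for illustration, and nested tuples and lists in the payload are deliberately flattened.

```python
import re

# Leading character of a GDB/MI output line -> shorthand record kind.
RECORD_KINDS = {
    "^": "result",
    "*": "exec-async",
    "+": "status-async",
    "=": "notify-async",
    "~": "console-stream",
    "@": "target-stream",
    "&": "log-stream",
}

# Optional numeric token, the kind character, then the rest of the line.
LINE_RE = re.compile(r'^(\d*)([\^*+=~@&])(.*)$')
# key="value" pairs; backslash escapes are allowed inside the value.
PAIR_RE = re.compile(r'([A-Za-z_][\w-]*)="((?:[^"\\]|\\.)*)"')


def parse_mi_line(line):
    """Parse one GDB/MI output line into a dict.

    Returns None for the "(gdb)" prompt and for unrecognized lines.
    Simplification: nesting is ignored, so every key="value" pair in
    the payload ends up in one flat dict.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    token, kind_char, rest = m.groups()
    kind = RECORD_KINDS[kind_char]
    if kind.endswith("stream"):
        return {"token": token or None, "kind": kind, "text": rest}
    cls, _, payload = rest.partition(",")
    return {
        "token": token or None,
        "kind": kind,
        "class": cls,
        "fields": dict(PAIR_RE.findall(payload)),
    }


def update_rank_states(states, rank, line):
    """Record (class, reason) for a rank whenever it starts or stops running."""
    rec = parse_mi_line(line)
    if rec and rec["kind"] == "exec-async" and rec["class"] in ("running", "stopped"):
        states[rank] = (rec["class"], rec["fields"].get("reason"))
    return states
```

For example, feeding `*stopped,reason="breakpoint-hit"` from the GDB instance attached to rank 1 into `update_rank_states` marks that rank as stopped at a breakpoint while the other ranks keep their previous state. A full implementation would need a proper parser for the nested MI value syntax.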
Contact: Michael Blesel