HPC Services staff members Yong Qin and Michael Jennings gave talks highlighting their respective software tools, wwibcheck and NHC, at the DellXL High Performance Computing (HPC) conference this week, April 21-23, 2015, in Boulder, Colorado.
Most HPC systems rely on a high-performance, low-latency interconnect network to link compute nodes in a way that supports tightly coupled computations, in which the nodes must exchange large amounts of data as part of the computation. Yong’s talk focused on troubleshooting failures in HPC InfiniBand interconnects using his software tool, wwibcheck, which helps system administrators isolate and identify InfiniBand equipment failures or performance problems that affect the execution time of compute jobs.
Michael Jennings gave a talk on his Warewulf Node Health Check (NHC) utility. NHC runs in conjunction with the system’s job scheduler, carrying out a pre-check to detect potential problems with compute nodes before a job starts and optionally marking bad nodes as “offline.” This highly configurable utility works with popular schedulers and resource managers, such as SchedMD’s Slurm and Adaptive Computing’s Moab and TORQUE.
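The pre-check idea described above can be illustrated with a minimal Python sketch. This is not NHC's actual implementation or configuration syntax; the names `run_checks`, `precheck`, and the `mark_offline` callback are hypothetical and stand in for whatever mechanism a scheduler provides to take a node out of service.

```python
from typing import Callable, List, Tuple

# A check is a (name, function) pair; the function returns True if the node is healthy.
Check = Tuple[str, Callable[[], bool]]


def run_checks(checks: List[Check]) -> List[str]:
    """Run every check and return the names of those that failed."""
    failed = []
    for name, check in checks:
        try:
            passed = check()
        except Exception:
            # A check that crashes is treated as a failure (assumption of this sketch).
            passed = False
        if not passed:
            failed.append(name)
    return failed


def precheck(checks: List[Check], mark_offline: Callable[[str], None]) -> bool:
    """Return True if the node may accept a job; otherwise report it offline.

    mark_offline is a hypothetical callback that receives a reason string.
    """
    failed = run_checks(checks)
    if failed:
        mark_offline("failed checks: " + ", ".join(failed))
        return False
    return True
```

A caller would pass checks such as "enough free disk space" or "required filesystem mounted" and a callback that notifies the scheduler; both are site-specific and are left out here.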
Yong and Michael are part of the High Performance Computing Services Group in the IT Division, which supports the Lawrencium computational cluster for use by Berkeley Lab PIs.