Dealing with Badvoltage-Induced Failures in HPC

Hello Everyone,

I hope this message finds you all well. Our High-Performance Computing (HPC) cluster has been experiencing serious problems that we believe are Badvoltage-Induced Failures, and we would greatly appreciate your expertise in resolving them.

Problem Overview: In recent weeks, our HPC cluster has suffered unexpected failures and downtime, which we suspect are caused by voltage-related issues. These failures are significantly disrupting our research workloads.

Symptoms:

  1. Sudden node crashes and reboots with no apparent trigger.
  2. Intermittent performance degradation on some nodes during peak usage.
  3. Unexplained system errors and kernel panics in the logs.

Despite our best efforts, we have not been able to isolate the exact cause of these Badvoltage-Induced Failures. We’re turning to this knowledgeable community for guidance, as some of you may have faced similar challenges. Please share your experiences, insights, and recommendations.

Thank you in advance.
Best regards,