[Beowulf] Monitoring and Metrics

Josh Catana jcatana at gmail.com
Sat Oct 7 05:21:08 PDT 2017


This may have been brought up in the past, but I couldn't find much in my
message archive.
What are people using for HPC cluster monitoring and metrics lately? I've
been short on time to add features to my home-grown solution and am looking at
some off-the-shelf (OTS) products.
I'm looking for something that can do monitoring and alerting on conditions,
broken hardware, etc.
I'd also like something that does system resource utilization metrics. If it
has a plug-in for a scheduling system like PBS, so that I can correlate a job
ID with the metrics of the systems it is currently running on, or ran on at
the time, that would be an amazing plus.
Do any of you Beowulfers have suggestions?