[Beowulf] Monitoring and reporting Infiniband errors
John Hearns
hearnsj at googlemail.com
Thu Jun 19 07:14:38 PDT 2014
pps. I guess I could clear the errors every time this runs, but have
decided to just do an initial clear of the errors and look at the
cumulative rate.
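
For what it's worth, here is a minimal sketch of how the cumulative approach could be turned into a per-run delta. It is not part of the script below. The function name error_delta and the state file are made up for illustration, and it assumes you already have a single cumulative counter value to hand (for example, one read from a port's sysfs counters).

```shell
# error_delta STATE_FILE CURRENT
# Hypothetical helper: prints how much a cumulative counter has grown since
# the previous run, then stores CURRENT in STATE_FILE for the next run.
# A missing or empty state file is treated as a previous value of 0.
error_delta() {
    local state_file="$1"
    local current="$2"
    local previous=0
    if [ -r "$state_file" ]; then
        previous=$(cat "$state_file")
    fi
    echo "$current" > "$state_file"
    echo $(( current - ${previous:-0} ))
}
```

Something like error_delta /var/tmp/ib0-errors.last "$count" run from cron would then give the growth per interval, which can be compared against a threshold instead of alerting on any non-zero total.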
ppps. there is a better list for this chatter, isn't there...
On 19 June 2014 15:10, John Hearns <hearnsj at googlemail.com> wrote:
> If anyone is interested, here is my solution, which seems good enough.
> Someone will no doubt say there is a neater way!
>
> A shell script which runs ibqueryerrors and returns 1 if anything is found:
>
> #!/bin/bash
> # check for errors on the Infiniband fabric 0
> # another script runs for port 1
>
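> # tail -n +2 drops the first line of the ibqueryerrors output
> errors=$(/usr/sbin/ibqueryerrors -c -s XmtWait -P0 | tail -n +2)
> if [ -n "$errors" ] ; then
>     echo "Check for errors on Infiniband Fabric 0"
>     echo
>     # quote the variable so the report keeps its line breaks
>     echo "$errors"
>     exit 1
> else
>     exit 0
> fi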
>
> For Monit monitoring, exit 0 means the service is OK, exit 1 means there
> is a problem.
>
> So in monit:
>
> check program ib0-errors with path "/usr/local/bin/check-ib0.sh"
> every "30 * * * *"
> if status == 1 then alert
> alert my.email at domain.com with reminder on 30 cycles
> set mail-format { subject: $DESCRIPTION }
>
>
>
> (ps. monit is only returning the first line - to be revised)
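
One possible workaround for the first-line problem, as a sketch only: put a one-line summary at the top of the output so that whatever Monit does pass on is still useful. The function name summarize_errors and the wording of the summary are invented here; the only assumption about the input is that each non-empty line is one entry from the ibqueryerrors report.

```shell
# summarize_errors LABEL
# Hypothetical helper: reads a report on stdin. If the report is empty it
# prints nothing and returns 0. Otherwise it prints a one-line summary
# first, then the report itself, and returns 1.
summarize_errors() {
    label="$1"
    report=$(cat)
    if [ -z "$report" ]; then
        return 0
    fi
    # count the non-empty lines in the report
    count=$(printf '%s\n' "$report" | grep -c .)
    echo "$label: $count line(s) reported by ibqueryerrors"
    printf '%s\n' "$report"
    return 1
}
```

In the check script this would sit at the end of the pipeline, e.g. /usr/sbin/ibqueryerrors -c -s XmtWait -P0 | tail -n +2 | summarize_errors "Infiniband Fabric 0", and the script's exit status is then the function's return value, which keeps the 0/1 convention Monit expects.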
>
>
>
> On 19 June 2014 14:18, John Hearns <hearnsj at googlemail.com> wrote:
>
>> Does anyone have good tips on monitoring a cluster for Infiniband errors?
>>
>> Specifically Mellanox/OpenFabrics on an SGI cluster.
>>
>> I am thinking of running ibcheckerrors or ibqueryerrors and parsing the
>> output.
>>
>> I have Monit set up on the cluster head node
>> http://mmonit.com/monit/
>>
>> which I find quite good
>>
>> Also if individual nodes could use gmetric to report port errors as a
>> Ganglia metric I have the ganglia-alert script set up to send email if
>> ganglia values exceed set thresholds.
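
A rough sketch of the gmetric idea, untested on a real fabric. It assumes the per-port counters are readable as plain files under a directory such as /sys/class/infiniband/mlx4_0/ports/1/counters (the device name mlx4_0 is a guess and will differ per system), and that gmetric is on the PATH. The function name report_ib_counters and the ib_ metric prefix are invented for illustration.

```shell
# Command used to publish metrics; can be overridden for a dry run.
GMETRIC="${GMETRIC:-gmetric}"

# report_ib_counters DIR NAME...
# Hypothetical helper: for each counter NAME that is readable under DIR,
# publish its value as a Ganglia metric called ib_NAME.
# Counters that do not exist are skipped silently.
report_ib_counters() {
    dir="$1"
    shift
    for name in "$@"; do
        if [ -r "$dir/$name" ]; then
            value=$(cat "$dir/$name")
            "$GMETRIC" --name "ib_$name" --value "$value" --type uint32 --units errors
        fi
    done
}
```

Run from cron on each node, e.g. report_ib_counters /sys/class/infiniband/mlx4_0/ports/1/counters symbol_error link_downed, this would give ganglia-alert something to apply thresholds to. Setting GMETRIC=echo first prints the commands instead of sending anything.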
>>
>> Any ideas welcomed please.
>>
>
>