[Beowulf] Odd SuperMicro power off issues

stephen mulcahy smulcahy at aplpi.com
Mon Dec 8 04:59:00 PST 2008


Chris Samuel wrote:
> Very occasionally we find one of our Barcelona nodes with
> a SuperMicro H8DM8-2 motherboard powered off.  IPMI reports
> it as powered down too.

Hi Chris,

We had a similar experience with one of our compute nodes - intermittent 
power-offs when running our model and absolutely nothing in the logs. I 
modified Ganglia to track voltage and temp in an effort to see if 
anything unusual happened to those beforehand but there were no 
discernible trends.
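
For anyone wanting to try the same kind of tracking, here is a rough 
sketch of one way to feed voltage and temperature readings into Ganglia. 
It is not the modification described above: it assumes the node's BMC is 
readable with `ipmitool sensor` and that Ganglia's `gmetric` command is 
installed. The function names and the `ipmi_` metric prefix are made up 
for illustration.

```python
import re
import subprocess


def parse_sensor_output(text, units=("Volts", "degrees C")):
    """Parse pipe-delimited `ipmitool sensor` output.

    Returns (name, value, unit) tuples for numeric readings whose unit
    is in `units`. Readings such as "na" or discrete hex values are skipped.
    """
    readings = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3:
            continue
        name, raw_value, unit = fields[0], fields[1], fields[2]
        if unit not in units:
            continue
        try:
            value = float(raw_value)
        except ValueError:
            continue  # e.g. "na" when the sensor has no reading
        readings.append((name, value, unit))
    return readings


def metric_name(sensor_name):
    """Turn a sensor name like 'CPU1 Temp' into 'ipmi_cpu1_temp' (hypothetical naming scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "_", sensor_name.lower()).strip("_")
    return "ipmi_" + slug


def gmetric_command(name, value, unit):
    """Build the gmetric argument list for one reading."""
    return ["gmetric", "--name", metric_name(name), "--value", str(value),
            "--type", "float", "--units", unit]


def publish_once():
    """Read the sensors once and publish each reading to Ganglia."""
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True,
                         text=True, check=True).stdout
    for name, value, unit in parse_sensor_output(out):
        subprocess.run(gmetric_command(name, value, unit), check=True)
```

Calling `publish_once()` from cron every minute or so would give a 
per-node history to look back over after the next unexplained power-off.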

I ran memtest86+ a number of times on the problem node and neither it 
nor mcelog showed any problems.

Subsequent to that, I found a BIOS upgrade for those systems which 
included an Opteron microcode update to fix an AMD processor erratum 
(sp?) - I can dig out the details if the specific problem is of interest.

Around the same time, we finally started to see memory errors, so we 
also replaced the bad memory in the system.

Unfortunately I can't tell you which of the two was responsible for 
fixing the problem. My understanding is that Fluent is quite memory and 
I/O intensive - do you run other equally intensive models without seeing 
the failure?

Anyways, in summary - if you're totally stumped - try swapping out the 
memory and/or rolling to the latest firmware and see if that improves 
the stability.

-stephen

-- 
Stephen Mulcahy       Applepie Solutions Ltd.      http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)


