[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
    Greg Lindahl 
    lindahl at pbm.com
       
    Fri Oct 23 11:23:17 PDT 2009
    
    
  
On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote:
> 2. Some errors are hardware precipitated. Aging, out-of-warranty
> hardware can sometimes need such a reboot compromise for
> one-off random errors.
> 
> Maybe all the "nice" clusters out there never have this issue but for
> me it is fairly common. Just confessing.

Why, exactly, are you assuming that your freezes are one-off random
errors due to aging hardware? Sounds like you're either guessing, or
you _are_ doing forensics, but aren't calling it forensics.
-- greg
    
    