[Beowulf] non-stop computing
Christopher Samuel
samuel at unimelb.edu.au
Tue Oct 25 21:07:05 PDT 2016
On 26/10/16 14:45, John Hanks wrote:
> I'd suggest making NFS mounts hard, so processes can recover from an NFS
> server reboot.
...plus set an explicit fsid for each export on the NFS server side, so
the exports come back with the same filesystem ID each time and clients'
existing file handles stay valid...
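
As a rough way to audit both suggestions above, here is a minimal Python
sketch. It assumes mount information in /proc/mounts format and a simple
/etc/exports layout (one export per line, options in parentheses, no
quoted paths, no line continuations, no "-" default option lists). The
function names soft_nfs_mounts and exports_missing_fsid are made up for
illustration and are not part of any NFS tooling.

```python
import re


def soft_nfs_mounts(mounts_text):
    """Return mount points of NFS filesystems mounted with 'soft'.

    Expects text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    soft = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        options = fields[3].split(",")
        if fs_type in ("nfs", "nfs4") and "soft" in options:
            soft.append(mount_point)
    return soft


def exports_missing_fsid(exports_text):
    """Return (path, client_spec) pairs that have no explicit fsid= option.

    Assumes a simplified /etc/exports syntax (see the note above).
    """
    missing = []
    for line in exports_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        path, *clients = line.split()
        for client in clients:
            match = re.search(r"\(([^)]*)\)", client)
            options = match.group(1).split(",") if match else []
            if not any(opt.startswith("fsid=") for opt in options):
                missing.append((path, client))
    return missing
```

On a real system you could feed these open("/proc/mounts").read() on a
client and open("/etc/exports").read() on the server. Anything the first
function reports is a candidate for remounting hard, and anything the
second reports is an export whose fsid is left for the server to derive.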
PS: I endorse what John said (now that I've finished laughing). I'd
suggest making sure you've at least got ECC memory and RAID, though, as
memory and disks are the two parts that can go bad. When we had clusters
with disks in the compute nodes, the disks were the most frequent
failures; now that we run diskless nodes, it's memory DIMMs. :-)
All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci