Hangs
Louis J. Romero
louisr at aspsys.com
Thu Aug 1 07:49:04 PDT 2002
Hi,
I have two suggestions that might help.
The first is to put a head on the machine so that you are not blind
when the system hangs. If that is not an option, set up a cron job that
runs every 5 minutes or so (tune the interval so that you get data at or
around the time of the hang). I would suggest dumping a long listing of the
process table (e.g. ps -wef --forest), socket info via netstat -a or
socklist, NFS data using nfsstat, disk stats using iostat, swap using free,
and mounted file systems using df (note: put this command last, because if
NFS is the culprit, df will hang). If you happen to know which process is
hanging, run it under strace with the output going to a file (strace -o).
That may slow things down a bit, but you're in triage at this point.
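
A rough sketch of such a cron job follows. The script name, the log
directory and the "run" argument are my own placeholders, not anything
standard, and I added -n to netstat so it does not block on name lookups.
Tools that are not installed are skipped rather than treated as errors.

```shell
#!/bin/sh
# Hypothetical snapshot script; LOGDIR and the file naming are assumptions.
LOGDIR="${LOGDIR:-/var/log/hang-snapshots}"

# Run one command under a labelled header; skip it if the tool is missing.
section() {
    echo "===== $* ====="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(skipped: $1 not found)"
    fi
    echo
}

# Write one timestamped snapshot file and print its path.
snapshot() {
    mkdir -p "$LOGDIR" || return 1
    out="$LOGDIR/snapshot-$(date +%Y%m%d-%H%M%S).log"
    {
        section date
        section ps -wef --forest
        section netstat -an
        section nfsstat
        section iostat
        section free
        # df goes last: if NFS is the culprit it may never return.
        section df
    } >"$out" 2>&1
    # Flush to disk so the data is more likely to survive a hang.
    sync
    echo "$out"
}

# Only take a snapshot when invoked as: hang-snapshot.sh run
if [ "${1:-}" = "run" ]; then
    snapshot
fi
```

Assuming it is saved as /usr/local/sbin/hang-snapshot.sh (again, a made-up
path), a crontab line such as
    */5 * * * * /usr/local/sbin/hang-snapshot.sh run
would take a snapshot every 5 minutes. Keep the log directory on a local
disk, not on NFS.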
As an aside, why are the NFS mount points hard? With the hard mount option,
an NFS problem can cause a machine to hang. Depending on the load that the
clients are putting on the server, increasing the number of NFS daemons may
relieve a bottleneck there. Conversely, too many daemons can degrade
performance.
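
One way to judge whether the server has enough nfsd threads is the "th"
line in /proc/net/rpc/nfsd on the NFS server. As far as I know, on kernels
of this vintage the first number is the thread count and the second is how
many times all threads were busy at once; treat that reading as an
assumption and check it against your kernel. The function name below is
made up for illustration.

```shell
#!/bin/sh
# Print the nfsd thread count and how often all threads were busy at once.
# The stats file can be passed as $1; the default is the Linux procfs file.
nfsd_thread_report() {
    stats="${1:-/proc/net/rpc/nfsd}"
    if [ ! -r "$stats" ]; then
        echo "cannot read $stats (is nfsd running on this host?)" >&2
        return 1
    fi
    # Assumed layout: th <threads> <times-all-busy> <histogram...>
    awk '$1 == "th" { printf "threads=%s all_busy_count=%s\n", $2, $3 }' "$stats"
}
```

If the second number keeps climbing under load, more threads may help. The
count is the numeric argument to rpc.nfsd (e.g. rpc.nfsd 16); where your
init script sets it depends on the distribution.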
Good luck...
Louis
On Wednesday 31 July 2002 02:25 pm, Jean-Christophe Ducom wrote:
> The nodes of our cluster are:
> Dell Workstation Dual Xeon 1.7GHz 1GB RAM, RedHat 7.2 running 2.4.18
> patched for IRQ balancing, Syskonnect SK9D21 GigEthernet
>
> The cluster is heavily used for mpi programs using MPICH 1.2.4
> Each node mounts NFS directories w/ the following options:
> rw,nosuid,nodev,hard,intr,rsize=8192,wsize=8192
>
> ACPI is installed to overcome some APM issues w/ the poweroff command on
> SMP machines.
>
> But some nodes hang sometimes for unknown reasons. They don't crash
> though (they would reboot anyway: cat /proc/sys/kernel/panic -> 0 ).
> There is no way to connect to them.
> I installed serial console on some nodes (cf. my previous email about
> remote serial console). When I connect thru the serial console to a hung
> node, I can't even reboot the node BUT minicom shows that the machine is
> ONLINE.
> It happens most of the time when MPI programs establish communications
> between nodes.
> What's going on? NFS hangs (but nothing in /var/log/messages or the
> other logs)? ACPI problem? Does the console die? Switch issues?
>
> Any ideas?
>
> Thanks
>
> JC
--
Louis J. Romero
Email: louisr at aspsys.com
Local: (303) 431-4606
Aspen Systems, Inc.
3900 Youngfield Street
Wheat Ridge, Co 80033
Toll Free: (800) 992-9242
Fax: (303) 431-7196
URL: http://www.aspsys.com