Cluster Question (fwd)
Gerardo Andres Cisneros
andres at chem.duke.edu
Tue Mar 20 08:48:27 PST 2001
Hello All,
As I have said below, I have built a very small cluster (8 nodes) running
a slightly modified version of RedHat Linux 6.2 and I'm trying to run a
parallel version of a computational chemistry program (g98).
This program uses Linda for the parallelization but I'm having problems
with it.
As stated below, the problem is that either g98 or Linda fails to kill the
processes on the slave nodes once they're done. We've looked into a bunch
of things including hardware malfunction but everything seems OK.
We have checked almost everything Dr. Brown suggested as per his
experience with PVM (included below) but we can find no problems: the
Linda conf file looks fine, the processes' UIDs do not belong to a
different user, and the daemons are running.
I was wondering if anyone out there is using Linda and/or g98 and has
encountered similar problems?
Any help is greatly appreciated.
I would also very much appreciate it if you could reply directly to me since
I'm not subscribed to this list.
Thank you very much in advance,
Best Regards,
Andres
--
G. Andres Cisneros
Department of Chemistry
Duke University
andres at chem.duke.edu
---------- Forwarded message ----------
Date: Tue, 20 Mar 2001 11:11:28 -0500 (EST)
From: Robert G. Brown <rgb at phy.duke.edu>
To: Gerardo Andres Cisneros <andres at chem.duke.edu>
Subject: Re: Cluster Question
On Tue, 20 Mar 2001, Gerardo Andres Cisneros wrote:
>
> Dear Prof. Brown,
>
> I'm a grad student working for Dr. W. Yang at the Chemistry Dept.
>
> We have built a beowulf cluster using 8 Dell PCs donated by Intel. I
> installed Dulug Linux 6.2 on all of them and I am now trying to run some
> programs in parallel.
>
> Specifically I'm trying to run Gaussian98 on it so I had to download Linda
> which is basically software based shared memory (virtual shared memory).
>
> I was wondering if you had ever used this software and if so if I could
> get some pointers.
Unfortunately I've never used G98 or Linda either one, so I don't know
how helpful I can be. I'd recommend posting the problem to the beowulf
list though, as there are probably folks out there who have used the two
together.
> My problem is that every time I try to run a big job on more than one node
> the program crashes before finishing. The program is supposed to kill
> the processes on the slave nodes but it doesn't do it so they just sit on
> the slave nodes occupying memory until eventually one of the nodes just
> runs out of memory and the process dies.
>
> If I do a run with a veryverbose flag for linda I get a bunch of "Killed
> by signal 15" messages stating that it killed the remote processes when
> they're done but it doesn't actually do it.
>
> A message to CCL produced a bunch of replies telling me to upgrade the
> kernel which I did (from 2.2.16-3 to 2.2.17-4) but still no go.
>
> Somebody else told me that he once had a similar problem but it was
> caused by bad grounding of his network cards so static electricity was
> building up and crashing his machines but I doubt that is the case here
> since the network card is integrated into the motherboard. We have 8 Dell
> Optiplex (I'm sorry I didn't mention that before).
>
> I would very much appreciate any suggestions you might have on this.
I doubt very much that it is static electricity, and our Dells (probably
from the same batch as yours) are rock stable under load and running a
nearly identical setup. Besides, I can only assume that all the chassis
are plugged into properly grounded three prong plugs and sit on a rack
of some sort as well. I've never had any instabilities of any systems
anywhere that I could identify with static electricity although perhaps
you might if you had some sort of active source of high voltage nearby
(a Van de Graaff accelerator, a Tesla coil, or some such). Ordinarily
the ground wire of the power cable is connected to the chassis and
absolutely prevents the buildup of static on connected components.
Besides, this would be more likely to kill your whole computer than to
just shut down one particular process. You haven't had any problems
running e.g. NFS have you? Or connecting and transferring large files
via scp? Why would a hardware problem pick on G98 with this whole raft
of things to choose from?
A problem in Linda seems much, much more likely especially given that it
is failing to successfully kill the remote processes when it claims
that it is doing so. I've encountered the identical problem in recent
versions of PVM -- the pvm_kill command is there, but I'll be damned if
I could ever make it actually kill off the slaves in a master-slave
calculation. Curiously, they could be killed off from the daemon
command interface, so PVM had the capability -- there was just some sort
of bug in the command implementation.
I wish I could be of some help to you as you try to figure this out, but
there isn't a lot I can think of trying without any hands on experience
with Linda/G98. One thing might be permissions -- perhaps the remote
slaves are being spawned but end up belonging to a UID that doesn't
correspond with the source of the kill signal so that the kill signal is
ignored, for example. If you can, look in the /var/log/messages on the
slave nodes and see what kinds of things are being logged at the time of
a kill. Look in the slave sources and see what the signal handler is
doing. Snoop the net and verify that there are packets being sent that
actually contain the kill signal. Run a remote host monitor tool (e.g.
procstatd and watchman from the brahma site in physics) on the nodes and
watch e.g. their memory consumption and network and CPU load -- is the
problem a simple memory leak somewhere?
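The UID and memory checks suggested above could be sketched roughly as
follows. This is only a minimal illustration, assuming a Linux /proc
filesystem on each slave node; the function names and the "g98" process
name prefix are hypothetical placeholders, not anything taken from Linda
or G98. It lists leftover processes by name and reports their owning UID
and resident memory, so one can see whether they belong to the user who
sends the kill signal and whether their memory use keeps growing.

```python
import os


def parse_status(text):
    """Extract name, real UID and resident memory (kB) from the text of
    a /proc/<pid>/status file."""
    info = {"name": None, "uid": None, "rss_kb": 0}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields = value.split()
        if not fields:
            continue
        if key == "Name":
            info["name"] = value.strip()
        elif key == "Uid":
            info["uid"] = int(fields[0])      # first field is the real UID
        elif key == "VmRSS":
            info["rss_kb"] = int(fields[0])   # kernel reports this in kB
    return info


def find_leftovers(name_prefix, expected_uid, proc_root="/proc"):
    """Return (pid, info, uid_matches) for every process whose name starts
    with name_prefix. name_prefix is an assumption chosen by the caller."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                info = parse_status(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        if info["name"] and info["name"].startswith(name_prefix):
            found.append((int(entry), info, info["uid"] == expected_uid))
    return found


if __name__ == "__main__":
    # "g98" is a hypothetical prefix; substitute the real process name.
    for pid, info, same_uid in find_leftovers("g98", os.getuid()):
        owner = "same UID" if same_uid else "DIFFERENT UID %s" % info["uid"]
        print("pid %d  %s  rss=%d kB  %s"
              % (pid, info["name"], info["rss_kb"], owner))
```

Run on a slave node as the user who launched the job, once just after the
job claims to have killed its workers and again a few minutes later: any
process flagged with a different UID would point at a permissions problem,
while a steadily growing rss figure would point at a leak.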
Still, I think your best bet is the beowulf list itself. Surely
somebody on it can help you better than I am able to.
rgb
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu