[Beowulf] Weird problem with mpp-dyna
Robert G. Brown
rgb at phy.duke.edu
Wed Mar 14 09:37:25 PDT 2007
On Wed, 14 Mar 2007, Peter St. John wrote:
> I just want to mention (not being a sysadmin professionally, at all) that
> you could get exactly this result if something were assigning IP addresses
> sequentially, e.g.
> node1 = foo.bar.1
> node2 = foo.bar.2
> ...
> and something else had already assigned 13 to a public host, say, a
> webserver that is not listening on the port that MPI uses.
> I don't know anything about addressing a CPU within a multiprocessor
> machine, but if each CPU has its own IP address then it could choke this way.
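>
> A quick way to sanity-check that theory is to resolve every node name
> and eyeball which subnet each one lands on. A minimal sketch with
> getaddrinfo() (the "node%d" naming and the count of 16 are made-up
> placeholders; substitute your own):
>
> #include <stdio.h>
> #include <string.h>
> #include <sys/socket.h>
> #include <netinet/in.h>
> #include <netdb.h>
> #include <arpa/inet.h>
>
> int main(void)
> {
>     char host[64], addr[INET_ADDRSTRLEN];
>     struct addrinfo hints, *res;
>     int i;
>
>     memset(&hints, 0, sizeof(hints));
>     hints.ai_family = AF_INET;        /* IPv4 only */
>     hints.ai_socktype = SOCK_STREAM;
>
>     for (i = 1; i <= 16; i++) {       /* placeholder node names */
>         snprintf(host, sizeof(host), "node%d", i);
>         if (getaddrinfo(host, NULL, &hints, &res) != 0) {
>             printf("%s: does not resolve\n", host);
>             continue;
>         }
>         inet_ntop(AF_INET,
>                   &((struct sockaddr_in *)res->ai_addr)->sin_addr,
>                   addr, sizeof(addr));
>         printf("%s -> %s\n", host, addr);  /* look for a public subnet */
>         freeaddrinfo(res);
>     }
>     return 0;
> }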
On the same note, I'm always fond of looking for loose wires, bad
switches, or dying hardware behind a bizarrely inconsistent network
connection. Does this only happen in MPI? Or can you get oddities
using a network testing program, e.g. NetPIPE (which will let you test
raw sockets, MPI, and PVM in situ)?
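
If you want something even cruder than NetPIPE, a bare connect() probe
against each node's private address will at least tell you whether the
route and port are sane. A minimal sketch (host and port come from the
command line; an open port such as sshd's 22 makes a quick test):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netdb.h>

/* usage: ./probe <host> <port> -- reports whether a TCP connect() succeeds */
int main(int argc, char **argv)
{
    struct addrinfo hints, *res;
    int fd, rc;

    if (argc != 3) {
        fprintf(stderr, "usage: %s host port\n", argv[0]);
        return 1;
    }

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    if ((rc = getaddrinfo(argv[1], argv[2], &hints, &res)) != 0) {
        fprintf(stderr, "%s: %s\n", argv[1], gai_strerror(rc));
        return 1;
    }

    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        perror("connect");  /* a long hang here mirrors the timeouts above */
        return 1;
    }
    printf("connected to %s:%s\n", argv[1], argv[2]);
    close(fd);
    freeaddrinfo(res);
    return 0;
}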
rgb
>
> Peter
>
>
> On 3/14/07, Joshua Baker-LePain <jlb17 at duke.edu> wrote:
>>
>> I have a user trying to run a coupled structural-thermal analysis using
>> mpp-dyna (mpp971_d_7600.2.398). The underlying OS is centos-4 on x86_64
>> hardware. We use our cluster largely as a COW, so all the cluster nodes
>> have both public and private network interfaces. All MPI traffic is
>> passed on the private network.
>>
>> Running a simulation via 'mpirun -np 12' works just fine. Running the
>> same sim (on the same virtual machine, even, i.e. in the same 'lamboot'
>> session) with -np > 12 leads to the following output:
>>
>> Performing Decomposition -- Phase 3                03/12/2007 11:47:53
>>
>>
>> *** Error the number of solid elements 13881
>> defined on the thermal generation control
>> card is greater than the total number
>> of solids in the model 12984
>>
>> *** Error the number of solid elements 13929
>> defined on the thermal generation control
>> card is greater than the total number
>> of solids in the model 12985
>> connect to address $ADDRESS: Connection timed out
>> connect to address $ADDRESS: Connection timed out
>>
>> where $ADDRESS is the IP address of the *public* interface of the node on
>> which the job was launched. Has anybody seen anything like this? Any
>> ideas on why it would fail over a specific number of CPUs?
>>
>> Note that the failure is CPU-count dependent, not node-count dependent.
>> I've tried on clusters made of both dual-CPU machines and quad-CPU
>> machines, and in both cases it took 13 CPUs to trigger the failure.
>> Note also that I *do* have a user writing his own MPI code, and he has no
>> issues running on >12 CPUs.
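>>
>> (If you want to reproduce that check yourself, the classic hello-world
>> below -- just a sketch, not his code -- is enough to confirm whether
>> plain MPI jobs launch cleanly past 12 ranks:)
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int rank, size, len;
>>     char name[MPI_MAX_PROCESSOR_NAME];
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>     MPI_Get_processor_name(name, &len);
>>     /* each rank reports where it landed; try -np 13 to compare
>>        against the mpp-dyna failure threshold */
>>     printf("rank %d of %d on %s\n", rank, size, name);
>>     MPI_Finalize();
>>     return 0;
>> }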
>>
>> Thanks.
>>
>> --
>> Joshua Baker-LePain
>> Department of Biomedical Engineering
>> Duke University
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525  Email: rgb at phy.duke.edu