[Beowulf] Tight MPICH2 Integration with SGE
Sangamesh B
forum.san at gmail.com
Fri Jan 25 06:41:20 PST 2008
Hi all,
I'm doing the tight integration of MPICH2 (not MPICH) with SGE on a
cluster of dual-processor, dual-core AMD64 Opteron nodes.
I followed the Sun document located at:
http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
The document explains the following three kinds of TI:
Tight Integration (TI) using Process Manager (PM): gforker
TI using PM: SMPD – Daemonless
TI using PM: SMPD – Daemon-based
I did the TI with gforker and tested it successfully,
but failed to do the TI with daemonless SMPD.
Let me explain what I did.
I installed MPICH2 with the smpd configuration.
SGE is installed at /opt/gridengine.
I created an MPICH2-SM folder in /opt/gridengine/mpi by referring to the
following lines from the document:
start_proc_args /usr/sge/mpich2_smpd_rsh/startmpich2.sh -catch_rsh $pe_hostfile
stop_proc_args /usr/sge/mpich2_smpd_rsh/stopmpich2.sh
I copied startmpi.sh and stopmpi.sh from /opt/gridengine/mpi to the
/opt/gridengine/mpi/MPICH2-SM directory, because the document does not say
what these scripts should contain.
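For reference, the $pe_hostfile argument in the document's start_proc_args line is the file a start script would read to build the MPI machine file. Below is a minimal sketch of that step, assuming each pe_hostfile line has the form "<hostname> <slots> <queue> <processor range>". The function name pe_hostfile_to_machines is hypothetical and is not taken from the Sun document or from startmpi.sh.

```shell
#!/bin/sh
# Sketch only: expand an SGE pe_hostfile into a machine file with one
# hostname per granted slot.
# Assumption: each input line is "<hostname> <slots> <queue> <processor range>".
# pe_hostfile_to_machines is a hypothetical helper name.
pe_hostfile_to_machines() {
    # $1 = path to the pe_hostfile passed in via start_proc_args
    while read -r host nslots _rest; do
        # Skip blank lines
        [ -n "$host" ] || continue
        i=0
        # Print the hostname once for every slot granted on that host
        while [ "$i" -lt "$nslots" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done < "$1"
}
```

Inside a start script this could be called as pe_hostfile_to_machines "$1" > "$TMPDIR/machines", provided the PE definition actually passes $pe_hostfile as an argument.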
Using qmon, I created the MPICH2-SM PE:
# qconf -sp MPICH2-SM
pe_name MPICH2-SM
slots 999
user_lists rootuserset
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/MPICH2-SM/startmpich2sm.sh
stop_proc_args /opt/gridengine/mpi/MPICH2-SM/stopmpich2sm.sh
allocation_rule $round_robin
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
I added this PE to the default queue, all.q,
and then submitted the job with the following script:
# cat sgeSM.sh
#!/bin/sh
#$ -cwd
#$ -pe MPICH2-SM 4
#$ -e msge2.Err
#$ -o msge2.out
#$ -v MPI_HOME=/opt/MPI_LIBS/MPICH2-GNU/MPICH2-SM/bin
#$ -v MEME_DIRECTORY=/opt/MEME-MAX
$MPI_HOME/mpiexec -np 4 -machinefile /root/MFM /opt/MEME-MAX/bin/meme_p \
    /opt/MEME-MAX/NCCS/samevivo_sample.txt -dna -mod tcm -nmotifs 10 -nsites 100 \
    -minw 5 -maxw 50 -revcomp -text -maxsize 200500
It gave the following error:
# cat msge2.Err
startmpich2sm.sh: got wrong number of arguments
rm: cannot remove `/tmp/92.1.all.q/machines': No such file or directory
rm: cannot remove `/tmp/92.1.all.q/rsh': No such file or directory
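The first error line looks like the output of an argument-count check in the start script. Below is a minimal sketch of such a check, assuming the copied script expects exactly one argument (the pe_hostfile path), optionally preceded by -catch_rsh. The function name check_start_args is hypothetical, and I have not verified that the actual script is written this way.

```shell
#!/bin/sh
# Sketch only: an argument check that would print the first error line.
# Assumption: the start script wants exactly one argument, the path to
# the pe_hostfile, optionally preceded by the -catch_rsh flag.
# check_start_args is a hypothetical name.
check_start_args() {
    me="startmpich2sm.sh"
    # Consume the optional -catch_rsh flag
    if [ "$1" = "-catch_rsh" ]; then
        shift
    fi
    # Exactly one argument must remain: the pe_hostfile path
    if [ $# -ne 1 ]; then
        echo "$me: got wrong number of arguments" >&2
        return 1
    fi
    return 0
}
```

Under that assumption, a start_proc_args entry with no arguments, as in the PE definition above, would fail this check, while one ending in "-catch_rsh $pe_hostfile", as in the lines quoted from the document, would pass. The two rm errors would then follow, because the machines and rsh files were never created in the job's temporary directory.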
I guess the problem is in the scripts startmpich2sm.sh and
stopmpich2sm.sh.
Can anyone guide me to resolve this issue?
Thanks & Regards,
Sangamesh
HPC Engineer