[Beowulf]
Reuti
reuti at staff.uni-marburg.de
Wed Mar 23 15:25:30 PST 2005
Hi,
I'd suggest moving over to the SGE users list at:
http://gridengine.sunsource.net/servlets/ProjectMailingListList
But anyway, let's sort things out:
Quoting William Burke <wburke999 at msn.com>:
> I can't get PE to work on a 50 node class II Beowulf. It has a front-end
> Sunfire v40 (qmaster host) and 49 Sunfire v20s (execution hosts) running
> Linux configured to communicate data over Myrinet using MPICH-GM version
> 1.26.14a.
Although there is a dedicated myrinet directory, you can also try the files
in the plain mpi directory instead.
> These are the requirements of the N1GE environment to handle:
>
> 1. Serial type jobs for pre-processing the data - average runtime 15
> minutes.
> 2. Output is pipelined into parallel processing jobs - range of runtime
> 1- 6 hours.
> 3. Concurrently running is post-processing serial jobs.
>
> I have setup a Parallel Environment called mpich-gm and a straight-forward
> FIFO scheduling schema for testing. When I submit parallel jobs they hang in
> limbo in a 'qw' state pending submission. I am not sure why the scheduler
> does not see jobs that I submit.
>
>
>
> I used the myrinet mpich template located $SGE_ROOT/<sge_cell>/mpi/myrinet
> directory to configure the pe (parallel environment) plus I copied the
> sge_mpirun script to the $SGE_ROOT/<sge_cell>/bin directory. I configured
> a Production.q queue that runs only parallel jobs. As a last sanity check I
> ran a trace on the scheduler, submitted a simple parallel job, and this is
> the results that I got from the logs:
Can you please give more details of your queue and PE setup, i.e. the output
of "qconf -sq <queue_name>" and "qconf -sp <pe_name>"?
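Something along these lines would collect it in one go. This is only a sketch: the queue and PE names are the ones from your post (Production.q, mpich-gm), and the function name dump_sge_setup is made up for illustration.

```shell
#!/bin/sh
# Sketch: collect the queue and PE definitions to paste into a reply.
# QUEUE and PE default to the names mentioned in the original post.
QUEUE=${QUEUE:-Production.q}
PE=${PE:-mpich-gm}

# dump_sge_setup is a hypothetical helper name, not part of SGE.
dump_sge_setup() {
    echo "### qconf -sq $QUEUE"
    qconf -sq "$QUEUE"      # show the cluster queue configuration
    echo "### qconf -sp $PE"
    qconf -sp "$PE"         # show the parallel environment configuration
    echo "### qconf -spl"
    qconf -spl              # list all defined parallel environments
}

# Only run where the SGE client tools are actually installed.
if command -v qconf >/dev/null 2>&1; then
    dump_sge_setup
else
    echo "qconf not found; source the SGE settings file first" >&2
fi
```

Redirecting the output to a file (for example "sh dump.sh > setup.txt") gives something that can be attached to a mail.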
> JOB RUN Window
>
> [wems@wems examples]$ qsub -now y -pe mpich-gm 1-4 -b y hello++
> Your job 277 ("hello++") has been submitted.
> Waiting for immediate job to be scheduled.
>
> Your qsub request could not be scheduled, try again later.
> [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
> Your job 278 ("hello++") has been submitted.
> [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
> Your job 279 ("hello++") has been submitted.
You can't start a parallel job this way, as no mpirun is used. Do you get the
same behavior when you submit the script you mentioned (assuming it calls
mpirun -np $NSLOTS ...)?
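For comparison, a wrapper job script could look roughly like the one generated below. Treat it as a sketch under assumptions: that the PE start script leaves a host list in $TMPDIR/machines, that the mpirun found in PATH is the MPICH-GM one, and that it accepts -machinefile. The file name hello_job.sh is arbitrary.

```shell
#!/bin/sh
# Sketch: write a wrapper job script and submit it instead of the bare binary.
# Assumptions (adjust to your installation):
#   - the PE start script writes the granted hosts to $TMPDIR/machines
#   - "mpirun" in PATH is the MPICH-GM mpirun and accepts -machinefile
cat > hello_job.sh <<'EOF'
#!/bin/sh
#$ -N hello
#$ -cwd
#$ -pe mpich-gm 1-4
# NSLOTS is set by SGE to the number of slots actually granted.
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./hello++
EOF

# Submit only where qsub exists, i.e. on a submit host.
if command -v qsub >/dev/null 2>&1; then
    qsub hello_job.sh
fi
```

The PE request sits in the script as a "#$" line, so a plain "qsub hello_job.sh" is enough; note that these embedded options are read from scripts, not from binaries submitted with -b y.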
> This is the 2nd window SCHEDULER LOG
>
> [root@wems bin]# qconf -tsm
> [root@wems bin]# qconf -tsm
> [root@wems bin]# cat /WEMS/grid/default/common/schedd_runlog
> Wed Mar 23 06:08:55 2005|-------------START-SCHEDULER-RUN-------------
> Wed Mar 23 06:08:55 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:08:55 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:08:55 2005|queues dropped because they are temporarily not available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com
> Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on
> Wed Mar 23 06:08:55 2005|--------------STOP-SCHEDULER-RUN-------------
> Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN-------------
> Wed Mar 23 06:11:37 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:11:37 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:11:37 2005|queues dropped because they are temporarily not available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com
> Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on
> Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN-------------
>
> [root@wems bin]# qstat
> job-ID  prior    name     user  state  submit/start at      queue  slots  ja-task-ID
> -------------------------------------------------------------------------------------
> 279     0.55500  hello++  wems  qw     03/23/2005 06:11:43         1
> [root@wems bin]#
Do you have an admin account for SGE? I'd prefer not to do anything in SGE as
root.
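Regarding the run log quoted above: both scheduler runs in it (06:08:55 and 06:11:37) are stamped earlier than the submit time of job 279 (06:11:43), so it is worth triggering "qconf -tsm" once more while the job is pending and reading the log again. To pull out just the queue instances the scheduler dropped, a small filter like the following may help. It is a sketch; the function name dropped_instances is made up, and the log path is built from the usual $SGE_ROOT/$SGE_CELL layout.

```shell
#!/bin/sh
# Sketch: list the queue instances that the scheduler dropped, as recorded
# in schedd_runlog. dropped_instances is a hypothetical helper name.
dropped_instances() {
    # $1: path to a schedd_runlog file
    # Print the quoted name from every 'queue instance "..." dropped' line,
    # once per distinct name.
    sed -n 's/.*queue instance "\([^"]*\)" dropped.*/\1/p' "$1" | sort -u
}

# Default location, assuming the usual cell layout; in the original post
# this would be /WEMS/grid/default/common/schedd_runlog.
RUNLOG=${RUNLOG:-${SGE_ROOT:-/nonexistent}/${SGE_CELL:-default}/common/schedd_runlog}

if [ -r "$RUNLOG" ]; then
    dropped_instances "$RUNLOG"
fi
```

Run on the log from your mail, this should print only the two queue instances on wems10, which matches what you wrote about that node.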
> BTW that node wems10.grid.wni.com has connectivity issues and I have not
> removed it from the cluster queue.
>
>
>
> What causes this type of problem in N1GE to return "no pending jobs to
> perform scheduling on" in the schedd_runlog even though there are available
> slots ready to take jobs?
>
> I had no problem submitting serial jobs, only the parallel jobs resulted as
> such. Are there N1GE - Myrinet issue that I am not aware of? FYI the same
> binary (hello++) runs with no problems from the command line.
If you just start hello++ directly, I think it will not run in parallel.
As for Myrinet, it is not really an issue, but you have to make a small change
to mpirun.ch_gm.pl so that all processes of a job stay in the same process
group and get killed correctly if the job is aborted:
http://gridengine.sunsource.net/howto/mpich-integration.html
> Since I generally run scripts from qsub instead of binaries I created a
> script to run the mpich executable but that yield the same result.
>
>
>
> I have an additional question regarding setting a queue.conf parameter
> called "subordinate_list". How is it read from the result of qconf -mq
> <queue_name>?
>
> Example
>
> i.e., subordinate_list low_pri.q=5,small.q.
The queue "low_pri.q" will be suspended when 5 or more slots of "<queue_name>"
are filled. The queue "small.q" will be suspended when all slots of
"<queue_name>" are filled.
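To make the reading rule concrete, here is a small sketch that spells out a subordinate_list value entry by entry. The helper name explain_subordinates is made up, and "Production.q" is used only as an example of the queue that owns the list.

```shell
#!/bin/sh
# Sketch: explain each entry of a subordinate_list value.
# explain_subordinates is a hypothetical helper, not an SGE command.
explain_subordinates() {
    # $1: subordinate_list value, e.g. "low_pri.q=5,small.q"
    # $2: name of the queue whose configuration contains the list
    list=$1
    owner=$2
    old_ifs=$IFS
    IFS=', '        # entries are separated by commas and/or blanks
    set -f          # no filename globbing while splitting
    for entry in $list; do
        case $entry in
            *=*)    # "queue=N": threshold of N filled slots
                echo "${entry%%=*}: suspended when ${entry#*=} or more slots of $owner are filled" ;;
            *)      # bare "queue": threshold is all slots
                echo "$entry: suspended when all slots of $owner are filled" ;;
        esac
    done
    set +f
    IFS=$old_ifs
}

explain_subordinates "low_pri.q=5,small.q" "Production.q"
```

For the example from your mail this prints one line per subordinated queue: low_pri.q with its threshold of 5, and small.q with the "all slots" rule.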
Cheers - Reuti