job queue/farm out suggestions
Kim Branson
bra369 at pp.molsci.csiro.au
Tue Jan 9 04:33:12 PST 2001
I've sucessfully used enfuzion (from turbo linux) for such a task. it is
able to to do all you ask, but it is not free. mind you it is an extremely
useful program,
I played around with scripts and the like with condor, but since we got
enfuzion to use all our other boxes after hours etc, running it on
my cluster made life really easy...enabling me to get on with the science.
you can get a free evaluation licence for 10 nodes for a month.
just my 2 cents...but with the value of the aussie dollar its probably
alot less
cheers
kim
______________________________________________________________________
Mr Kim Branson
Phd Student
Structral Biology
Walter and Elisa Hall Institute
Royal Parade, Parkville, Melbourne, Victoria
Ph 61 03 9662 7136
Email kim.branson at bioresi.com.au
______________________________________________________________________
On Tue, 9 Jan 2001, Robert G. Brown wrote:
> On Mon, 8 Jan 2001, Bill Comisky wrote:
>
> > hi all,
> >
> > I've been using some homegrown code for quite a while now to farm out
> > independent jobs (no interprocess communication required) to the nodes of
> > a cluster. I was starting to add features and improve scalability, and I
> > thought I would look around to see if there was something better elsewhere
> > I could use instead, maybe with some simple front end scripts.
> >
> > I'd like to find out what other "embarassingly parallel" cluster users use
> > to farm jobs out. My jobs are the fitness evaluations of the population
> > of a single generation of a genetic algorithm optimization.
>
> I've actually worked on (curiously enough) parallelizing the same
> problem. There is a short article on this on brahma
> (www.phy.duke.edu/brahma). I used PVM in a master/slave programming
> paradigm -- this handles (or can be made to handle) the job control
> features and even load balancing but not the partitioning or shared
> usage problems.
>
> However, I've also used scripts to accomplish the same kind of work.
> I'd recommend using perl instead of /bin/sh -- if the genetic algorithm
> you are using is written to be incredibly coarse grained (as you say,
> embarrassingly parallel with generations managed via single job
> submissions) this is fine. In my case, the granularity was not so
> favorable and I actually had to work some to get decent parallel gain
> (the fitness evaluation didn't necessarily take a long time wrt the time
> required to partition the work, farm it out, and collect the results to
> ready the next generation).
>
> Last suggestion -- consider e.g. Mosix and/or possibly a tool like
> Condor for the load balancing and partitioning. You'll probably have to
> do some code crafting to really build in all the features, especially
> node failure tolerance. I plan to work on this myself over the course
> of the spring.
>
> rgb
>
> >
> > Some of the features I would like are:
> >
> > - some way to know when a particular set of jobs (one generation of
> > individuals in my case) is done
> >
> > - a way to partition the cluster so that one user gets nodes 1:N, and
> > another user gets nodes N+1:M.
> >
> > - allow multiple users to share the entire cluster, while limiting the
> > runs started on each node to that nodes memory/processor limitations.
> >
> > - tolerance of node failures (node goes down) or job failures (run hangs)
> >
> > - an open source/free solution would be preferred
> >
> > - process migration might be interesting to try out, but I don't think it
> > is essential for what I'm doing.
> >
> > Right now I use some scripts and rsh, and I redirect the program input
> > through rsh and either redirect the output through rsh or write remotely
> > to an NFS mounted directory. Do the queueing systems you use require NFS?
> >
> > I've started looking at GNU queue, anyone have experience with it? Is
> > there a way to tell if a set of jobs has been completed (would you have to
> > collect PID's and check the process list continually?).
> >
> > MOSIX sounds interesting, but it sounds like I would still need some
> > queueing routine. Meaning, if I have 1000 jobs and 20 processors, MOSIX
> > won't wait for the first 20 to finish before starting another 20,
> > right? It just tries to balance all 1000 over the cluster.
> >
> > thanks for any input!
> > bill
> >
> >
>
> --
> Robert G. Brown http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
>
>
>
>
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf
>
More information about the Beowulf
mailing list