[Beowulf] Understanding environments and libraries caching on a beowulf cluster
leo camilo
lhcamilo at gmail.com
Tue Jun 28 15:44:03 UTC 2022
Thanks, Robert,
You have given me a lot to think about.
Most of our nodes have around 250 GB SSDs that are largely unpopulated, so I
am guessing there is no harm in just installing the libraries on every node
with ansible. Also, our department has a wealth of old HDDs we could
repurpose.
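In fact, something as simple as an ad-hoc ansible run ought to do it (the
"nodes" inventory group and the package name below are just placeholders):

   # install the missing library on every worker node in one shot
   ansible nodes -b -m apt -a "name=libgsl-dev state=present update_cache=yes"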
My time indeed has a cost, hence I will favour a "cheap and dirty" solution
to get the ball rolling and try something fancy later.
I was intrigued by your tip about LXC, though. I have used LXC locally on
my workstation for the longest time, but I had not considered running it
in a Beowulf cluster context; that would be a neat thing to investigate.
Anyway, thanks for the tips.
Cheers
On Tue, 28 Jun 2022 at 16:01, Robert G. Brown <rgb at phy.duke.edu> wrote:
> On Tue, 28 Jun 2022, leo camilo wrote:
>
> > I see, so if I understand it correctly, I have to make sure that there
> > is a copy of the libraries, environments and modules on every
> > computational node?
> >
> > I am wondering if I can get around it by using nfs.
>
> The answer is yes, although it is a bit of a pain.
>
> Two ways to proceed:
>
> Export the library directory(s) from your head node -- at least /usr/lib
> (this assumes, BTW, that the head node and worker nodes are running
> exactly the same version of linux updated to exactly the same level --
> especially the kernel). Mount it on an alternative path, e.g.
> /usr/local/lib or /usr/work/lib, during/after boot. Learn how to
> use ldconfig and run it to teach the dynamic linker how to find the libraries
> there. This approach is simple in that you don't need to worry about
> whether or not any particular library is there or isn't there -- you are
> provisioning "everything" present on your head node, so if it works one
> place it works everywhere else.
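>
> A minimal sketch of that first approach, with purely illustrative
> addresses and paths:
>
>    # on the head node: export the library tree read-only in /etc/exports
>    /usr/lib        192.168.1.0/24(ro,no_subtree_check)
>    # then reload the export table: exportfs -ra
>
>    # on each worker node
>    mkdir -p /usr/work/lib
>    mount -t nfs head:/usr/lib /usr/work/lib
>    echo /usr/work/lib > /etc/ld.so.conf.d/headnode-libs.conf
>    ldconfig    # rebuild the loader cache so the new path is searched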
>
> The second way may be easier if you are already exporting e.g. a home
> directory or work directory, and only need to provision a few
> applications. Use Unix tools (specifically ldd) to figure out what
> libraries are needed for your application. Put copies of those
> libraries in a "personal" link library directory -- e.g.
> /home/joeuser/lib -- and again, use ldconfig as part of your startup/login
> script(s) to teach the dynamic linker where to find them when you run your
> application.
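>
> Sketched out (library names, sonames and paths below are purely
> illustrative, and if you lack root on the nodes, LD_LIBRARY_PATH in a
> login or job script does the same job as the ldconfig step):
>
>    # see exactly which shared objects the binary needs
>    ldd ./myapp
>
>    # copy the needed libraries into a directory on the shared export
>    mkdir -p /home/joeuser/lib
>    cp /usr/lib/x86_64-linux-gnu/libgsl.so.27* /home/joeuser/lib/
>
>    # in ~/.bashrc or the job script, so the loader finds them on any node
>    export LD_LIBRARY_PATH=/home/joeuser/lib:$LD_LIBRARY_PATH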
>
> A third way is to look into containers -- https://linuxcontainers.org/
> -- which allow you to build "containerized" binaries that contain all of
> their dependencies and in principle run across DIFFERENT linuces,
> kernels, update levels, etc. The idea there is a containerized app
> doesn't depend directly on the parent operating system "at all" beyond
> running on the right CPU. An immediate advantage is that if somebody
> decides to change or drop some key library in the future, you don't
> care. It's containerized. I have only played with them a bit, mind
> you, but they are supposedly pretty robust and suitable for commercial
> cloud apps etc so they should be fine for you too.
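>
> Purely as an illustration (these commands assume the LXD front end to
> LXC and an Ubuntu image; the container and binary names are made up):
>
>    lxc launch ubuntu:22.04 gslbox        # new container from an image
>    lxc exec gslbox -- apt-get update
>    lxc exec gslbox -- apt-get install -y libgsl-dev
>    lxc exec gslbox -- /root/myapp        # runs against the container's own libraries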
>
> A fourth way -- and this would be my own preference -- would be to just
> install the requisite libraries on the worker nodes (all of which should
> be automagically updated from the primary repo streams anyway to remain
> consistent and up to date). Hard storage is sooooo cheap. You could
> put the entire functional part of the OS including all libraries on
> every system for $10 to $20 via a USB thumb drive, assuming that the
> worker nodes don't ALREADY have enormous amounts of unused space. Speed
> is not likely to be a major issue here as the OS will cache the
> libraries after the initial load assuming that your nodes are
> well-provisioned with RAM, and it has to load the application itself
> anyway. I can't think of a good reason any more -- with TB hard drives
> very nearly the SMALLEST ones readily available -- to limit what you put on
> a worker node unless you are trying to run it entirely diskless (and for
> the cost, why would you want to do that?).
>
> Remember, YOUR TIME has a cost. You have 7 worker nodes. Putting a 128
> GB hard drive on the USB port of each will cost you (say) $15 each, for
> a total of $105 -- assuming that somehow the nodes currently have only 8
> GB and can't easily hold the missing libraries "permanently". I did
> beowulfery back in the day when storage WAS expensive, and ran entirely
> diskless nodes in many cases that booted from the network, and I assure
> you, it is a pain in the ass and pointless when storage is less than
> $0.10/GB. There is simply no point in installing "limited" worker
> nodes, picking and choosing what libraries to include or trying to
> assemble an OS image that lacks e.g. GUI support just because you won't
> be putting a monitor and keyboard on them. Just come up with a standard
> post-install script to run after you do the primary OS install to e.g.
> "dnf -y install gsl" to add in the Gnu scientific library or whatever
> and ensure that the nodes are all updated at the same time for
> consistency, then forget it.
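>
> That post-install script can be as dumb as a few lines of shell (package
> names illustrative; on Ubuntu nodes apt-get would take the place of dnf):
>
>    #!/bin/sh
>    # run once on each freshly installed worker node
>    dnf -y update
>    dnf -y install gsl gsl-devel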
>
> rgb
>
> >
> > On Tue, 28 Jun 2022 at 11:42, Richard <ejb at trick-1.net> wrote:
> > For what it's worth, I use an easy8 licensed Bright cluster (now
> > part of NVidia) and I continually find I need to make sure the
> > module packages, environment variables etc are installed/set in
> > the images that are deployed to the nodes
> >
> > Bright supports slurm, k8, jupyter and a lot more
> >
> > Richard
> >
> > Sent from my iPhone
> >
> > > On 28 Jun 2022, at 19:32, leo camilo <lhcamilo at gmail.com>
> > wrote:
> > >
> > >
> > > # Background
> > >
> > > So, I am building this small beowulf cluster for my
> > department. I have it running on ubuntu servers: a front node
> > and, at the moment, 7 x 16-core nodes. I have installed SLURM as
> > the scheduler and I have been procrastinating on setting up
> > environment modules.
> > >
> > > In any case, I ran into this particular scenario where I was
> > trying to schedule a few jobs in slurm, but for some reason
> > slurm would not find a library (libgsl). It was in fact
> > installed on the frontnode: I checked the path with ldd and I
> > even exported LD_LIBRARY_PATH.
> > >
> > > Oddly, if I ran the application directly on the frontnode, it
> > would work fine.
> > >
> > > Then it occurred to me that the computational nodes might not
> > have this library, and sure enough, once I installed it on the
> > nodes the problem went away.
> > >
> > > # Question:
> > >
> > > So here is the question: is there a way to cache the
> > frontnode's libraries and environment onto the computational
> > nodes when a slurm job is created?
> > >
> > > Will environment modules do that? If so, how?
> > >
> > > Thanks in advance,
> > >
> > > Cheers
>
> Robert G. Brown http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
>
>
>