[Beowulf] Understanding environments and libraries caching on a beowulf cluster
leo camilo
lhcamilo at gmail.com
Tue Jun 28 15:44:03 UTC 2022
Thanks, Robert,
You have given me a lot to think about.
Most of our nodes have around 250 GB SSDs that are largely unpopulated, so I
am guessing there is no harm in just installing the libraries on every node
with ansible. Also, our department has a wealth of old HDDs we could
repurpose.
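In fact, something as simple as an ad-hoc ansible run ought to do it (the
"nodes" inventory group and the package name below are just placeholders):

   # install the missing library on every worker node in one shot
   ansible nodes -b -m apt -a "name=libgsl-dev state=present update_cache=yes"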
My time indeed has a cost, hence I will favour a "cheap and dirty" solution
to get the ball rolling and try something fancy later.
I was intrigued by your tip about LXC, though. I have used LXC locally on
my workstation for the longest time, but I had not considered running it
in a Beowulf cluster context; that would be a neat thing to investigate.
Anyway, thanks for the tips.
Cheers
On Tue, 28 Jun 2022 at 16:01, Robert G. Brown <rgb at phy.duke.edu> wrote:
> On Tue, 28 Jun 2022, leo camilo wrote:
>
> > I see, so if I understand it correctly, I have to make sure that there
> > is a copy of the libraries, environments and modules on every
> > computational node?
> >
> > I am wondering if I can get around it by using nfs.
>
> The answer is yes, although it is a bit of a pain.
>
> Two ways to proceed:
>
> Export the library directory(s) from your head node -- at least /usr/lib
> (this assumes, BTW, that the head node and worker nodes are running
> exactly the same version of linux updated to exactly the same level --
> especially the kernel). Mount it on an alternative path, e.g.
> /usr/local/lib or /usr/work/lib, during/after boot. Learn how to
> use ldconfig and run it to teach the dynamic linker how to find the libraries
> there. This approach is simple in that you don't need to worry about
> whether or not any particular library is there or isn't there -- you are
> provisioning "everything" present on your head node, so if it works one
> place it works everywhere else.
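>
> A minimal sketch of that first approach, with purely illustrative
> addresses and paths:
>
>    # on the head node: export the library tree read-only in /etc/exports
>    /usr/lib        192.168.1.0/24(ro,no_subtree_check)
>    # then reload the export table: exportfs -ra
>
>    # on each worker node
>    mkdir -p /usr/work/lib
>    mount -t nfs head:/usr/lib /usr/work/lib
>    echo /usr/work/lib > /etc/ld.so.conf.d/headnode-libs.conf
>    ldconfig    # rebuild the loader cache so the new path is searched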
>
> The second way may be easier if you are already exporting e.g. a home
> directory or work directory, and only need to provision a few
> applications. Use Unix tools (specifically ldd) to figure out what
> libraries are needed for your application. Put copies of those
> libraries in a "personal" link library directory -- e.g.
> /home/joeuser/lib -- and again, use ldconfig as part of your startup/login
> script(s) to teach the dynamic linker where to find them when you run your
> application.
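>
> Sketched out (library names, sonames and paths below are purely
> illustrative, and if you lack root on the nodes, LD_LIBRARY_PATH in a
> login or job script does the same job as the ldconfig step):
>
>    # see exactly which shared objects the binary needs
>    ldd ./myapp
>
>    # copy the needed libraries into a directory on the shared export
>    mkdir -p /home/joeuser/lib
>    cp /usr/lib/x86_64-linux-gnu/libgsl.so.27* /home/joeuser/lib/
>
>    # in ~/.bashrc or the job script, so the loader finds them on any node
>    export LD_LIBRARY_PATH=/home/joeuser/lib:$LD_LIBRARY_PATH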
>
> A third way is to look into containers -- https://linuxcontainers.org/
> -- which allow you to build "containerized" binaries that contain all of
> their dependencies and in principle run across DIFFERENT linuces,
> kernels, update levels, etc. The idea there is a containerized app
> doesn't depend directly on the parent operating system "at all" beyond
> running on the right CPU. An immediate advantage is that if somebody
> decides to change or drop some key library in the future, you don't
> care. It's containerized. I have only played with them a bit, mind
> you, but they are supposedly pretty robust and suitable for commercial
> cloud apps etc so they should be fine for you too.
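>
> Purely as an illustration (these commands assume the LXD front end to
> LXC and an Ubuntu image; the container and binary names are made up):
>
>    lxc launch ubuntu:22.04 gslbox        # new container from an image
>    lxc exec gslbox -- apt-get update
>    lxc exec gslbox -- apt-get install -y libgsl-dev
>    lxc exec gslbox -- /root/myapp        # runs against the container's own libraries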
>
> A fourth way -- and this would be my own preference -- would be to just
> install the requisite libraries on the worker nodes (all of which should
> be automagically updated from the primary repo streams anyway to remain
> consistent and up to date). Hard storage is sooooo cheap. You could
> put the entire functional part of the OS including all libraries on
> every system for $10 to $20 via a USB thumb drive, assuming that the
> worker nodes don't ALREADY have enormous amounts of unused space. Speed
> is not likely to be a major issue here as the OS will cache the
> libraries after the initial load assuming that your nodes are
> well-provisioned with RAM, and it has to load the application itself
> anyway. I can't think of a good reason any more -- with TB hard drives
> very nearly the SMALLEST ones readily available -- to limit what you put on
> a worker node unless you are trying to run it entirely diskless (and for
> the cost, why would you want to do that?).
>
> Remember, YOUR TIME has a cost. You have 7 worker nodes. Putting a 128
> GB hard drive on the USB port of each will cost you (say) $15 each, for
> a total of $105 -- assuming that somehow the nodes currently have only 8
> GB and can't easily hold the missing libraries "permanently". I did
> beowulfery back in the day when storage WAS expensive, and ran entirely
> diskless nodes in many cases that booted from the network, and I assure
> you, it is a pain in the ass and pointless when storage is less than
> $0.10/GB. There is simply no point in installing "limited" worker
> nodes, picking and choosing what libraries to include or trying to
> assemble an OS image that lacks e.g. GUI support just because you won't
> be putting a monitor and keyboard on them. Just come up with a standard
> post-install script to run after you do the primary OS install to e.g.
> "dnf -y install gsl" to add in the Gnu scientific library or whatever
> and ensure that the nodes are all updated at the same time for
> consistency, then forget it.
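>
> That post-install script can be as dumb as a few lines of shell (package
> names illustrative; on Ubuntu nodes apt-get would take the place of dnf):
>
>    #!/bin/sh
>    # run once on each freshly installed worker node
>    dnf -y update
>    dnf -y install gsl gsl-devel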
>
> rgb
>
> >
> > On Tue, 28 Jun 2022 at 11:42, Richard <ejb at trick-1.net> wrote:
> > For what it's worth, I use an easy8 licensed Bright cluster (now
> > part of NVidia) and I continually find I need to make sure the
> > module packages, environment variables etc are installed/set in
> > the images that are deployed to the nodes
> >
> > Bright supports slurm, k8, jupyter and a lot more
> >
> > Richard
> >
> > Sent from my iPhone
> >
> > > On 28 Jun 2022, at 19:32, leo camilo <lhcamilo at gmail.com>
> > wrote:
> > >
> > >
> > > # Background
> > >
> > > So, I am building this small beowulf cluster for my
> > department. I have it running on ubuntu servers: a front node
> > and, at the moment, 7 x 16-core nodes. I have installed SLURM as
> > the scheduler and I have been procrastinating on setting up
> > environment modules.
> > >
> > > In any case, I ran into this particular scenario where I was
> > trying to schedule a few jobs in slurm, but for some reason
> > slurm would not find a library (libgsl). It was in fact
> > installed on the frontnode: I checked the path with ldd and I
> > even exported LD_LIBRARY_PATH.
> > >
> > > Oddly, if I ran the application directly on the frontnode, it
> > would work fine.
> > >
> > > Then it occurred to me that the computational nodes might not
> > have this library, and sure enough, once I installed it on the
> > nodes the problem went away.
> > >
> > > # Question:
> > >
> > > So here is the question: is there a way to cache the
> > frontnode's libraries and environment onto the computational
> > nodes when a slurm job is created?
> > >
> > > Will environment modules do that? If so, how?
> > >
> > > Thanks in advance,
> > >
> > > Cheers
>
> Robert G. Brown http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
>
>
>