[Beowulf] HPC meets agentic AI tools, any advice / thoughts?
Chris Dag
dag at sonsorol.org
Tue May 19 13:24:03 UTC 2026
I'm finding the agentic stuff (Claude Code within the VS Code IDE) surprisingly
useful for HPC operations, support and troubleshooting, especially when
paired with a RAG that I stuffed full of HPC-specific documentation on
Slurm, RELION, cryoSPARC, Schrodinger, Posit Workbench and AWS ParallelCluster
details. Basically the RAG is full of docs for the stuff I do every day.
My guardrails are usually:
- The https://hpc-mcp.apps.bioteam.cloud/ custom RAG; I force my agent to
check every determination, config suggestion or parameter setting against
the RAG
- A read-only mode for running Slurm commands or otherwise poking at the
filesystem and logs; anything needing sudo requires manual human review
before proceeding
- A limited write-allowed mode so it can submit Slurm jobs as me, run a
Schrodinger job with a license request or pull a Schrodinger job failure
postmortem file out of the cluster
- When I want the agent to actually "do stuff" on an HPC system I'll have
it write out a local Ansible playbook, bash script or .md file containing
instructions. I'll run local linters and static security scanners against
the bash, Terraform and Ansible files and then have the agent stage them to
git or the HPC filesystem where I can manually review and run them. For more
complex deployments involving lots of playbooks or Terraform I'll invoke a
5-agent "review committee" to audit the files before they go anywhere.
- In all cases I manually run the Terraform, bash script or Ansible
playbook myself
- In all cases, my instructions force the LLM to query my custom hpc-docs
MCP to challenge and verify all commands. 90% of the time the agent has made
a mistake or hallucinated something involving ParallelCluster, Slurm or
Schrodinger-specific settings or commands; the RAG catches it and forces the
agent to fix it.
- I use Claude Teams so that Anthropic does not train on our data. I've got
an M4 Pro Mac mini connected to Tailscale running a local LLM that I'll
sometimes offload jobs to, but it's nowhere near as good as the frontier
models and it's super slow
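
A minimal sketch of how the three command tiers in the list above (read-only, limited write, human review for sudo and everything else) might be expressed in Python. This is an illustration under stated assumptions, not the actual gate behind the setup described here: the allowlists, the Decision names and the classify() helper are all hypothetical, and a string check like this complements real OS-level permissions rather than replacing them.

```python
import shlex
from enum import Enum


class Decision(Enum):
    READ_ONLY = "allow: read-only"
    LIMITED_WRITE = "allow: limited write"
    HUMAN_REVIEW = "hold: human review required"


# Hypothetical allowlists; every site would tune these.
READ_ONLY_COMMANDS = {"squeue", "sinfo", "sacct", "sstat", "sprio",
                      "ls", "cat", "head", "tail", "grep", "stat", "df"}
LIMITED_WRITE_COMMANDS = {"sbatch"}  # submit jobs as the user, nothing more

# Anything that chains, redirects or substitutes is too hard to reason
# about as a single command, so it is always held for a human.
SHELL_METACHARACTERS = set(";|&<>`$()\n\r")


def classify(command: str) -> Decision:
    """Map one proposed command line to a guardrail tier (default: review)."""
    if SHELL_METACHARACTERS & set(command):
        return Decision.HUMAN_REVIEW
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. an unterminated quote
        return Decision.HUMAN_REVIEW
    if not tokens:
        return Decision.HUMAN_REVIEW

    program = tokens[0]
    # scontrol can both read and modify state; only "show" is read-only.
    if program == "scontrol":
        if len(tokens) > 1 and tokens[1] == "show":
            return Decision.READ_ONLY
        return Decision.HUMAN_REVIEW
    if program in READ_ONLY_COMMANDS:
        return Decision.READ_ONLY
    if program in LIMITED_WRITE_COMMANDS:
        return Decision.LIMITED_WRITE
    # sudo, rm, systemctl and anything unknown all land here.
    return Decision.HUMAN_REVIEW


if __name__ == "__main__":
    for cmd in ["squeue -u alice",
                "scontrol show job 42",
                "sbatch job.sh",
                "sudo systemctl restart slurmctld",
                "cat /etc/slurm/slurm.conf > /tmp/copy"]:
        print(f"{classify(cmd).value:30} {cmd}")
```

The design choice worth copying is the default: anything not explicitly allowlisted, including sudo, falls through to human review, so a new or unexpected command can never be silently run.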
My hpc-docs RAG is online, with lots of technical documentation on
sources, ingest and architecture, at https://hpc-mcp.apps.bioteam.cloud/
The actual RAG content is gated behind our Okta SSO server because
I've stuffed the RAG with content that is not public (some
consulting notes, some vendor material), so it's not
useful to anyone else, but I'd love to learn what others are doing
in this space.
I admit I'm kinda terrified of what user-space people would do if they are
not paranoid and careful. I think the main risk is data loss or mangling on
the local HPC, but there is also the data leakage risk of end users talking
to remote LLMs and sending info there that they should not.
my $.02 only
On Tue, May 19, 2026 at 8:14 AM Peter Clapham <pc7 at sanger.ac.uk> wrote:
> OK, so sticking my neck out a little here
>
> How are people covering the risks from AI agentic tools across their HPC
> platforms?
>
> Ducks and listens…
>
> Pete