[Beowulf] lustre / pytorch
plegresl at gmail.com
Fri Jul 12 17:37:41 UTC 2024
I’ve never seen any difficulties with PyTorch saving checkpoint files to Lustre. Is it a special file format or just torch.save()? When the processes hang, have you tried using something like py-spy and/or gdb to get a stack trace of where in the software stack it’s hung?
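If attaching py-spy or gdb to every rank is awkward, a rough alternative is to have each process dump its own Python stacks when a checkpoint write runs too long. The sketch below uses only the standard-library faulthandler module. The helper name dump_stacks_if_stuck, the log path, and the timeout are all made up for illustration, and it only shows Python-level frames, so gdb is still needed to see where a thread is blocked below the interpreter.

```python
import faulthandler
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_stuck(log_path, timeout_s=300.0):
    """Dump all Python thread stacks to log_path if the wrapped block
    is still running after timeout_s seconds.

    Hypothetical helper, not part of PyTorch. The log should live on a
    node-local disk (an assumption here) so that writing the dump does
    not depend on the file system being diagnosed.
    """
    with open(log_path, "a") as log:
        # A watchdog thread writes the tracebacks straight to the file
        # descriptor; repeat=True dumps again every timeout_s seconds.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log)
        try:
            yield
        finally:
            # Always disarm the watchdog before the log file is closed.
            faulthandler.cancel_dump_traceback_later()
```

Wrapping the save call, e.g. `with dump_stacks_if_stuck("/tmp/ckpt-stacks.log"): torch.save(state, path)`, should leave a traceback in the log showing which Python call the process is sitting in when it stalls. Each rank would need its own log file.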
> Date: Thu, 11 Jul 2024 12:25:05 -0400
> From: Michael DiDomenico <mdidomenico4 at gmail.com>
> To: Beowulf Mailing List <Beowulf at beowulf.org>
> Subject: [Beowulf] lustre / pytorch
> Message-ID:
> <CABOsP2P7L4J8kJQRqxC9U_yJ3MLjhj68Z6fy17O5+E0WeEyUww at mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> i have a strange problem, but honestly i'm not sure where the issue
> is. we have users running LLM models through pytorch. part of the
> process saves off checkpoints at periodic intervals. when the
> checkpoint files are being written we can see in the logs pytorch
> writing out the save files from each of the processes to lustre.
>
> it chugs along for a little bit, but then comes to a grinding halt.
> no error from pytorch is logged and no errors can be found on the
> lustre clients or servers. the problem is also not transient, it
> happens every time the process runs
>
> the weird part is, if we switch the output directory from lustre to
> nfs (netapp backed), the pytorch run works perfectly fine
>
> has anyone seen anything like this? any suggestions on
> troubleshooting the issue?
>
> given that we have a 10x performance difference between netapp and
> lustre, i'm pretty keen on getting this fixed