[Beowulf] [External] SLURM - Where is this exit status coming from?
Prentice Bisbal
pbisbal at pppl.gov
Thu Aug 13 14:20:57 PDT 2020
I think you dialed the wrong number. We're the Beowulf people! Although,
I'm sure we can still help you. ;)
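One hint on decoding those numbers, though: the three values are most likely the same wait(2) status seen through different lenses. Here is a minimal sketch in Python, using the 35072 from the slurmd log; the OOM-killer interpretation at the end is an assumption on my part, not something your logs confirm:

```python
import os
import signal

# Raw wait(2) status from slurmd.log: "error:0 status 35072".
raw_status = 35072

# The low 7 bits are zero, so the process "exited normally" as far as
# wait(2) is concerned; the exit code lives in the high byte.  This is
# exactly the WEXITSTATUS that slurmctld.log reports.
assert os.WIFEXITED(raw_status)
exit_code = os.WEXITSTATUS(raw_status)   # (35072 >> 8) & 0xff
print(exit_code)                          # 137, matching slurmctld.log

# Shells report 128 + N when a child is killed by signal N, so an exit
# code of 137 usually means the job's process received SIGKILL (9) --
# e.g. from the kernel OOM killer or a cgroup memory limit (assumption).
print(exit_code - 128)                    # 9
assert exit_code - 128 == signal.SIGKILL
```

Whether sacct's 9:0 is that same signal number or 137 with the high bit masked off (137 & 0x7f == 9) I'm not certain; that part depends on how the accounting side derives its ExitCode field.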
--
Prentice
On 8/13/20 4:14 PM, Altemara, Anthony wrote:
>
> Cheers SLURM people,
>
> We’re seeing some intermittent job failures in our SLURM cluster, all
> with the same exit code, 137. I’m having difficulty determining
> whether this code is coming from SLURM (a timeout?) or from the Linux
> OS (process killed, perhaps for memory).
>
> In this example, there’s WEXITSTATUS 137 in slurmctld.log, "error:0
> status 35072" in slurmd.log, and ExitCode 9:0 in the accounting log….???
>
> Does anyone have insight into how these correlate? I’ve spent a
> significant amount of time digging through the documentation, and I
> don’t see a clear way to interpret them…
>
> Example: Job: 62791
>
> [root@XXXXXXXXXXXXX] /var/log/slurm# grep -ai jobid=62791 slurmctld.log
>
> [2020-08-13T10:58:28.599] _slurm_rpc_submit_batch_job: JobId=62791
> InitPrio=4294845347 usec=679
>
> [2020-08-13T10:58:29.080] sched: Allocate JobId=62791 NodeList=
> XXXXXXXXXXXXX #CPUs=1 Partition=normal
>
> [2020-08-13T11:17:45.275] _job_complete: JobId=62791 WEXITSTATUS 137
>
> [2020-08-13T11:17:45.294] _job_complete: JobId=62791 done
>
> [root@XXXXXXXXXXXXX] /var/log/slurm# grep 62791 slurmd.log
>
> [2020-08-13T10:58:29.090] _run_prolog: prolog with lock for job 62791
> ran for 0 seconds
>
> [2020-08-13T10:58:29.090] Launching batch job 62791 for UID 847694
>
> [2020-08-13T11:17:45.280] [62791.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 35072
>
> [2020-08-13T11:17:45.405] [62791.batch] done with job
>
> [root@XXXXXXXXXXXXX] /var/log/slurm# sacct -j 62791
>
> JobID JobName Partition Account AllocCPUS State
> ExitCode
>
> ------------ ---------- ---------- ---------- ---------- ----------
> --------
>
> 62791 nf-normal+ normal (null) 0 FAILED 9:0
>
> [root@XXXXXXXXXXXXX] /var/log/slurm# sacct -lc | tail -n 100 | grep 62791
>
> JobID UID JobName Partition NNodes NodeList State
> Start End Timelimit
>
> 62791 847694 nf-normal+ normal 1 XXXXXXXXXXX.+
> FAILED 2020-08-13T10:58:29 2020-08-13T11:17:45 UNLIMITED
>
> Thank you!
>
> Anthony
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov