[GE users] Job puts entire cluster into Error state over misplaced pid file? Help!

Beadles, Jeff jeff_beadles at mentor.com
Fri Sep 7 21:23:29 BST 2007


Have you tried looking in the messages file in the spool directory for the execution host? It should state the reason the system was put into an error state.
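
A rough sketch of that check, under a few assumptions that this thread does not confirm: that the execd spool for each host sits at <spool root>/<short hostname>/messages (consistent with the default/spool/<host>/active_jobs paths quoted below), and that each entry is pipe-delimited with the severity letter (E for error) in the fourth field. The function name show_execd_errors is made up for illustration.

```shell
#!/bin/sh
# Print the error-level entries from one execution host's messages file.
# Assumed layout:  <spool_root>/<host>/messages
# Assumed format:  date time|daemon|host|LEVEL|message   (LEVEL = E, W, I, ...)
show_execd_errors() {
    spool_root=$1   # e.g. /mnt/fulcrum/local/common/grid-6.0/default/spool
    host=$2         # short host name, e.g. palladium
    msgfile="$spool_root/$host/messages"

    if [ ! -r "$msgfile" ]; then
        echo "cannot read $msgfile" >&2
        return 1
    fi

    # Keep only the lines whose fourth pipe-delimited field is exactly "E".
    awk -F'|' '$4 == "E"' "$msgfile"
}
```

For example, "show_execd_errors /mnt/fulcrum/local/common/grid-6.0/default/spool palladium | tail -n 20" would list the most recent errors logged on palladium. If the format assumption does not hold on your version, a plain "tail" of the same file works just as well. "qstat -f -explain E" should also report the reason next to each queue instance that is in the Error state.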
 
Regards,  -Jeff

________________________________

From: Bevan C. Bennett [mailto:bevan at fulcrummicro.com]
Sent: Fri 9/7/2007 11:06 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Job puts entire cluster into Error state over misplaced pid file? Help!



Can anyone give me a clue of why this might be happening?

Once or twice a week this occurs (a job can't find its pid file) and wipes out
the entire cluster.

Bevan C. Bennett wrote:
> We're running GE 6.1u2 and have started seeing a rather horrible situation.
>
> Regularly now, a job will run with something wrong with it. The job then sets the
> -queue- into the Error state and gets re-run, setting that queue into the Error
> state, etc. The result is that the entire cluster gets suddenly brought to a
> standstill by one job's problem.
>
> Is there some easy way to have these errors put the 'job' into the Error state
> rather than the queue or, better yet, stop these from occurring
> altogether? The directories referenced (/scratch/JID.1.QUEUE/) do not appear to
> exist for any job on any compute server, so I'm at a loss as to why these jobs
> want to look there. In general, the files appear in our spool directories, but
> without the queue name appended:
> [bevan at alexander grid]$ find . -name pid
> ./default/spool/gallium/active_jobs/487277.1/pid
> ./default/spool/caesium/active_jobs/832651.1/pid
> ./default/spool/iodine/active_jobs/838053.1/pid
> ./default/spool/tin/active_jobs/832986.1/pid
> ./default/spool/aluminium/active_jobs/838885.1/pid
> ./default/spool/ruthenium/active_jobs/832081.1/pid
> ./default/spool/ruthenium/active_jobs/834522.1/pid
> ./default/spool/arsenic/active_jobs/828264.1/pid
> ./default/spool/arsenic/active_jobs/838866.1/pid
> ./default/spool/promethium/active_jobs/833225.1/pid
> ./default/spool/ytterbium/active_jobs/827839.1/pid
> ./default/spool/osmium/active_jobs/827987.1/pid
> [bevan at alexander grid]$ find . -name job_pid
> ./default/spool/gallium/active_jobs/487277.1/job_pid
> ./default/spool/caesium/active_jobs/832651.1/job_pid
> ./default/spool/iodine/active_jobs/838053.1/job_pid
> ./default/spool/tin/active_jobs/832986.1/job_pid
> ./default/spool/aluminium/active_jobs/838885.1/job_pid
> ./default/spool/ruthenium/active_jobs/832081.1/job_pid
> ./default/spool/ruthenium/active_jobs/834522.1/job_pid
> ./default/spool/arsenic/active_jobs/828264.1/job_pid
> ./default/spool/arsenic/active_jobs/838866.1/job_pid
> ./default/spool/promethium/active_jobs/833225.1/job_pid
> ./default/spool/ytterbium/active_jobs/827839.1/job_pid
> ./default/spool/osmium/active_jobs/827987.1/job_pid
>
> /scratch is set as our "tmp directory", if that helps...
>
>
>
> Two examples:
>
> Job 835775 caused action: Queue "all.q at palladium.internal.avlsi.com" set to ERROR
> ...
> failed before job:08/27/2007 16:22:18 [0:25331]: can't open file
> /scratch/835775.1.all.q/pid: No such file or director
> Shepherd trace:
> 08/27/2007 16:22:18 [5143:25331]: shepherd called with uid = 0, euid = 5143
> 08/27/2007 16:22:18 [5143:25331]: starting up 6.1u2
> 08/27/2007 16:22:18 [5143:25331]: setpgid(25331, 25331) returned 0
> 08/27/2007 16:22:18 [5143:25331]: no prolog script to start
> 08/27/2007 16:22:18 [5143:25332]: processing qlogin job
> 08/27/2007 16:22:18 [5143:25332]: pid=25332 pgrp=25332 sid=25332 old pgrp=25331
> getlogin()=<no login set>
> 08/27/2007 16:22:18 [5143:25332]: reading passwd information for user 'root'
> 08/27/2007 16:22:18 [5143:25331]: forked "job" with pid 25332
> 08/27/2007 16:22:18 [5143:25332]: setosjobid: uid = 0, euid = 5143
> 08/27/2007 16:22:18 [5143:25331]: child: job - pid: 25332
> 08/27/2007 16:22:18 [5143:25332]: setting limits
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_CPU setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_FSIZE setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_DATA setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_STACK setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_CORE setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_RSS setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: setting environment
> 08/27/2007 16:22:18 [5143:25332]: Initializing error file
> 08/27/2007 16:22:18 [5143:25332]: switching to intermediate/target user
> 08/27/2007 16:22:18 [9114:25332]: closing all filedescriptors
> 08/27/2007 16:22:18 [9114:25332]: further messages are in "error" and "trace"
> 08/27/2007 16:22:18 [0:25332]: now running with uid=0, euid=0
> 08/27/2007 16:22:18 [0:25332]: start qlogin
> 08/27/2007 16:22:18 [0:25332]: calling
> qlogin_starter(/mnt/fulcrum/local/common/grid-6.0/default/spool/palladium/active_jobs/835775.1,
> /usr/sbin/sshd-grid -i);
> 08/27/2007 16:22:18 [0:25332]: uid = 0, euid = 0, gid = 0, egid = 0
> 08/27/2007 16:22:18 [0:25332]: using sfd 1
> 08/27/2007 16:22:18 [0:25332]: bound to port 34327
> 08/27/2007 16:22:18 [0:25332]: write_to_qrsh - data =
> 0:34327:/usr/local/grid-6.0/utilbin/lx24-amd64:/mnt/fulcrum/local/common/grid-6.0/default/spool/palladium/active_jobs/835775.1:palladium.internal.avlsi.com
> 08/27/2007 16:22:18 [0:25332]: write_to_qrsh - address = napoleon:52957
> 08/27/2007 16:22:18 [0:25332]: write_to_qrsh - host = napoleon, port = 52957
> 08/27/2007 16:22:18 [0:25332]: waiting for connection.
> 08/27/2007 16:22:18 [0:25332]: accepted connection on fd 2
> 08/27/2007 16:22:18 [0:25332]: daemon to start: |/usr/sbin/sshd-grid -i|
> 08/27/2007 16:22:18 [5143:25331]: wait3 returned 25332 (status: 0; WIFSIGNALED:
> 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 08/27/2007 16:22:18 [5143:25331]: job exited with exit status 0
> 08/27/2007 16:22:18 [5143:25331]: reaped "job" with pid 25332
> 08/27/2007 16:22:18 [5143:25331]: job exited not due to signal
> 08/27/2007 16:22:18 [5143:25331]: job exited with status 0
> 08/27/2007 16:22:18 [0:25331]: can't open file /scratch/835775.1.all.q/pid: No
> such file or directory
> 08/27/2007 16:22:18 [0:25331]: write_to_qrsh - data = 1:can't open file
> /scratch/835775.1.all.q/pid: No such file or directory
> 08/27/2007 16:22:18 [0:25331]: write_to_qrsh - address = napoleon:52957
> 08/27/2007 16:22:18 [0:25331]: write_to_qrsh - host = napoleon, port = 52957
>
> Shepherd error:
> 08/27/2007 16:22:18 [0:25331]: can't open file /scratch/835775.1.all.q/pid: No
> such file or directory
>
> ------------------------------------------------------------------------------
> Job 838860 caused action: Queue "all.q at indium.internal.avlsi.com" set to ERROR
> ...
> failed before job:08/28/2007 15:59:05 [0:4829]: can't open file job_pid:
> Permission denied
> Shepherd trace:
> 08/28/2007 15:59:05 [5143:4828]: shepherd called with uid = 0, euid = 5143
> 08/28/2007 15:59:05 [5143:4828]: starting up 6.1u2
> 08/28/2007 15:59:05 [5143:4828]: setpgid(4828, 4828) returned 0
> 08/28/2007 15:59:05 [5143:4828]: no prolog script to start
> 08/28/2007 15:59:05 [5143:4829]: processing qlogin job
> 08/28/2007 15:59:05 [5143:4829]: pid=4829 pgrp=4829 sid=4829 old pgrp=4828
> getlogin()=<no login set>
> 08/28/2007 15:59:05 [5143:4829]: reading passwd information for user 'root'
> 08/28/2007 15:59:05 [5143:4829]: setosjobid: uid = 0, euid = 5143
> 08/28/2007 15:59:05 [5143:4828]: forked "job" with pid 4829
> 08/28/2007 15:59:05 [5143:4829]: setting limits
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_CPU setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_FSIZE setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_DATA setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_STACK setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_CORE setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_RSS setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: setting environment
> 08/28/2007 15:59:05 [5143:4828]: child: job - pid: 4829
> 08/28/2007 15:59:05 [5143:4829]: Initializing error file
> 08/28/2007 15:59:05 [5143:4829]: switching to intermediate/target user
> 08/28/2007 15:59:05 [517:4829]: closing all filedescriptors
> 08/28/2007 15:59:05 [517:4829]: further messages are in "error" and "trace"
> 08/28/2007 15:59:05 [0:4829]: now running with uid=0, euid=0
> 08/28/2007 15:59:05 [0:4829]: start qlogin
> 08/28/2007 15:59:05 [0:4829]: calling
> qlogin_starter(/mnt/fulcrum/local/common/grid-6.0/default/spool/indium/active_jobs/838860.1,
> /usr/sbin/sshd-grid -i);
> 08/28/2007 15:59:05 [0:4829]: uid = 0, euid = 0, gid = 0, egid = 0
> 08/28/2007 15:59:05 [0:4829]: using sfd 1
> 08/28/2007 15:59:05 [0:4829]: bound to port 56480
> 08/28/2007 15:59:05 [0:4829]: write_to_qrsh - data =
> 0:56480:/usr/local/grid-6.0/utilbin/lx24-amd64:/mnt/fulcrum/local/common/grid-6.0/default/spool/indium/active_jobs/838860.1:indium.internal.avlsi.com
> 08/28/2007 15:59:05 [0:4829]: write_to_qrsh - address = mithridates:58471
> 08/28/2007 15:59:05 [0:4829]: write_to_qrsh - host = mithridates, port = 58471
> 08/28/2007 15:59:05 [0:4829]: error connecting stream socket: Connection refused
> 08/28/2007 15:59:05 [0:4829]: communication with qrsh failed
> 08/28/2007 15:59:05 [0:4829]: forked "job" with pid 0
> 08/28/2007 15:59:05 [0:4829]: can't open file job_pid: Permission denied
> 08/28/2007 15:59:05 [0:4829]: write_to_qrsh - data = 1:can't open file job_pid:
> Permission denied
> 08/28/2007 15:59:05 [0:4829]: write_to_qrsh - address = mithridates
> 08/28/2007 15:59:05 [0:4829]: illegal value for qrsh_control_port:
> "mithridates". Should be host:port
> 08/28/2007 15:59:05 [5143:4828]: wait3 returned 4829 (status: 2816; WIFSIGNALED:
> 0,  WIFEXITED: 1, WEXITSTATUS: 11)
> 08/28/2007 15:59:05 [5143:4828]: job exited with exit status 11
> 08/28/2007 15:59:05 [5143:4828]: reaped "job" with pid 4829
> 08/28/2007 15:59:05 [5143:4828]: job exited not due to signal
> 08/28/2007 15:59:05 [5143:4828]: job exited with status 11
> 08/28/2007 15:59:05 [0:4828]: can't open file /scratch/838860.1.all.q/pid: No
> such file or directory
> 08/28/2007 15:59:05 [0:4828]: write_to_qrsh - data = 1:can't open file
> /scratch/838860.1.all.q/pid: No such file or directory
> 08/28/2007 15:59:05 [0:4828]: write_to_qrsh - address = mithridates:58471
> 08/28/2007 15:59:05 [0:4828]: write_to_qrsh - host = mithridates, port = 58471
> 08/28/2007 15:59:05 [0:4828]: error connecting stream socket: Connection refused
>
> Shepherd error:
> 08/28/2007 15:59:05 [0:4829]: can't open file job_pid: Permission denied
> 08/28/2007 15:59:05 [0:4828]: can't open file /scratch/838860.1.all.q/pid: No
> such file or directory
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
