[GE users] Job puts entire cluster into Error state over misplaced pid file? Help!

Bevan C. Bennett bevan at fulcrummicro.com
Fri Sep 7 19:06:11 BST 2007



Can anyone give me a clue as to why this might be happening?

Once or twice a week this occurs (a job can't find its pid file) and wipes out
the entire cluster.
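Until the root cause is found, we've been recovering by hand. A sketch of the
cleanup using standard SGE admin commands (the queue name and job id below are
taken from the logs further down; run as an SGE manager/operator):

```shell
# Show which queue instances are in Error state and why
qstat -f -explain E

# Clear the Error state on every instance of the affected queue
qmod -cq 'all.q@*'

# Optionally put a user hold on the offending job so it cannot knock
# queues over again while investigating
qalter -h u 835775
```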

Bevan C. Bennett wrote:
> We're running GE 6.1u2 and have started seeing a rather horrible situation.
> 
> Regularly now, a job will go wrong in some way. The job then sets its
> -queue- into the Error state and gets re-run on another queue instance,
> setting that queue into the Error state as well, and so on. The result is
> that the entire cluster is suddenly brought to a standstill by one job's
> problem.
> 
> Is there some easy way to have these errors correctly mark the 'job' as
> being in error rather than the queue, or, better yet, to stop them from
> occurring altogether? The directories referenced (/scratch/JID.1.QUEUE/) do
> not appear to exist for any job on any compute server, so I'm at a loss as
> to why these jobs want to look there. In general, the files appear in our
> spool directories, but without the queue name appended:
> [bevan at alexander grid]$ find . -name pid
> ./default/spool/gallium/active_jobs/487277.1/pid
> ./default/spool/caesium/active_jobs/832651.1/pid
> ./default/spool/iodine/active_jobs/838053.1/pid
> ./default/spool/tin/active_jobs/832986.1/pid
> ./default/spool/aluminium/active_jobs/838885.1/pid
> ./default/spool/ruthenium/active_jobs/832081.1/pid
> ./default/spool/ruthenium/active_jobs/834522.1/pid
> ./default/spool/arsenic/active_jobs/828264.1/pid
> ./default/spool/arsenic/active_jobs/838866.1/pid
> ./default/spool/promethium/active_jobs/833225.1/pid
> ./default/spool/ytterbium/active_jobs/827839.1/pid
> ./default/spool/osmium/active_jobs/827987.1/pid
> [bevan at alexander grid]$ find . -name job_pid
> ./default/spool/gallium/active_jobs/487277.1/job_pid
> ./default/spool/caesium/active_jobs/832651.1/job_pid
> ./default/spool/iodine/active_jobs/838053.1/job_pid
> ./default/spool/tin/active_jobs/832986.1/job_pid
> ./default/spool/aluminium/active_jobs/838885.1/job_pid
> ./default/spool/ruthenium/active_jobs/832081.1/job_pid
> ./default/spool/ruthenium/active_jobs/834522.1/job_pid
> ./default/spool/arsenic/active_jobs/828264.1/job_pid
> ./default/spool/arsenic/active_jobs/838866.1/job_pid
> ./default/spool/promethium/active_jobs/833225.1/job_pid
> ./default/spool/ytterbium/active_jobs/827839.1/job_pid
> ./default/spool/osmium/active_jobs/827987.1/job_pid
> 
> /scratch is set as our "tmp directory", if that helps...
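For context: the path in the error messages looks like it is built from the
queue's tmpdir setting plus JOB_ID.TASK_ID.QUEUE. A minimal sketch of that
composition, with a writability check to run on an affected node (the values
are the ones from the first failing job below; the tmpdir value is an
assumption taken from the queue configuration, see `qconf -sq all.q`):

```shell
# Assumed values, copied from the failing job in this thread
tmpdir=/scratch
JOB_ID=835775
SGE_TASK_ID=1
QUEUE=all.q

# The directory the shepherd apparently expects to find its pid file in
job_tmp="$tmpdir/$JOB_ID.$SGE_TASK_ID.$QUEUE"
echo "$job_tmp"

# On an affected node, check that the directory was actually created
# and is writable by the job owner
if [ -d "$job_tmp" ] && [ -w "$job_tmp" ]; then
    echo "tmpdir ok"
else
    echo "tmpdir missing or not writable"
fi
```

If /scratch itself is missing or not writable on some nodes, the shepherd
would be unable to create this directory, which would match the "No such file
or directory" errors.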
> 
> 
> 
> Two examples:
> 
> Job 835775 caused action: Queue "all.q at palladium.internal.avlsi.com" set to ERROR
> ...
> failed before job:08/27/2007 16:22:18 [0:25331]: can't open file
> /scratch/835775.1.all.q/pid: No such file or directory
> Shepherd trace:
> 08/27/2007 16:22:18 [5143:25331]: shepherd called with uid = 0, euid = 5143
> 08/27/2007 16:22:18 [5143:25331]: starting up 6.1u2
> 08/27/2007 16:22:18 [5143:25331]: setpgid(25331, 25331) returned 0
> 08/27/2007 16:22:18 [5143:25331]: no prolog script to start
> 08/27/2007 16:22:18 [5143:25332]: processing qlogin job
> 08/27/2007 16:22:18 [5143:25332]: pid=25332 pgrp=25332 sid=25332 old pgrp=25331
> getlogin()=<no login set>
> 08/27/2007 16:22:18 [5143:25332]: reading passwd information for user 'root'
> 08/27/2007 16:22:18 [5143:25331]: forked "job" with pid 25332
> 08/27/2007 16:22:18 [5143:25332]: setosjobid: uid = 0, euid = 5143
> 08/27/2007 16:22:18 [5143:25331]: child: job - pid: 25332
> 08/27/2007 16:22:18 [5143:25332]: setting limits
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_CPU setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_FSIZE setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_DATA setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_STACK setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_CORE setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: RLIMIT_RSS setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/27/2007 16:22:18 [5143:25332]: setting environment
> 08/27/2007 16:22:18 [5143:25332]: Initializing error file
> 08/27/2007 16:22:18 [5143:25332]: switching to intermediate/target user
> 08/27/2007 16:22:18 [9114:25332]: closing all filedescriptors
> 08/27/2007 16:22:18 [9114:25332]: further messages are in "error" and "trace"
> 08/27/2007 16:22:18 [0:25332]: now running with uid=0, euid=0
> 08/27/2007 16:22:18 [0:25332]: start qlogin
> 08/27/2007 16:22:18 [0:25332]: calling
> qlogin_starter(/mnt/fulcrum/local/common/grid-6.0/default/spool/palladium/active_jobs/835775.1,
> /usr/sbin/sshd-grid -i);
> 08/27/2007 16:22:18 [0:25332]: uid = 0, euid = 0, gid = 0, egid = 0
> 08/27/2007 16:22:18 [0:25332]: using sfd 1
> 08/27/2007 16:22:18 [0:25332]: bound to port 34327
> 08/27/2007 16:22:18 [0:25332]: write_to_qrsh - data =
> 0:34327:/usr/local/grid-6.0/utilbin/lx24-amd64:/mnt/fulcrum/local/common/grid-6.0/default/spool/palladium/active_jobs/835775.1:palladium.internal.avlsi.com
> 08/27/2007 16:22:18 [0:25332]: write_to_qrsh - address = napoleon:52957
> 08/27/2007 16:22:18 [0:25332]: write_to_qrsh - host = napoleon, port = 52957
> 08/27/2007 16:22:18 [0:25332]: waiting for connection.
> 08/27/2007 16:22:18 [0:25332]: accepted connection on fd 2
> 08/27/2007 16:22:18 [0:25332]: daemon to start: |/usr/sbin/sshd-grid -i|
> 08/27/2007 16:22:18 [5143:25331]: wait3 returned 25332 (status: 0; WIFSIGNALED:
> 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 08/27/2007 16:22:18 [5143:25331]: job exited with exit status 0
> 08/27/2007 16:22:18 [5143:25331]: reaped "job" with pid 25332
> 08/27/2007 16:22:18 [5143:25331]: job exited not due to signal
> 08/27/2007 16:22:18 [5143:25331]: job exited with status 0
> 08/27/2007 16:22:18 [0:25331]: can't open file /scratch/835775.1.all.q/pid: No
> such file or directory
> 08/27/2007 16:22:18 [0:25331]: write_to_qrsh - data = 1:can't open file
> /scratch/835775.1.all.q/pid: No such file or directory
> 08/27/2007 16:22:18 [0:25331]: write_to_qrsh - address = napoleon:52957
> 08/27/2007 16:22:18 [0:25331]: write_to_qrsh - host = napoleon, port = 52957
> 
> Shepherd error:
> 08/27/2007 16:22:18 [0:25331]: can't open file /scratch/835775.1.all.q/pid: No
> such file or directory
> 
> ------------------------------------------------------------------------------
> Job 838860 caused action: Queue "all.q at indium.internal.avlsi.com" set to ERROR
> ...
> failed before job:08/28/2007 15:59:05 [0:4829]: can't open file job_pid:
> Permission denied
> Shepherd trace:
> 08/28/2007 15:59:05 [5143:4828]: shepherd called with uid = 0, euid = 5143
> 08/28/2007 15:59:05 [5143:4828]: starting up 6.1u2
> 08/28/2007 15:59:05 [5143:4828]: setpgid(4828, 4828) returned 0
> 08/28/2007 15:59:05 [5143:4828]: no prolog script to start
> 08/28/2007 15:59:05 [5143:4829]: processing qlogin job
> 08/28/2007 15:59:05 [5143:4829]: pid=4829 pgrp=4829 sid=4829 old pgrp=4828
> getlogin()=<no login set>
> 08/28/2007 15:59:05 [5143:4829]: reading passwd information for user 'root'
> 08/28/2007 15:59:05 [5143:4829]: setosjobid: uid = 0, euid = 5143
> 08/28/2007 15:59:05 [5143:4828]: forked "job" with pid 4829
> 08/28/2007 15:59:05 [5143:4829]: setting limits
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_CPU setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_FSIZE setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_DATA setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_STACK setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_CORE setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: RLIMIT_RSS setting: (soft 18446744073709551615
> hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
> 18446744073709551615)
> 08/28/2007 15:59:05 [5143:4829]: setting environment
> 08/28/2007 15:59:05 [5143:4828]: child: job - pid: 4829
> 08/28/2007 15:59:05 [5143:4829]: Initializing error file
> 08/28/2007 15:59:05 [5143:4829]: switching to intermediate/target user
> 08/28/2007 15:59:05 [517:4829]: closing all filedescriptors
> 08/28/2007 15:59:05 [517:4829]: further messages are in "error" and "trace"
> 08/28/2007 15:59:05 [0:4829]: now running with uid=0, euid=0
> 08/28/2007 15:59:05 [0:4829]: start qlogin
> 08/28/2007 15:59:05 [0:4829]: calling
> qlogin_starter(/mnt/fulcrum/local/common/grid-6.0/default/spool/indium/active_jobs/838860.1,
> /usr/sbin/sshd-grid -i);
> 08/28/2007 15:59:05 [0:4829]: uid = 0, euid = 0, gid = 0, egid = 0
> 08/28/2007 15:59:05 [0:4829]: using sfd 1
> 08/28/2007 15:59:05 [0:4829]: bound to port 56480
> 08/28/2007 15:59:05 [0:4829]: write_to_qrsh - data =
> 0:56480:/usr/local/grid-6.0/utilbin/lx24-amd64:/mnt/fulcrum/local/common/grid-6.0/default/spool/indium/active_jobs/838860.1:indium.internal.avlsi.com
> 08/28/2007 15:59:05 [0:4829]: write_to_qrsh - address = mithridates:58471
> 08/28/2007 15:59:05 [0:4829]: write_to_qrsh - host = mithridates, port = 58471
> 08/28/2007 15:59:05 [0:4829]: error connecting stream socket: Connection refused
> 08/28/2007 15:59:05 [0:4829]: communication with qrsh failed
> 08/28/2007 15:59:05 [0:4829]: forked "job" with pid 0
> 08/28/2007 15:59:05 [0:4829]: can't open file job_pid: Permission denied
> 08/28/2007 15:59:05 [0:4829]: write_to_qrsh - data = 1:can't open file job_pid:
> Permission denied
> 08/28/2007 15:59:05 [0:4829]: write_to_qrsh - address = mithridates
> 08/28/2007 15:59:05 [0:4829]: illegal value for qrsh_control_port:
> "mithridates". Should be host:port
> 08/28/2007 15:59:05 [5143:4828]: wait3 returned 4829 (status: 2816; WIFSIGNALED:
> 0,  WIFEXITED: 1, WEXITSTATUS: 11)
> 08/28/2007 15:59:05 [5143:4828]: job exited with exit status 11
> 08/28/2007 15:59:05 [5143:4828]: reaped "job" with pid 4829
> 08/28/2007 15:59:05 [5143:4828]: job exited not due to signal
> 08/28/2007 15:59:05 [5143:4828]: job exited with status 11
> 08/28/2007 15:59:05 [0:4828]: can't open file /scratch/838860.1.all.q/pid: No
> such file or directory
> 08/28/2007 15:59:05 [0:4828]: write_to_qrsh - data = 1:can't open file
> /scratch/838860.1.all.q/pid: No such file or directory
> 08/28/2007 15:59:05 [0:4828]: write_to_qrsh - address = mithridates:58471
> 08/28/2007 15:59:05 [0:4828]: write_to_qrsh - host = mithridates, port = 58471
> 08/28/2007 15:59:05 [0:4828]: error connecting stream socket: Connection refused
> 
> Shepherd error:
> 08/28/2007 15:59:05 [0:4829]: can't open file job_pid: Permission denied
> 08/28/2007 15:59:05 [0:4828]: can't open file /scratch/838860.1.all.q/pid: No
> such file or directory
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
