[GE users] Job puts entire cluster into Error state over misplaced pid file? Help!

Bevan C. Bennett bevan at fulcrummicro.com
Wed Aug 29 00:57:11 BST 2007



We're running GE 6.1u2 and have started seeing a rather horrible situation.

Regularly now, a job will start with something wrong with it. The job then puts
the -queue- it ran on into the Error state and gets re-run, putting the next
queue into the Error state, and so on. The result is that the entire cluster is
suddenly brought to a standstill by one job's problem.
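
For reference, a minimal sketch of how the affected queues can be listed. It
assumes `qstat -f` prints one row per queue instance with the state in the
sixth column (queuename, qtype, used/tot., load_avg, arch, states); the
function name error_queues is made up for illustration, and the column layout
may differ between releases.

```shell
#!/bin/sh
# Print the queue instances whose state column contains "E" (error).
# Assumed row layout: queuename qtype used/tot. load_avg arch states
# Rows without a sixth field have no state set and are skipped, as are
# header and job rows, whose first field contains no "@".
error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Possible use on a live cluster (not run here):
#   qstat -f | error_queues
#   qstat -f -explain E     # show the reason each queue is in error
#   qmod -cq "all.q@palladium.internal.avlsi.com"   # clear one instance
```

Clearing the state this way only treats the symptom; the next broken job
would set the queue to Error again.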

Is there some easy way to have these errors put the job, rather than the queue,
into the Error state or, better yet, to stop them from occurring altogether?
The directories referenced (/scratch/JID.1.QUEUE/) do not appear to exist for
any job on any compute server, so I'm at a loss as to why these jobs want to
look there. In general, the pid files appear in our spool directories, but
without the queue name appended:
[bevan@alexander grid]$ find . -name pid
./default/spool/gallium/active_jobs/487277.1/pid
./default/spool/caesium/active_jobs/832651.1/pid
./default/spool/iodine/active_jobs/838053.1/pid
./default/spool/tin/active_jobs/832986.1/pid
./default/spool/aluminium/active_jobs/838885.1/pid
./default/spool/ruthenium/active_jobs/832081.1/pid
./default/spool/ruthenium/active_jobs/834522.1/pid
./default/spool/arsenic/active_jobs/828264.1/pid
./default/spool/arsenic/active_jobs/838866.1/pid
./default/spool/promethium/active_jobs/833225.1/pid
./default/spool/ytterbium/active_jobs/827839.1/pid
./default/spool/osmium/active_jobs/827987.1/pid
[bevan@alexander grid]$ find . -name job_pid
./default/spool/gallium/active_jobs/487277.1/job_pid
./default/spool/caesium/active_jobs/832651.1/job_pid
./default/spool/iodine/active_jobs/838053.1/job_pid
./default/spool/tin/active_jobs/832986.1/job_pid
./default/spool/aluminium/active_jobs/838885.1/job_pid
./default/spool/ruthenium/active_jobs/832081.1/job_pid
./default/spool/ruthenium/active_jobs/834522.1/job_pid
./default/spool/arsenic/active_jobs/828264.1/job_pid
./default/spool/arsenic/active_jobs/838866.1/job_pid
./default/spool/promethium/active_jobs/833225.1/job_pid
./default/spool/ytterbium/active_jobs/827839.1/job_pid
./default/spool/osmium/active_jobs/827987.1/job_pid

/scratch is set as our "tmp directory", if that helps...
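
A rough sketch of the check described above, to be run on a compute host
while a job is active. It builds the path in the form that appears in the
shepherd errors (tmpdir/JID.TASK.QUEUE) and reports what is missing. The
function names are hypothetical, and the naming scheme is inferred only from
the error messages, not from the GE documentation.

```shell
#!/bin/sh
# Build the per-job scratch path in the form seen in the shepherd errors:
#   <tmpdir>/<job_id>.<task_id>.<queue>
job_tmpdir_path() {
    printf '%s/%s.%s.%s\n' "$1" "$2" "$3" "$4"
}

# Report whether that directory and its pid file exist on this host.
check_job_tmpdir() {
    dir=$(job_tmpdir_path "$@")
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
    elif [ ! -f "$dir/pid" ]; then
        echo "missing pid file: $dir/pid"
    else
        echo "ok: $dir/pid"
    fi
}

# Possible use (not run here):
#   qconf -sq all.q | grep tmpdir        # confirm the queue's tmpdir setting
#   check_job_tmpdir /scratch 835775 1 all.q
```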



Two examples:

Job 835775 caused action: Queue "all.q@palladium.internal.avlsi.com" set to ERROR
...
failed before job:08/27/2007 16:22:18 [0:25331]: can't open file
/scratch/835775.1.all.q/pid: No such file or director
Shepherd trace:
08/27/2007 16:22:18 [5143:25331]: shepherd called with uid = 0, euid = 5143
08/27/2007 16:22:18 [5143:25331]: starting up 6.1u2
08/27/2007 16:22:18 [5143:25331]: setpgid(25331, 25331) returned 0
08/27/2007 16:22:18 [5143:25331]: no prolog script to start
08/27/2007 16:22:18 [5143:25332]: processing qlogin job
08/27/2007 16:22:18 [5143:25332]: pid=25332 pgrp=25332 sid=25332 old pgrp=25331
getlogin()=<no login set>
08/27/2007 16:22:18 [5143:25332]: reading passwd information for user 'root'
08/27/2007 16:22:18 [5143:25331]: forked "job" with pid 25332
08/27/2007 16:22:18 [5143:25332]: setosjobid: uid = 0, euid = 5143
08/27/2007 16:22:18 [5143:25331]: child: job - pid: 25332
08/27/2007 16:22:18 [5143:25332]: setting limits
08/27/2007 16:22:18 [5143:25332]: RLIMIT_CPU setting: (soft 18446744073709551615
hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
18446744073709551615)
08/27/2007 16:22:18 [5143:25332]: RLIMIT_FSIZE setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
08/27/2007 16:22:18 [5143:25332]: RLIMIT_DATA setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
08/27/2007 16:22:18 [5143:25332]: RLIMIT_STACK setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
08/27/2007 16:22:18 [5143:25332]: RLIMIT_CORE setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
08/27/2007 16:22:18 [5143:25332]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
08/27/2007 16:22:18 [5143:25332]: RLIMIT_RSS setting: (soft 18446744073709551615
hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
18446744073709551615)
08/27/2007 16:22:18 [5143:25332]: setting environment
08/27/2007 16:22:18 [5143:25332]: Initializing error file
08/27/2007 16:22:18 [5143:25332]: switching to intermediate/target user
08/27/2007 16:22:18 [9114:25332]: closing all filedescriptors
08/27/2007 16:22:18 [9114:25332]: further messages are in "error" and "trace"
08/27/2007 16:22:18 [0:25332]: now running with uid=0, euid=0
08/27/2007 16:22:18 [0:25332]: start qlogin
08/27/2007 16:22:18 [0:25332]: calling
qlogin_starter(/mnt/fulcrum/local/common/grid-6.0/default/spool/palladium/active_jobs/835775.1,
/usr/sbin/sshd-grid -i);
08/27/2007 16:22:18 [0:25332]: uid = 0, euid = 0, gid = 0, egid = 0
08/27/2007 16:22:18 [0:25332]: using sfd 1
08/27/2007 16:22:18 [0:25332]: bound to port 34327
08/27/2007 16:22:18 [0:25332]: write_to_qrsh - data =
0:34327:/usr/local/grid-6.0/utilbin/lx24-amd64:/mnt/fulcrum/local/common/grid-6.0/default/spool/palladium/active_jobs/835775.1:palladium.internal.avlsi.com
08/27/2007 16:22:18 [0:25332]: write_to_qrsh - address = napoleon:52957
08/27/2007 16:22:18 [0:25332]: write_to_qrsh - host = napoleon, port = 52957
08/27/2007 16:22:18 [0:25332]: waiting for connection.
08/27/2007 16:22:18 [0:25332]: accepted connection on fd 2
08/27/2007 16:22:18 [0:25332]: daemon to start: |/usr/sbin/sshd-grid -i|
08/27/2007 16:22:18 [5143:25331]: wait3 returned 25332 (status: 0; WIFSIGNALED:
0,  WIFEXITED: 1, WEXITSTATUS: 0)
08/27/2007 16:22:18 [5143:25331]: job exited with exit status 0
08/27/2007 16:22:18 [5143:25331]: reaped "job" with pid 25332
08/27/2007 16:22:18 [5143:25331]: job exited not due to signal
08/27/2007 16:22:18 [5143:25331]: job exited with status 0
08/27/2007 16:22:18 [0:25331]: can't open file /scratch/835775.1.all.q/pid: No
such file or directory
08/27/2007 16:22:18 [0:25331]: write_to_qrsh - data = 1:can't open file
/scratch/835775.1.all.q/pid: No such file or directory
08/27/2007 16:22:18 [0:25331]: write_to_qrsh - address = napoleon:52957
08/27/2007 16:22:18 [0:25331]: write_to_qrsh - host = napoleon, port = 52957

Shepherd error:
08/27/2007 16:22:18 [0:25331]: can't open file /scratch/835775.1.all.q/pid: No
such file or directory

------------------------------------------------------------------------------
Job 838860 caused action: Queue "all.q@indium.internal.avlsi.com" set to ERROR
...
failed before job:08/28/2007 15:59:05 [0:4829]: can't open file job_pid:
Permission denied
Shepherd trace:
08/28/2007 15:59:05 [5143:4828]: shepherd called with uid = 0, euid = 5143
08/28/2007 15:59:05 [5143:4828]: starting up 6.1u2
08/28/2007 15:59:05 [5143:4828]: setpgid(4828, 4828) returned 0
08/28/2007 15:59:05 [5143:4828]: no prolog script to start
08/28/2007 15:59:05 [5143:4829]: processing qlogin job
08/28/2007 15:59:05 [5143:4829]: pid=4829 pgrp=4829 sid=4829 old pgrp=4828
getlogin()=<no login set>
08/28/2007 15:59:05 [5143:4829]: reading passwd information for user 'root'
08/28/2007 15:59:05 [5143:4829]: setosjobid: uid = 0, euid = 5143
08/28/2007 15:59:05 [5143:4828]: forked "job" with pid 4829
08/28/2007 15:59:05 [5143:4829]: setting limits
08/28/2007 15:59:05 [5143:4829]: RLIMIT_CPU setting: (soft 18446744073709551615
hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
18446744073709551615)
08/28/2007 15:59:05 [5143:4829]: RLIMIT_FSIZE setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
08/28/2007 15:59:05 [5143:4829]: RLIMIT_DATA setting: (soft 18446744073709551615
hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
18446744073709551615)
08/28/2007 15:59:05 [5143:4829]: RLIMIT_STACK setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
08/28/2007 15:59:05 [5143:4829]: RLIMIT_CORE setting: (soft 18446744073709551615
hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
18446744073709551615)
08/28/2007 15:59:05 [5143:4829]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
08/28/2007 15:59:05 [5143:4829]: RLIMIT_RSS setting: (soft 18446744073709551615
hard 18446744073709551615) resulting: (soft 18446744073709551615 hard
18446744073709551615)
08/28/2007 15:59:05 [5143:4829]: setting environment
08/28/2007 15:59:05 [5143:4828]: child: job - pid: 4829
08/28/2007 15:59:05 [5143:4829]: Initializing error file
08/28/2007 15:59:05 [5143:4829]: switching to intermediate/target user
08/28/2007 15:59:05 [517:4829]: closing all filedescriptors
08/28/2007 15:59:05 [517:4829]: further messages are in "error" and "trace"
08/28/2007 15:59:05 [0:4829]: now running with uid=0, euid=0
08/28/2007 15:59:05 [0:4829]: start qlogin
08/28/2007 15:59:05 [0:4829]: calling
qlogin_starter(/mnt/fulcrum/local/common/grid-6.0/default/spool/indium/active_jobs/838860.1,
/usr/sbin/sshd-grid -i);
08/28/2007 15:59:05 [0:4829]: uid = 0, euid = 0, gid = 0, egid = 0
08/28/2007 15:59:05 [0:4829]: using sfd 1
08/28/2007 15:59:05 [0:4829]: bound to port 56480
08/28/2007 15:59:05 [0:4829]: write_to_qrsh - data =
0:56480:/usr/local/grid-6.0/utilbin/lx24-amd64:/mnt/fulcrum/local/common/grid-6.0/default/spool/indium/active_jobs/838860.1:indium.internal.avlsi.com
08/28/2007 15:59:05 [0:4829]: write_to_qrsh - address = mithridates:58471
08/28/2007 15:59:05 [0:4829]: write_to_qrsh - host = mithridates, port = 58471
08/28/2007 15:59:05 [0:4829]: error connecting stream socket: Connection refused
08/28/2007 15:59:05 [0:4829]: communication with qrsh failed
08/28/2007 15:59:05 [0:4829]: forked "job" with pid 0
08/28/2007 15:59:05 [0:4829]: can't open file job_pid: Permission denied
08/28/2007 15:59:05 [0:4829]: write_to_qrsh - data = 1:can't open file job_pid:
Permission denied
08/28/2007 15:59:05 [0:4829]: write_to_qrsh - address = mithridates
08/28/2007 15:59:05 [0:4829]: illegal value for qrsh_control_port:
"mithridates". Should be host:port
08/28/2007 15:59:05 [5143:4828]: wait3 returned 4829 (status: 2816; WIFSIGNALED:
0,  WIFEXITED: 1, WEXITSTATUS: 11)
08/28/2007 15:59:05 [5143:4828]: job exited with exit status 11
08/28/2007 15:59:05 [5143:4828]: reaped "job" with pid 4829
08/28/2007 15:59:05 [5143:4828]: job exited not due to signal
08/28/2007 15:59:05 [5143:4828]: job exited with status 11
08/28/2007 15:59:05 [0:4828]: can't open file /scratch/838860.1.all.q/pid: No
such file or directory
08/28/2007 15:59:05 [0:4828]: write_to_qrsh - data = 1:can't open file
/scratch/838860.1.all.q/pid: No such file or directory
08/28/2007 15:59:05 [0:4828]: write_to_qrsh - address = mithridates:58471
08/28/2007 15:59:05 [0:4828]: write_to_qrsh - host = mithridates, port = 58471
08/28/2007 15:59:05 [0:4828]: error connecting stream socket: Connection refused

Shepherd error:
08/28/2007 15:59:05 [0:4829]: can't open file job_pid: Permission denied
08/28/2007 15:59:05 [0:4828]: can't open file /scratch/838860.1.all.q/pid: No
such file or directory
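
As a side note on reading the trace: the raw wait3 status of 2816 and the
reported WEXITSTATUS of 11 are the same value, since on Linux the exit code
sits in bits 8-15 of the status word and the terminating signal in the low
7 bits. A small sketch of that arithmetic (the helper names are made up):

```shell
#!/bin/sh
# Exit code: bits 8-15 of the raw wait status (Linux encoding).
wexitstatus() { echo $(( ($1 >> 8) & 255 )); }

# Terminating signal: low 7 bits of the raw wait status (0 if none).
wtermsig() { echo $(( $1 & 127 )); }

wexitstatus 2816   # prints 11
wtermsig 2816      # prints 0
```

So the second job's child exited on its own with code 11, whereas the first
job's child exited with 0 and the failure was logged only afterwards, by the
parent shepherd (pid 25331).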
