[GE users] 5.3p6 hang due to failed job (got NULL element for QU_rerun)

Bryan Bayerdorffer bryan.bayerdorffer at analog.com
Wed Jun 9 17:56:45 BST 2004


A job failed today for unknown reasons.  This caused the qmaster to hang after 
printing the messages below.  Running rcsge stop/start brought it back to the 
same stuck state (no jobs dispatching, and qstat would not run):



Wed Jun  9 11:38:20 2004|qmaster|qmaster|I|starting up 5.3p6 (sgeee)
Wed Jun  9 11:38:25 2004|qmaster|qmaster|E|writing job finish information: can't locate queue "<unknown queue>"
Wed Jun  9 11:38:25 2004|qmaster|qmaster|W|job 1127244.1 failed on host <unknown host>  before writing exit_status because: shepherd exited with exit status 19
Wed Jun  9 11:38:25 2004|qmaster|qmaster|C|!!!!!!!!!! lGetUlong(): got NULL element for QU_rerun !!!!!!!!!!
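
For anyone scanning a qmaster messages file for the same symptom, here is a minimal Python sketch. It assumes the five pipe-separated fields seen in the lines above (timestamp, daemon, host, level, message) and that level "C" marks a critical entry; both are read off this excerpt, not taken from documentation. The names LogEntry, parse_messages and critical_entries are made up for illustration. Lines without the five fields are treated as wrapped continuations of the previous message.

```python
from collections import namedtuple

# Field layout inferred from the quoted log lines (an assumption):
# timestamp|daemon|host|level|message
LogEntry = namedtuple("LogEntry", "timestamp daemon host level message")


def parse_messages(lines):
    """Parse qmaster 'messages' lines, folding wrapped continuation lines."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        parts = line.split("|", 4)
        if len(parts) == 5:
            entries.append(LogEntry(*parts))
        elif entries and line.strip():
            # No field separators: treat as the wrapped tail of the last message.
            last = entries[-1]
            joined = last.message.rstrip() + " " + line.strip()
            entries[-1] = last._replace(message=joined)
    return entries


def critical_entries(entries):
    """Return entries whose level field is 'C' (assumed to mean critical)."""
    return [e for e in entries if e.level == "C"]


# Sample taken from the log excerpt in this mail, including the mail wrapping.
sample = """\
Wed Jun  9 11:38:20 2004|qmaster|qmaster|I|starting up 5.3p6 (sgeee)
Wed Jun  9 11:38:25 2004|qmaster|qmaster|E|writing job finish information:
can't locate queue "<unknown queue>"
Wed Jun  9 11:38:25 2004|qmaster|qmaster|W|job 1127244.1 failed on host
<unknown host>  before writing exit_status because: shepherd exited with exit
status 19
Wed Jun  9 11:38:25 2004|qmaster|qmaster|C|!!!!!!!!!! lGetUlong(): got NULL
element for QU_rerun !!!!!!!!!!
""".splitlines()

entries = parse_messages(sample)
critical = critical_entries(entries)
for e in critical:
    print(e.timestamp, e.message)
```

Splitting with a limit of four keeps any "|" inside the message text intact.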



I fixed this by manually deleting the jobs and job_scripts files for job 1127244.

When that job failed earlier, the mail I got was:



Job 1127244 caused action: none
  User        = <unknown>
  Queue       = <unknown>
  Host        = <unknown>
  Start Time  = <unknown>
  End Time    = <unknown>
failed before writing exit_status:shepherd exited with exit status 19
Shepherd trace:
06/09/2004 10:57:45 [55:4258]: shepherd called with uid = 0, euid = 55
06/09/2004 10:57:45 [55:4258]: starting up 5.3p6
06/09/2004 10:57:45 [55:4258]: setpgid(4258, 4258) returned 0
06/09/2004 10:57:45 [55:4258]: no prolog script to start
06/09/2004 10:57:45 [55:4258]: forked "job" with pid 4259
06/09/2004 10:57:45 [55:4259]: pid=4259 pgrp=4259 sid=4259 old pgrp=4258 getlogin()=<no login set>
06/09/2004 10:57:45 [55:4259]: setosjobid: uid = 0, euid = 55
06/09/2004 10:57:45 [55:4258]: child: job - pid: 4259
06/09/2004 10:57:45 [55:4259]: RLIMIT_CPU setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
06/09/2004 10:57:45 [55:4259]: RLIMIT_FSIZE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
06/09/2004 10:57:45 [55:4259]: RLIMIT_DATA setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
06/09/2004 10:57:45 [55:4259]: RLIMIT_STACK setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
06/09/2004 10:57:45 [55:4259]: RLIMIT_CORE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
06/09/2004 10:57:45 [55:4259]: RLIMIT_VMEM setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
06/09/2004 10:57:45 [18910:4259]: closing all filedescriptors
06/09/2004 10:57:45 [18910:4259]: further messages are in "error" and "trace"
06/09/2004 10:57:45 [18910:4259]: using stdout as stderr
06/09/2004 10:57:45 [18910:4259]: execvp(/bin/sh, sh /usr/local/tools/sge/v5.3p6/default/spool/flood/job_scripts/1126942)

Shepherd pe_hostfile:
add_grp_id=20342
stdout_path=/proj/lemans/p
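
Going back to the shepherd trace in the mail above: a rough Python sketch for pulling the job script number out of the execvp line, so it can be compared with the job ID the mail reports. The "[number:pid]" prefix layout and the job_scripts/<number> path pattern are assumptions read off this one trace, and parse_trace and script_job_id are hypothetical helper names. In this mail the trace ends in job_scripts/1126942 while the mail is for job 1127244; the sketch only surfaces that difference and says nothing about its cause.

```python
import re

# Prefix layout inferred from this trace (an assumption):
# "MM/DD/YYYY HH:MM:SS [number:pid]: message"
TRACE_RE = re.compile(r"^(\S+ \S+) \[(\d+):(\d+)\]: (.*)$")


def parse_trace(lines):
    """Parse shepherd trace lines, folding wrapped continuation lines."""
    records = []
    for raw in lines:
        line = raw.rstrip()
        m = TRACE_RE.match(line)
        if m:
            ts, first, pid, msg = m.groups()
            records.append({"time": ts, "id": int(first), "pid": int(pid), "msg": msg})
        elif records and line.strip():
            # No timestamp prefix: treat as the wrapped tail of the last message.
            records[-1]["msg"] = records[-1]["msg"].rstrip() + " " + line.strip()
    return records


def script_job_id(records):
    """Return the number after 'job_scripts/' in the execvp record, or None."""
    for rec in records:
        if rec["msg"].startswith("execvp("):
            m = re.search(r"job_scripts/(\d+)", rec["msg"])
            if m:
                return int(m.group(1))
    return None


# Lines taken from the trace in this mail, including the mail wrapping.
sample_trace = """\
06/09/2004 10:57:45 [55:4258]: forked "job" with pid 4259
06/09/2004 10:57:45 [18910:4259]: using stdout as stderr
06/09/2004 10:57:45 [18910:4259]: execvp(/bin/sh, sh
/usr/local/tools/sge/v5.3p6/default/spool/flood/job_scripts/1126942)
""".splitlines()

reported_job_id = 1127244  # job ID named in the failure mail
records = parse_trace(sample_trace)
script_id = script_job_id(records)
if script_id is not None and script_id != reported_job_id:
    print("trace runs script", script_id, "but mail reports job", reported_job_id)
```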
