Opened 13 years ago
Last modified 10 years ago
#518 new defect
IZ2576: repeated queue instance error reasons accumulate in qstat -j <jobid> output
Reported by: | andreas | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.1u3 |
Severity: | | Keywords: | qmaster |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2576]
Issue #: 2576
Platform: All
Reporter: andreas (andreas)
Component: gridengine
OS: All
Subcomponent: qmaster
Version: 6.1u3
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: ernst (ernst)
QA Contact: ernst
URL:
* Summary: repeated queue instance error reasons accumulate in qstat -j <jobid> output
Status whiteboard:
Attachments:
Issue 2576 blocks:
Votes for issue 2576:
Opened: Mon May 19 04:48:00 -0700 2008
------------------------

When a job repeatedly cannot be started due to a queue misconfiguration (e.g. a wrong prolog in queue_conf(5)), the error reasons accumulate in the qstat -j <jobid> output. This is doubly strange because (a) there is no reason to store per queue instance error reasons in a per job data structure, and (b) if this is done, these messages must not accumulate.

(1) I got the errors below with a prolog script that could not be executed and keep_alive=true in sge_conf(5).

(2) After the job failed due to the wrong prolog, I ran qmod -c a couple of times without getting the job through.

(3) Having removed the broken prolog configuration, I ran qmod -c another couple of times without getting the job through. Because of keep_alive=true this resulted in

    05/19/2008 13:30:40|execd|es-ergb01-01|E|can't start job "1075009": can't create directory active_jobs/1075009.1: File exists

(4) Having removed the keep_alive setting, I ran qmod -c again a couple of times.

(5) When the job finally ran, qstat -j 1075009 gave me the output below, with six job error reasons!
    > qstat -j 1075009
    ==============================================================
    job_number:                 1075009
    exec_file:                  job_scripts/1075009
    submission_time:            Mon May 19 13:13:32 2008
    owner:                      ah114088
    uid:                        115088
    group:                      staff
    gid:                        10
    sge_o_home:                 /home/ah114088
    sge_o_log_name:             ah114088
    sge_o_path:                 /gridware/InhouseSystems/sge61u3/bin/sol-sparc64:/home/ah114088/bin:/usr/dt/bin:/usr/openwin/bin:/usr/ccs/bin:/vol2/tools/SW/j2sdk1.4.2/solaris64/bin:/vol2/tools/SW/bin:/vol2/tools/SW/solaris64/bin:/vol2/tools/common/solaris64/bin:/usr/local/bin:/sbin/:/usr/sbin:/opt/SUNWspro/bin:/usr/dist/exe:/usr/dist/local/exe:/sbin:/usr/bin:/usr/ucb:/usr/sfw/bin:/usr/lib/lp/postscript:.
    sge_o_shell:                /bin/tcsh
    sge_o_tz:                   MET
    sge_o_workdir:              /home/ah114088
    sge_o_host:                 es-ergb01-01
    account:                    sge
    mail_list:                  ah114088@es-ergb01-01
    notify:                     FALSE
    job_name:                   sleep
    jobshare:                   0
    hard_queue_list:            *@es-ergb01-01
    env_list:
    job_args:                   300
    script_file:                /bin/sleep
    usage    1:                 cpu=00:00:00, mem=0.00011 GBs, io=0.00000, vmem=2.984M, maxvmem=2.984M
    error reason    1:          05/19/2008 13:13:34 [150:1239]: exit_status of prolog = 1
                    1:          can't create directory active_jobs/1075009.1: File exists
                    1:          can't create directory active_jobs/1075009.1: File exists
                    1:          can't create directory active_jobs/1075009.1: File exists
                    1:          can't create directory active_jobs/1075009.1: File exists
                    1:          can't create directory active_jobs/1075009.1: File exists
    scheduling info:            queue instance "tight.q@angbor" dropped because it is temporarily not available
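To make the accumulation concrete, here is a minimal Python sketch that collapses the repeated error reasons from the output above into one entry per distinct message, with a count. It only illustrates what a non-accumulating listing could look like on the reader's side; it is not the qmaster fix, and the function name `collapse_reasons` is a hypothetical helper, not part of Grid Engine.

```python
from collections import Counter


def collapse_reasons(reasons):
    """Return (reason, count) pairs, one per distinct reason, in first-seen order.

    Hypothetical helper for illustration only; not a Grid Engine API.
    """
    # Counter is a dict subclass, so it keeps first-insertion order (Python 3.7+).
    return list(Counter(reasons).items())


# The six error reasons reported for task 1 of job 1075009 in this ticket.
reasons = ["05/19/2008 13:13:34 [150:1239]: exit_status of prolog = 1"] + [
    "can't create directory active_jobs/1075009.1: File exists"
] * 5

collapsed = collapse_reasons(reasons)
for reason, count in collapsed:
    print(f"{count}x {reason}")
```

With the data from this ticket, the six stored reasons reduce to two distinct messages, which suggests that most of the list is redundant.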