Opened 11 years ago

Last modified 9 years ago

#518 new defect

IZ2576: repeated queue instance error reasons accumulate in qstat -j <jobid> output

Reported by: andreas Owned by:
Priority: normal Milestone:
Component: sge Version: 6.1u3
Severity: Keywords: qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2576]

        Issue #:      2576             Platform:     All      Reporter: andreas (andreas)
       Component:     gridengine          OS:        All
     Subcomponent:    qmaster          Version:      6.1u3       CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    ernst (ernst)
      QA Contact:     ernst
          URL:
       * Summary:     repeated queue instance error reasons accumulate in qstat -j <jobid> output
   Status whiteboard:
      Attachments:

     Issue 2576 blocks:
   Votes for issue 2576:


   Opened: Mon May 19 04:48:00 -0700 2008 
------------------------


When a job repeatedly cannot be started due to queue misconfiguration (e.g. a wrong
prolog in queue_conf(5)), the error reasons accumulate in qstat -j <jobid>
output. This is doubly strange because (a) there is no reason to store
per queue instance error reasons in a per job data structure and (b) if this is
done, these messages should not accumulate.
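
Point (b) can be illustrated with a minimal sketch. This is not the qmaster code (which is written in C); the function name record_error_reason and the plain Python list are hypothetical and only show the expected behaviour, where an identical reason is stored once rather than appended on every failed start attempt.

```python
def record_error_reason(reasons, message):
    """Append message to reasons unless an identical message is already stored.

    Hypothetical helper for illustration only; it does not mirror any
    actual Grid Engine data structure.
    """
    if message not in reasons:
        reasons.append(message)
    return reasons


reasons = []
record_error_reason(reasons, "exit_status of prolog = 1")
# Five failed start attempts that all report the same reason.
for _ in range(5):
    record_error_reason(
        reasons, "can't create directory active_jobs/1075009.1: File exists"
    )

for reason in reasons:
    print(reason)
```

With this behaviour the scenario described in this report would leave two distinct reasons instead of six entries.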

(1) I got the errors below with a prolog script that could not be executed and
keep_alive=true in sge_conf(5).
(2) After the job failed due to the wrong prolog, I ran qmod -c a couple of times
without getting the job through.
(3) Having removed the broken prolog configuration, I ran qmod -c another couple
of times without getting the job through. Because of keep_alive=true, this resulted in

   05/19/2008 13:30:40|execd|es-ergb01-01|E|can't start job "1075009": can't create directory active_jobs/1075009.1: File exists
(4) Having removed keep_alive, I ran qmod -c again a couple of times.
(5) When the job finally ran, qstat -j 1075009 gave me the output below with six
job error reasons!


> qstat -j 1075009
==============================================================
job_number:                 1075009
exec_file:                  job_scripts/1075009
submission_time:            Mon May 19 13:13:32 2008
owner:                      ah114088
uid:                        115088
group:                      staff
gid:                        10
sge_o_home:                 /home/ah114088
sge_o_log_name:             ah114088
sge_o_path:                 /gridware/InhouseSystems/sge61u3/bin/sol-sparc64:/home/ah114088/bin:/usr/dt/bin:/usr/openwin/bin:/usr/ccs/bin:/vol2/tools/SW/j2sdk1.4.2/solaris64/bin:/vol2/tools/SW/bin:/vol2/tools/SW/solaris64/bin:/vol2/tools/common/solaris64/bin:/usr/local/bin:/sbin/:/usr/sbin:/opt/SUNWspro/bin:/usr/dist/exe:/usr/dist/local/exe:/sbin:/usr/bin:/usr/ucb:/usr/sfw/bin:/usr/lib/lp/postscript:.
sge_o_shell:                /bin/tcsh
sge_o_tz:                   MET
sge_o_workdir:              /home/ah114088
sge_o_host:                 es-ergb01-01
account:                    sge
mail_list:                  ah114088@es-ergb01-01
notify:                     FALSE
job_name:                   sleep
jobshare:                   0
hard_queue_list:            *@es-ergb01-01
env_list:
job_args:                   300
script_file:                /bin/sleep
usage    1:                 cpu=00:00:00, mem=0.00011 GBs, io=0.00000, vmem=2.984M, maxvmem=2.984M
error reason    1:          05/19/2008 13:13:34 [150:1239]: exit_status of prolog = 1
                1:          can't create directory active_jobs/1075009.1: File exists
                1:          can't create directory active_jobs/1075009.1: File exists
                1:          can't create directory active_jobs/1075009.1: File exists
                1:          can't create directory active_jobs/1075009.1: File exists
                1:          can't create directory active_jobs/1075009.1: File exists
scheduling info:            queue instance "tight.q@angbor" dropped because it is temporarily not available
                                     :
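
To check whether a given job is affected, the repeated entries can be counted from saved qstat -j text. The following is a rough sketch only: the function name count_error_reasons is made up for this example, and it assumes each error reason sits on a single line (any line-wrapped entries have to be joined first) and that the block ends at the next line starting in column 0.

```python
import re
from collections import Counter

# Matches both the first entry ("error reason    1:  ...") and the
# following entries that carry only the task number ("    1:  ...").
ENTRY = re.compile(r"^(?:error reason)?\s+(\d+):\s+(.*)$")


def count_error_reasons(qstat_text):
    """Count identical error reason messages in qstat -j output text."""
    counts = Counter()
    in_block = False
    for line in qstat_text.splitlines():
        if line.startswith("error reason"):
            in_block = True
        elif in_block and not line[:1].isspace():
            # A new field name in column 0 (or an empty line) ends the block.
            in_block = False
        if in_block:
            match = ENTRY.match(line)
            if match:
                counts[match.group(2).strip()] += 1
    return counts


# Shortened sample modelled on the output in this report.
sample = """\
job_number:                 1075009
error reason    1:          05/19/2008 13:13:34 [150:1239]: exit_status of prolog = 1
                1:          can't create directory active_jobs/1075009.1: File exists
                1:          can't create directory active_jobs/1075009.1: File exists
scheduling info:            queue instance "tight.q@angbor" dropped because it is temporarily not available
"""

counts = count_error_reasons(sample)
for message, n in counts.items():
    if n > 1:
        print(f"{n}x {message}")
```

Any message printed with a count above one indicates the accumulation described in this issue.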

Change History (0)
