[GE users] problem: finished jobs still listed as running

Daniel Templeton Dan.Templeton at Sun.COM
Thu Nov 15 15:54:02 GMT 2007



Thomas,

So, the reason the jobs are still listed as running is that the 
execution daemon has gone AWOL.  You can see that from the "au" state 
in qstat.  When an execution daemon goes missing, we leave its jobs in 
the last reported state until the execution daemon comes back and 
gives a new status update.  (You can also configure jobs to be 
restarted after a grace period once an execd goes missing.)
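
For what it's worth (a sketch, untested here, and assuming you can 
edit the host configuration as an admin user), that grace period is 
the "reschedule_unknown" parameter in the host or global 
configuration, e.g.:

$ qconf -mconf b3.mycluster.ch
...
reschedule_unknown           00:15:00

With that setting, jobs on b3 would be rescheduled 15 minutes after 
its execd stops reporting; the default of 00:00:00 disables 
rescheduling.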

As for why your execd is still alive, still connected, and yet 
unresponsive, I don't have much to say.  The execd is one big 
processing loop, so it's possible that it ran into a problem and got 
stuck.  Have you tried restarting it?  Restarting it will have no 
effect on your running jobs.
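
If you want to try that, something like the following should work 
(assuming a default install with the "default" cell; 19689 is the 
execd pid from your pstree output):

# from an admin host, ask the execd on b3 to shut down
$ qconf -ke b3.mycluster.ch

# on b3 itself, if the daemon is too stuck to obey, kill it by hand
# and restart it
[b3]# kill 19689
[b3]# $SGE_ROOT/default/common/sgeexecd start

The restarted execd picks running jobs back up from its spool 
directory, so the shepherds for those jobs are left alone.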

Daniel

Thomas Junier wrote:
>
> Hi folks,
>
> I seem to have the following problem with my grid engine: some jobs
> are listed as running (according to qstat) while they are in fact
> already finished.
>
> If I list the jobs, I get this:
>
> $ qstat
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>  157632 0.55500 STDIN      user         r     11/14/2007 21:14:47 all.q@b1.mycluster.ch              1
>  157655 0.55500 STDIN      user         r     11/14/2007 23:38:32 all.q@b1.mycluster.ch              1
>  157641 0.55500 STDIN      user         r     11/14/2007 22:07:02 all.q@b2.mycluster.ch              1
>  157647 0.55500 STDIN      user         r     11/14/2007 22:31:17 all.q@b2.mycluster.ch              1
>  157618 0.55500 STDIN      user         r     11/14/2007 19:40:17 all.q@b3.mycluster.ch              1
>  157657 0.55500 STDIN      user         r     11/14/2007 23:45:47 all.q@b3.mycluster.ch              1
>  157721 0.55500 STDIN      user         r     11/15/2007 08:17:47 all.q@b4.mycluster.ch              1
>  157723 0.55500 STDIN      user         r     11/15/2007 08:38:02 all.q@b4.mycluster.ch              1
>  157731 0.55500 STDIN      user         r     11/15/2007 09:39:47 all.q@b5.mycluster.ch              1
>  157736 0.55500 STDIN      user         r     11/15/2007 10:08:02 all.q@b5.mycluster.ch              1
>  157728 0.55500 STDIN      user         r     11/15/2007 09:01:47 all.q@b6.mycluster.ch              1
>  157738 0.55500 STDIN      user         r     11/15/2007 10:17:02 all.q@b6.mycluster.ch              1
>  157648 0.55500 STDIN      user         r     11/14/2007 22:36:17 all.q@b7.mycluster.ch              1
>  157658 0.55500 STDIN      user         r     11/14/2007 23:45:47 all.q@b7.mycluster.ch              1
>  157720 0.55500 STDIN      user         r     11/15/2007 08:09:17 all.q@b8.mycluster.ch              1
>  157725 0.55500 STDIN      user         r     11/15/2007 08:48:32 all.q@b8.mycluster.ch              1
>
> This looks fine, but some jobs are in fact already done.  For
> instance, the sge_execd on b3 has no child processes at the moment:
>
> [b3]$ pstree
> ...
>        ├─resmgrd(3708)
>        ├─rpciod(4564)
>        ├─rsync(3729)
>        ├─sge_execd(19689)
>        ├─slpd(3759)
> ...
>
> And I checked that the jobs did complete, as opposed to just hanging.
>
> Also, although all these jobs appear to be running (according to
> qstat), some of the execution hosts seem to have problems:
>
> $ qhost
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> b1.mycluster.ch  lx24-amd64      2     -    3.8G       -    1.0G       -
> b2.mycluster.ch  lx24-amd64      2     -    3.8G       -  980.5M       -
> b3.mycluster.ch  lx24-amd64      2     -    3.8G       -  980.5M       -
> b4.mycluster.ch  lx24-amd64      2  2.30    3.8G  159.6M  980.5M     0.0
> b5.mycluster.ch  lx24-amd64      2  2.21    3.8G  164.7M  980.5M     0.0
> b6.mycluster.ch  lx24-amd64      2  2.31    3.8G  167.3M  980.5M     0.0
> b7.mycluster.ch  lx24-amd64      2     -    3.8G       -  980.5M       -
> b8.mycluster.ch  lx24-amd64      2  2.29    3.8G  158.7M  980.5M     0.0
>
> b1, b2, b3 and b7 do not respond, apparently.
>
> So I'm a bit confused: if the job is done, why does qstat still list
> it?  Is this the cause of b3 not accepting jobs, or its consequence
> (or is the cause something else altogether)?
>
> I tried to see what was wrong with the queue:
>
> $ qstat -explain E
> ...
> all.q@b3.mycluster.ch   BIP   2/2       -NA-     lx24-amd64    au
> 157618 0.55500 STDIN      user       r     11/14/2007 19:40:17     1
> 157657 0.55500 STDIN      user       r     11/14/2007 23:45:47     1
> ...
>
> Hm, apparently the sge_execd on b3 can't be contacted.  So I checked
> with netstat that sge_execd on b3 is still listening for connections
> (it is).  It also has an established connection to the submit host
> (not sure why, since it's not doing anything).
>
> I tried qstat -j <job> -explain E on one of the problem jobs, but
> didn't find anything obvious (see output at the end of this post).
>
> I found nothing in $SGE_ROOT/default/spool/b3/messages pertaining to
> this problem.  I also looked in
> http://gridengine.sunsource.net/howto/commonproblems.html - no luck.
>
> I'm using n1ge6_0u8.
>
> Any help much appreciated.
>
>
> Thanks in advance,
>
>
> Thomas
>
>
>
>
>
> Output of qstat -explain E
> --------------------------
>
> $ qstat -j 157618 -explain E
> ==============================================================
> job_number:                 157618
> exec_file:                  job_scripts/157618
> submission_time:            Wed Nov 14 14:26:33 2007
> owner:                      user
> uid:                        1010
> group:                      users
> gid:                        100
> sge_o_home:                 /home/user
> sge_o_log_name:             user
> sge_o_path:                 /usr/local/sge-root/bin/lx24-amd64:/home/user/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:.:/usr/lib/java/jre/bin
> sge_o_shell:                /bin/bash
> sge_o_workdir:              /home/user
> sge_o_host:                 sge_master
> account:                    sge
> stderr_path_list:           /home/user/sge2
> mail_list:                  user@mydomain.ch
> notify:                     FALSE
> job_name:                   STDIN
> stdout_path_list:           /home/user/sge2
> jobshare:                   0
> env_list:
> script_file:                STDIN
> usage    1:                 cpu=02:10:51, mem=115.75412 GBs, io=0.00000, vmem=98.219M, maxvmem=182.086M
> scheduling info:            queue instance "all.q@b2.mycluster.ch" dropped because it is temporarily not available
>                             queue instance "all.q@b7.mycluster.ch" dropped because it is temporarily not available
>                             queue instance "all.q@b3.mycluster.ch" dropped because it is temporarily not available
>                             queue instance "all.q@b1.mycluster.ch" dropped because it is temporarily not available
>                             queue instance "all.q@b8.mycluster.ch" dropped because it is full
>                             queue instance "all.q@b9.mycluster.ch" dropped because it is full
>                             queue instance "all.q@b10.mycluster.ch" dropped because it is full
>                             queue instance "all.q@b14.mycluster.ch" dropped because it is full
>                             queue instance "all.q@b12.mycluster.ch" dropped because it is full
>                             queue instance "all.q@b5.mycluster.ch" dropped because it is full
>                             queue instance "all.q@b13.mycluster.ch" dropped because it is full
>                             queue instance "all.q@b6.mycluster.ch" dropped because it is full
>                             queue instance "all.q@b15.mycluster.ch" dropped because it is full
>                             queue instance "all.q@b11.mycluster.ch" dropped because it is full
>                             queue instance "all.q@b4.mycluster.ch" dropped because it is full
>                             All queues dropped because of overload or full
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



