[GE users] problem: finished jobs still listed as running

Reuti reuti at staff.uni-marburg.de
Thu Nov 15 16:22:06 GMT 2007



Hi,

On 15.11.2007 at 14:08, Thomas Junier wrote:

> I seem to have the following problem with my grid engine: some jobs are
> listed as running (according to qstat) while they are in fact already
> finished.
>
> If I list the jobs, I get this:
>
> $ qstat
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
> 157632 0.55500 STDIN      user       r     11/14/2007 21:14:47 all.q@b1.mycluster.ch       1
> 157655 0.55500 STDIN      user       r     11/14/2007 23:38:32 all.q@b1.mycluster.ch       1
> 157641 0.55500 STDIN      user       r     11/14/2007 22:07:02 all.q@b2.mycluster.ch       1
> 157647 0.55500 STDIN      user       r     11/14/2007 22:31:17 all.q@b2.mycluster.ch       1
> 157618 0.55500 STDIN      user       r     11/14/2007 19:40:17 all.q@b3.mycluster.ch       1
> 157657 0.55500 STDIN      user       r     11/14/2007 23:45:47 all.q@b3.mycluster.ch       1
> 157721 0.55500 STDIN      user       r     11/15/2007 08:17:47 all.q@b4.mycluster.ch       1
> 157723 0.55500 STDIN      user       r     11/15/2007 08:38:02 all.q@b4.mycluster.ch       1
> 157731 0.55500 STDIN      user       r     11/15/2007 09:39:47 all.q@b5.mycluster.ch       1
> 157736 0.55500 STDIN      user       r     11/15/2007 10:08:02 all.q@b5.mycluster.ch       1
> 157728 0.55500 STDIN      user       r     11/15/2007 09:01:47 all.q@b6.mycluster.ch       1
> 157738 0.55500 STDIN      user       r     11/15/2007 10:17:02 all.q@b6.mycluster.ch       1
> 157648 0.55500 STDIN      user       r     11/14/2007 22:36:17 all.q@b7.mycluster.ch       1
> 157658 0.55500 STDIN      user       r     11/14/2007 23:45:47 all.q@b7.mycluster.ch       1
> 157720 0.55500 STDIN      user       r     11/15/2007 08:09:17 all.q@b8.mycluster.ch       1
> 157725 0.55500 STDIN      user       r     11/15/2007 08:48:32 all.q@b8.mycluster.ch       1
>
> This looks fine, but some jobs are in fact already done.  For instance,
> the sge_execd on b3 does not have any child processes at the moment:
>
> [b3]$ pstree
> ...
>        ├─resmgrd(3708)
>        ├─rpciod(4564)
>        ├─rsync(3729)
>        ├─sge_execd(19689)
>        ├─slpd(3759)
> ...
>
> And I checked that the jobs did complete as opposed to just hang.
>
> Also, although all these jobs appear to be running (according to qstat),
> some of the execution hosts seem to have problems:
>
> $ qhost
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> b1.mycluster.ch         lx24-amd64      2     -    3.8G       -    1.0G       -
> b2.mycluster.ch         lx24-amd64      2     -    3.8G       -  980.5M       -
> b3.mycluster.ch         lx24-amd64      2     -    3.8G       -  980.5M       -
> b4.mycluster.ch         lx24-amd64      2  2.30    3.8G  159.6M  980.5M     0.0
> b5.mycluster.ch         lx24-amd64      2  2.21    3.8G  164.7M  980.5M     0.0
> b6.mycluster.ch         lx24-amd64      2  2.31    3.8G  167.3M  980.5M     0.0
> b7.mycluster.ch         lx24-amd64      2     -    3.8G       -  980.5M       -
> b8.mycluster.ch         lx24-amd64      2  2.29    3.8G  158.7M  980.5M     0.0

So the sge_execd on b3 is no longer running and/or responding. Was a
firewall or something similar switched on? Or did the TCP/IP address
change? SGE will assume a temporary network problem and wait for the node
to appear again. Then it will check the running jobs on the node. If a
job really finished in the meantime, it will be removed from the list of
running jobs.
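To see quickly which nodes are affected and which jobs are still shown on them, something like the following could help. It is only a sketch: the function names are made up for illustration, and it assumes the plain, unwrapped qhost and qstat output, with one host or one job per line and the queue instance in the eighth column.

```shell
# Hypothetical helpers (names are illustrative, not part of SGE).

# Print the hosts whose LOAD column in `qhost` output is "-",
# i.e. hosts for which no load report is available.
# Reads qhost output on stdin; skips header, separator and "global".
silent_hosts() {
    awk '$1 == "HOSTNAME" || $1 ~ /^-+$/ || $1 == "global" { next }
         $4 == "-" { print $1 }'
}

# Print the IDs of jobs whose queue instance (queue@host, assumed to be
# field 8 of the default `qstat` listing) is on the given host.
# Reads qstat output on stdin.
jobs_on_host() {
    awk -v h="$1" '{ n = split($8, q, "@")
                     if (n == 2 && q[2] == h) print $1 }'
}
```

Usage would be along the lines of `qhost | silent_hosts`, followed by `qstat | jobs_on_host b3.mycluster.ch` for each host that is reported.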

> b1, b2, b3 and b7 do not respond, apparently.
>
> So I'm a bit confused - if the job is done, why does qstat still list
> it? Is this the cause of b3 not accepting jobs, or its consequence (or
> is the cause something else altogether)?
>
> I tried to see what was wrong with the queue:
>
> $ qstat -explain E
> ...
> all.q@b3.mycluster.ch   BIP   2/2       -NA-     lx24-amd64    au
> 157618 0.55500 STDIN      user       r     11/14/2007 19:40:17     1
> 157657 0.55500 STDIN      user       r     11/14/2007 23:45:47     1
> ...
>
> Hm, apparently the sge_execd on b3 can't be contacted. So I checked with
> netstat that sge_execd on b3 is still listening for connections (it is).
> It also has an established connection to the submit host (not sure why,
> since it's not doing anything).
>
> I tried qstat -j <job> -explain E on one of the problem jobs, but didn't
> find anything obvious (see output at the end of this post).
>
> I found nothing in $SGE_ROOT/default/spool/b3/messages pertaining to
> this problem.
> I looked in http://gridengine.sunsource.net/howto/commonproblems.html -
> no luck.

Some additional information might be in /tmp of the node.
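As a rough sketch of how to look there: the function below lists candidate log files in a directory, /tmp by default. The file name pattern `execd_messages.*` is an assumption about what the execd leaves behind when it cannot write to its spool directory, and the function name is invented for this example, so adjust both to what is actually present on the node.

```shell
# Hypothetical helper: list files named execd_messages.* in a directory
# (default /tmp). The name pattern is an assumption, not confirmed here.
list_execd_tmp_logs() {
    dir=${1:-/tmp}
    for f in "$dir"/execd_messages.*; do
        # If nothing matches, the pattern stays literal and -f fails.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Run on the node itself, for example as `list_execd_tmp_logs` on b3, and read whatever files it prints.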

-- Reuti



