[GE users] problem: finished jobs still listed as running

Thomas Junier Thomas.Junier at medecine.unige.ch
Thu Nov 15 13:08:32 GMT 2007




Hi folks,

I seem to have the following problem with my grid engine: some jobs are 
listed
as running (according to qstat) while they are in fact already finished.

If I list the jobs, I get this:

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 157632 0.55500 STDIN      user       r     11/14/2007 21:14:47 all.q@b1.mycluster.ch          1
 157655 0.55500 STDIN      user       r     11/14/2007 23:38:32 all.q@b1.mycluster.ch          1
 157641 0.55500 STDIN      user       r     11/14/2007 22:07:02 all.q@b2.mycluster.ch          1
 157647 0.55500 STDIN      user       r     11/14/2007 22:31:17 all.q@b2.mycluster.ch          1
 157618 0.55500 STDIN      user       r     11/14/2007 19:40:17 all.q@b3.mycluster.ch          1
 157657 0.55500 STDIN      user       r     11/14/2007 23:45:47 all.q@b3.mycluster.ch          1
 157721 0.55500 STDIN      user       r     11/15/2007 08:17:47 all.q@b4.mycluster.ch          1
 157723 0.55500 STDIN      user       r     11/15/2007 08:38:02 all.q@b4.mycluster.ch          1
 157731 0.55500 STDIN      user       r     11/15/2007 09:39:47 all.q@b5.mycluster.ch          1
 157736 0.55500 STDIN      user       r     11/15/2007 10:08:02 all.q@b5.mycluster.ch          1
 157728 0.55500 STDIN      user       r     11/15/2007 09:01:47 all.q@b6.mycluster.ch          1
 157738 0.55500 STDIN      user       r     11/15/2007 10:17:02 all.q@b6.mycluster.ch          1
 157648 0.55500 STDIN      user       r     11/14/2007 22:36:17 all.q@b7.mycluster.ch          1
 157658 0.55500 STDIN      user       r     11/14/2007 23:45:47 all.q@b7.mycluster.ch          1
 157720 0.55500 STDIN      user       r     11/15/2007 08:09:17 all.q@b8.mycluster.ch          1
 157725 0.55500 STDIN      user       r     11/15/2007 08:48:32 all.q@b8.mycluster.ch          1

This looks fine, but some jobs are in fact already done.  For instance the
sge_execd on b3 does not have any child processes at the moment:

[b3]$ pstree
...
        ├─resmgrd(3708)
        ├─rpciod(4564)
        ├─rsync(3729)
        ├─sge_execd(19689)
        ├─slpd(3759)
...

And I checked that the jobs did complete as opposed to just hang.
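For the record, this is roughly how I confirmed that the execd has no
children (the PID is the sge_execd one from the pstree above; the flags are
the usual Linux/procps ones):

```shell
# List any child processes of sge_execd (PID 19689 from the pstree above).
# A still-running job would show up here as an sge_shepherd child.
ps --ppid 19689 -o pid,comm
```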

Also, although all these jobs appear to be running (according to qstat), 
some
of the execution hosts seem to have problems:

$ qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
b1.mycluster.ch  lx24-amd64      2     -    3.8G       -    1.0G       -
b2.mycluster.ch  lx24-amd64      2     -    3.8G       -  980.5M       -
b3.mycluster.ch  lx24-amd64      2     -    3.8G       -  980.5M       -
b4.mycluster.ch  lx24-amd64      2  2.30    3.8G  159.6M  980.5M     0.0
b5.mycluster.ch  lx24-amd64      2  2.21    3.8G  164.7M  980.5M     0.0
b6.mycluster.ch  lx24-amd64      2  2.31    3.8G  167.3M  980.5M     0.0
b7.mycluster.ch  lx24-amd64      2     -    3.8G       -  980.5M       -
b8.mycluster.ch  lx24-amd64      2  2.29    3.8G  158.7M  980.5M     0.0

b1, b2, b3 and b7 do not respond, apparently.
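One check I haven't run yet: qping, which ships with N1GE 6, should say
whether the execd commlib endpoint on those hosts answers at all. Something
like this, assuming the default execd port 6445 (adjust if the cluster uses
a different $SGE_EXECD_PORT):

```shell
# Probe the commlib endpoint of the execd on b3; the arguments are
# host, port, component name and component id.
qping b3.mycluster.ch 6445 execd 1
```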

So I'm a bit confused - if the job is done, why does qstat still list it? Is
this the cause of b3 not accepting jobs, or its consequence (or is the cause
something else altogether)?

I tried to see what was wrong with the queue:

$ qstat -explain E
...
all.q@b3.mycluster.ch   BIP   2/2       -NA-     lx24-amd64    au
 157618 0.55500 STDIN      user       r     11/14/2007 19:40:17     1
 157657 0.55500 STDIN      user       r     11/14/2007 23:45:47     1
...

Hm, apparently the sge_execd on b3 can't be contacted. So I checked with 
netstat
that sge_execd on b3 is still listening for connections (it is).
It also has an established connection to the submit host (not sure why since
it's not doing anything).
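In case it matters, these are (roughly) the netstat invocations I used on b3
(standard net-tools flags, nothing SGE-specific; -p needs root to show
processes owned by other users):

```shell
# Listening sockets owned by sge_execd
# (-t TCP, -l listening, -n numeric, -p owning process):
netstat -tlnp | grep sge_execd

# Established connections, e.g. the one back to the submit host:
netstat -tnp | grep ESTABLISHED | grep sge_execd
```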

I tried qstat -j <job> -explain E on one of the problem jobs, but didn't find
anything obvious (see output at the end of this post).

I found nothing in $SGE_ROOT/default/spool/b3/messages pertaining to 
this problem.
I looked in http://gridengine.sunsource.net/howto/commonproblems.html - 
no luck.

I'm using n1ge6_0u8.

Any help much appreciated.


Thanks in advance,


Thomas





Output of qstat -j 157618 -explain E
------------------------------------

$ qstat -j 157618 -explain E
==============================================================
job_number:                 157618
exec_file:                  job_scripts/157618
submission_time:            Wed Nov 14 14:26:33 2007
owner:                      user
uid:                        1010
group:                      users
gid:                        100
sge_o_home:                 /home/user
sge_o_log_name:             user
sge_o_path:                 
/usr/local/sge-root/bin/lx24-amd64:/home/user/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:.:/usr/lib/java/jre/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/user
sge_o_host:                 sge_master
account:                    sge
stderr_path_list:           /home/user/sge2
mail_list:                  user at mydomain.ch
notify:                     FALSE
job_name:                   STDIN
stdout_path_list:           /home/user/sge2
jobshare:                   0
env_list:
script_file:                STDIN
usage    1:                 cpu=02:10:51, mem=115.75412 GBs, io=0.00000, 
vmem=98.219M, maxvmem=182.086M
scheduling info:            queue instance "all.q@b2.mycluster.ch" dropped because it is temporarily not available
                            queue instance "all.q@b7.mycluster.ch" dropped because it is temporarily not available
                            queue instance "all.q@b3.mycluster.ch" dropped because it is temporarily not available
                            queue instance "all.q@b1.mycluster.ch" dropped because it is temporarily not available
                            queue instance "all.q@b8.mycluster.ch" dropped because it is full
                            queue instance "all.q@b9.mycluster.ch" dropped because it is full
                            queue instance "all.q@b10.mycluster.ch" dropped because it is full
                            queue instance "all.q@b14.mycluster.ch" dropped because it is full
                            queue instance "all.q@b12.mycluster.ch" dropped because it is full
                            queue instance "all.q@b5.mycluster.ch" dropped because it is full
                            queue instance "all.q@b13.mycluster.ch" dropped because it is full
                            queue instance "all.q@b6.mycluster.ch" dropped because it is full
                            queue instance "all.q@b15.mycluster.ch" dropped because it is full
                            queue instance "all.q@b11.mycluster.ch" dropped because it is full
                            queue instance "all.q@b4.mycluster.ch" dropped because it is full
                            All queues dropped because of overload or full
