[GE users] Stale finished jobs

Reuti reuti at staff.uni-marburg.de
Wed Dec 5 08:46:28 GMT 2007


Hi,

Quoting Norbert Crettol <norbert.crettol at idiap.ch>:

> Norbert Crettol wrote:
>> The only problem we have is that some jobs remain in the queue
>> although they have terminated correctly. When I look
>> at the node, nothing is running anymore, and I have
>> to force-delete them to remove them from the queues.
>> This is not tied to a particular node, nor to a type of job. I
>> personally ran many thousands of short jobs, always the same
>> binary, and had an average of about 1 to 2 stale jobs out
>> of a thousand. But some people reported a higher average.
> I have run more tests and have more details:
> - when I get a stale job in a queue, on the node I can see the
>  shepherd (as user sge) running another shepherd (as root)
>  instead of running the job script.
>
> This is what I get on the node:
> beo-02:~# ps auxwf | grep s[g]e
> sge       3247  0.1  0.0 28368 2892 ?        S    Nov30   9:54 ..binpath../sge_execd
> sge       1433  0.0  0.0 27776 2276 ?        S    17:11   0:00  \_ sge_shepherd-341511 -bg
> root      1437  0.0  0.0 27840 1720 ?        Ss   17:11   0:00  |   \_ sge_shepherd-341511 -bg
> sge       2684  0.0  0.0 27780 2280 ?        S    17:43   0:00  \_ sge_shepherd-341951 -bg
> aradilla  2685  0.0  0.0  5692 1200 ?        Ss   17:43   0:00  |   \_ /bin/sh ..spoolpath../job_scripts/341951
> sge       2931  0.0  0.0 27776 2276 ?        S    17:47   0:00  \_ sge_shepherd-342008 -bg
> nc        2932  0.0  0.0  5684 1200 ?        Ss   17:47   0:00  |   \_ /bin/sh ..spoolpath../job_scripts/342008
> sge       2933  0.0  0.0 27784 2280 ?        S    17:47   0:00  \_ sge_shepherd-342442 -bg
> nc        2934  0.0  0.0  5684 1200 ?        Ss   17:47   0:00      \_ /bin/sh ..spoolpath../job_scripts/342442
>
> We can see 4 jobs running here. The first one (341511) is the
> stale one, the others are normal ones. For clarity, I've
> replaced the real (long) paths. The first job should have
> run as the same user as the third and the fourth (nc).
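
The pattern described above (a shepherd whose child is still a shepherd for the same job, rather than the job script) could be spotted mechanically. A minimal sketch, assuming the process table is already available as (pid, ppid, user, command) records; the Proc type, the function name and the PPID values are hypothetical, the PPIDs being inferred from the tree in the listing:

```python
import re
from collections import namedtuple

# Hypothetical record type; the fields mirror what ps reports.
Proc = namedtuple("Proc", "pid ppid user cmd")

SHEPHERD_RE = re.compile(r"sge_shepherd-(\d+)")

def find_stuck_shepherds(procs):
    """Return job IDs whose shepherd has a child that is still a shepherd
    for the same job, i.e. the child never became the job script."""
    by_pid = {p.pid: p for p in procs}
    stuck = []
    for p in procs:
        child_match = SHEPHERD_RE.search(p.cmd)
        parent = by_pid.get(p.ppid)
        if child_match is None or parent is None:
            continue
        parent_match = SHEPHERD_RE.search(parent.cmd)
        if parent_match and parent_match.group(1) == child_match.group(1):
            stuck.append(int(child_match.group(1)))
    return stuck

# Process table mirroring the listing above. PPIDs are assumptions read off
# the tree; the elided paths are kept exactly as they were posted.
procs = [
    Proc(3247, 1, "sge", "..binpath../sge_execd"),
    Proc(1433, 3247, "sge", "sge_shepherd-341511 -bg"),
    Proc(1437, 1433, "root", "sge_shepherd-341511 -bg"),
    Proc(2684, 3247, "sge", "sge_shepherd-341951 -bg"),
    Proc(2685, 2684, "aradilla", "/bin/sh ..spoolpath../job_scripts/341951"),
    Proc(2931, 3247, "sge", "sge_shepherd-342008 -bg"),
    Proc(2932, 2931, "nc", "/bin/sh ..spoolpath../job_scripts/342008"),
    Proc(2933, 3247, "sge", "sge_shepherd-342442 -bg"),
    Proc(2934, 2933, "nc", "/bin/sh ..spoolpath../job_scripts/342442"),
]

print(find_stuck_shepherds(procs))  # prints [341511]
```

A shepherd normally passes through this state for a moment right after it forks, so a real check would probably want to ignore very young processes.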

So the job never ran? I have seen this when the user has no permission
to read the spooled job script on the node, or when the script was never
created there at all. That is, the "exec" that should replace the forked
shepherd with the actual job script fails. Is the spool directory for
the nodes also in $SGE_ROOT/default/spool/<node>/... or somewhere in
/var/spool/sge local to the node?
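
To test that theory on a node, something like the following could be run against a spooled job script. It is only a sketch under assumptions: the function names are made up, it looks at the classic owner/group/other mode bits only (no ACLs, no root override, no check of the parent directories' search permission), and the stat itself runs with the rights of whoever calls it, not of the job owner.

```python
import os
import stat

def mode_allows_read(mode, owner_uid, owner_gid, uid, gids):
    """Classic Unix read check: the owner bits apply to the owner, the group
    bits to group members, the other bits to everyone else."""
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)

def check_job_script(path, uid, gids):
    """Return a short diagnosis for the spooled job script at `path`,
    as seen for the user with the given uid and set of group IDs."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return "missing"        # the script was never created on the node
    except PermissionError:
        return "unreachable"    # the caller cannot search a parent directory
    if mode_allows_read(st.st_mode, st.st_uid, st.st_gid, uid, gids):
        return "readable"
    return "not readable"
```

For example, `check_job_script(path, uid, gids)` with the path of the job script under the node's spool directory and the uid and group IDs of the job owner (as reported by `id` on the node) should come back as "readable" for a healthy job.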

-- Reuti

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



