[GE users] Stale finished jobs

Rayson Ho rayrayson at gmail.com
Tue Dec 4 17:46:38 GMT 2007



Try running strace on the stuck processes to see what they are doing.
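
To pick the process to trace, one option is to look for the pattern described in the quoted message below: a root-owned sge_shepherd whose parent is itself an sge_shepherd. The following is only a rough sketch; the function name find_stale_shepherds is made up for illustration, and it assumes input lines of the form "PID PPID USER COMMAND".

```shell
# Hypothetical helper, not part of Grid Engine.
# Reads "PID PPID USER COMMAND" lines on stdin and prints the PID of every
# root-owned sge_shepherd whose parent process is also an sge_shepherd.
find_stale_shepherds() {
    awk '
        # Remember owner and parent of every shepherd process.
        $4 ~ /^sge_shepherd/ { user[$1] = $3; parent[$1] = $2; shep[$1] = 1 }
        END {
            for (pid in shep)
                if (user[pid] == "root" && (parent[pid] in shep))
                    print pid
        }
    '
}
```

On the node it could be fed with "ps -eo pid=,ppid=,user=,comm= | find_stale_shepherds", and each PID it prints can then be inspected as root with "strace -f -tt -p PID".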

Also, Linux's NFS implementation is, IMO, not very robust, especially
under high load. Using local spool directories should work around it:
http://gridengine.sunsource.net/howto/nfsreduce.html
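
Before and after moving the spool, it may help to confirm which filesystem the execd spool directory actually lives on. A minimal sketch, assuming GNU stat is available on the nodes; the function names and the example path are hypothetical:

```shell
# Hypothetical helpers, not part of Grid Engine.
# is_nfs_type: succeeds if the given filesystem type name looks like NFS.
is_nfs_type() {
    case "$1" in
        nfs*) return 0 ;;
        *)    return 1 ;;
    esac
}

# spool_on_nfs: succeeds if the directory given as $1 is on an NFS mount.
# Relies on GNU stat, where "stat -f -c %T" prints the filesystem type.
spool_on_nfs() {
    is_nfs_type "$(stat -f -c %T "$1")"
}
```

For example, "spool_on_nfs /var/spool/sge && echo still on NFS" (the path is only an example). The directory that execd really uses is given by execd_spool_dir in the output of "qconf -sconf" for the host.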

Rayson



On Dec 4, 2007 12:07 PM, Norbert Crettol <norbert.crettol at idiap.ch> wrote:
> Norbert Crettol wrote:
> > The only problem we have is that some jobs remain in queue
> > although they have been terminated correctly. When I look
> > into the node, nothing is running anymore. And I have
> > to force delete them to remove them from the queues.
> > This is not tied to a particular node or type of job. I
> > personally ran many thousands of short jobs, always the same
> > binary, and saw about 1 to 2 stale jobs per thousand on
> > average. Some people reported a higher rate.
> I have run more tests and have more details:
> - when I get a stale job in a queue, on the node I can see the
>   shepherd (as user sge) running another shepherd (as root)
>   instead of  running the job script.
>
> This is what I get on the node:
> beo-02:~# ps auxwf | grep s[g]e
> sge       3247  0.1  0.0 28368 2892 ?        S    Nov30   9:54 ..binpath../sge_execd
> sge       1433  0.0  0.0 27776 2276 ?        S    17:11   0:00  \_ sge_shepherd-341511 -bg
> root      1437  0.0  0.0 27840 1720 ?        Ss   17:11   0:00  |   \_ sge_shepherd-341511 -bg
> sge       2684  0.0  0.0 27780 2280 ?        S    17:43   0:00  \_ sge_shepherd-341951 -bg
> aradilla  2685  0.0  0.0  5692 1200 ?        Ss   17:43   0:00  |   \_ /bin/sh ..spoolpath../job_scripts/341951
> sge       2931  0.0  0.0 27776 2276 ?        S    17:47   0:00  \_ sge_shepherd-342008 -bg
> nc        2932  0.0  0.0  5684 1200 ?        Ss   17:47   0:00  |   \_ /bin/sh ..spoolpath../job_scripts/342008
> sge       2933  0.0  0.0 27784 2280 ?        S    17:47   0:00  \_ sge_shepherd-342442 -bg
> nc        2934  0.0  0.0  5684 1200 ?        Ss   17:47   0:00      \_ /bin/sh ..spoolpath../job_scripts/342442
>
> We can see 4 jobs running here. The first one (341511) is the
> stale one, the others are normal ones. For clarity, I've
> replaced the real (long) paths with placeholders. The first
> job should have been running as the same user (nc) as the
> third and the fourth.
>
> I forgot to say that the cluster is running Debian Linux,
> kernel 2.6.20, and that the SGE_ROOT is an NFS share on the
> master.
>
> The jobs that were lost produced no output, neither standard
> nor error.
>
> If someone can help...
>
> Best regards
>
> Norbert
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
