[GE users] Stale finished jobs

Norbert Crettol norbert.crettol at idiap.ch
Tue Dec 4 17:07:52 GMT 2007


Norbert Crettol wrote:
> The only problem we have is that some jobs remain in the queue
> although they have terminated correctly. When I look
> at the node, nothing is running anymore, and I have
> to force-delete them to remove them from the queues.
> This is not tied to a particular node or to a type of job. I
> personally ran many thousands of short jobs, always the same
> binary, and had an average of about 1 to 2 stale jobs out
> of a thousand. But some people reported a higher average.
I have run more tests and can give more details:
- when I get a stale job in a queue, on the node I can see the
  shepherd (as user sge) running another shepherd (as root)
  instead of running the job script.

This is what I get on the node:
beo-02:~# ps auxwf | grep s[g]e
sge       3247  0.1  0.0 28368 2892 ?        S    Nov30   9:54 ..binpath../sge_execd
sge       1433  0.0  0.0 27776 2276 ?        S    17:11   0:00  \_ sge_shepherd-341511 -bg
root      1437  0.0  0.0 27840 1720 ?        Ss   17:11   0:00  |   \_ sge_shepherd-341511 -bg
sge       2684  0.0  0.0 27780 2280 ?        S    17:43   0:00  \_ sge_shepherd-341951 -bg
aradilla  2685  0.0  0.0  5692 1200 ?        Ss   17:43   0:00  |   \_ /bin/sh ..spoolpath../job_scripts/341951
sge       2931  0.0  0.0 27776 2276 ?        S    17:47   0:00  \_ sge_shepherd-342008 -bg
nc        2932  0.0  0.0  5684 1200 ?        Ss   17:47   0:00  |   \_ /bin/sh ..spoolpath../job_scripts/342008
sge       2933  0.0  0.0 27784 2280 ?        S    17:47   0:00  \_ sge_shepherd-342442 -bg
nc        2934  0.0  0.0  5684 1200 ?        Ss   17:47   0:00      \_ /bin/sh ..spoolpath../job_scripts/342442

We can see 4 jobs running here. The first one (341511) is the
stale one; the others are normal. For clarity, I've replaced
the real (long) paths with placeholders. The first job should
have run as the same user (nc) as the third and the fourth.
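
The pattern above (a shepherd whose child is another shepherd rather
than the job script) could be searched for automatically. The following
is only a minimal sketch, not something I have run on the cluster: the
function name find_nested_shepherds is made up for illustration, and it
assumes a ps that accepts "-eo pid,ppid,user,args" and prints one header
line. A shepherd child may also show up briefly before it starts the job
script, so a hit probably only means something if it is still there a
few seconds later.

```shell
# Hypothetical helper: read "pid ppid user args" lines (with one header
# line) on stdin and print every sge_shepherd process whose parent is
# itself an sge_shepherd.
find_nested_shepherds() {
    awk '
        NR == 1 { next }                 # skip the ps header line
        {
            name[$1]   = $4              # first word of the command line
            parent[$1] = $2
            owner[$1]  = $3
        }
        END {
            for (pid in name) {
                ppid = parent[pid]
                if (name[pid] ~ /sge_shepherd/ && (ppid in name) &&
                    name[ppid] ~ /sge_shepherd/)
                    print pid, owner[pid], name[pid]
            }
        }'
}
```

It would be used on the node as
"ps -eo pid,ppid,user,args | find_nested_shepherds"; with the process
list shown above it should report only process 1437 (root).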

I forgot to say that the cluster is running Debian Linux,
kernel 2.6.20, and that SGE_ROOT is an NFS share on the
master.

The jobs that were lost produced no output, neither on standard
output nor on standard error.

If someone can help...

Best regards

Norbert
