[GE users] Stale finished jobs

Norbert Crettol norbert.crettol at idiap.ch
Tue Dec 4 11:06:43 GMT 2007


Hello everybody,

It's my first post on this mailing list.

We have a dedicated 20-node cluster running SGE 6.1u3
(upgraded last Monday). Each node has two processors, each
dual-core, and 16 GB of RAM. This gives us
80 slots with 4 GB of RAM per slot.
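
For reference, the slot arithmetic above can be checked with a short
sketch. It assumes one slot per core, which is what the figures imply
but is not stated explicitly; the variable names are only illustrative.

```python
# Hardware figures as stated in the post.
nodes = 20
procs_per_node = 2
cores_per_proc = 2
ram_per_node_gb = 16

# Assumption: one slot per core.
slots_per_node = procs_per_node * cores_per_proc
total_slots = nodes * slots_per_node
ram_per_slot_gb = ram_per_node_gb / slots_per_node

print(total_slots, ram_per_slot_gb)
```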

We have three queues:
- long_jobs (max 1 slot/node) - quota max 5 jobs/user
- medium_jobs (max 2 slots/node) - quota max 15 jobs/user
- short_jobs (max 4 slots/node) - quota max 40 jobs/user
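
As a rough sketch of what these limits mean cluster-wide, the per-node
slot caps can be multiplied out over the 20 nodes and compared with the
per-user quotas. This assumes every job occupies exactly one slot and
that the quota counts jobs per queue; the helper name queue_capacity is
hypothetical.

```python
NODES = 20  # cluster size given in the post

# queue name -> (max slots per node, max jobs per user), as listed above
queues = {
    "long_jobs": (1, 5),
    "medium_jobs": (2, 15),
    "short_jobs": (4, 40),
}

def queue_capacity(slots_per_node, nodes=NODES):
    """Total slots a queue can occupy across the whole cluster."""
    return slots_per_node * nodes

for name, (slots_per_node, user_quota) in queues.items():
    capacity = queue_capacity(slots_per_node)
    # Fraction of the queue a single user can fill, assuming one slot per job.
    share = user_quota / capacity
    print(f"{name}: {capacity} slots cluster-wide, "
          f"one user may hold {user_quota} ({share:.1%})")
```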

All of this works fine. The new quota feature rocks. Jobs
are shared fairly and the users are happy.

The only problem we have is that some jobs remain in the queue
although they have terminated correctly. When I look
at the node, nothing is running anymore, and I have
to force-delete the jobs to remove them from the queues.
This is not tied to a particular node or to a type of job. I
personally ran many thousands of short jobs, always the same
binary, and saw an average of about 1 to 2 stale jobs per
thousand. Some people have reported a higher rate.

Has anyone experienced the same problem? Is there something
I can do?

Regards

Norbert Crettol

