[GE users] Jobs still shown as running after process has died

robhorton r.horton at qmul.ac.uk
Fri Aug 13 09:09:04 BST 2010


Hi,

----- "reuti" <reuti at staff.uni-marburg.de> wrote:

> 
> was this a serial or parallel job? Parallel jobs are known to have a
> delay after they finished.

It's a serial job. In the current example 3 (out of 100) tasks from an array job have "disappeared" on one node. The remaining slot is occupied by a job from a different user which is apparently running normally.

> What do you mean by "the actual process ... has died"? It hangs or
> disappeared from the process list (hence the shepherd is hanging
> around there alone)?

The process and the shepherd have both disappeared from the process list. When I run strace -p on the execd process, it regularly looks at the /proc entries for the processes that are still running but doesn't seem to look for the missing ones. So it looks as though the execd knows the processes have exited, but that information isn't getting back to the qmaster for whatever reason.
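
For anyone wanting to repeat the check, this is roughly what "disappeared from the process list" means here, as a minimal Python sketch. The function name and the proc_root parameter are made up for illustration, and the PIDs to test have to come from wherever you noted them for the job; nothing in it queries Grid Engine itself.

```python
import os


def split_live_and_dead(pids, proc_root="/proc"):
    """Partition PIDs by whether a <proc_root>/<pid> directory exists.

    Hypothetical helper: it only tests for the /proc entry and does not
    check that the PID still belongs to the original job.
    """
    live, dead = [], []
    for pid in pids:
        if os.path.isdir(os.path.join(proc_root, str(pid))):
            live.append(pid)
        else:
            dead.append(pid)
    return live, dead
```

Calling split_live_and_dead() with the shepherd and job PIDs of the three missing tasks should put all of them in the second list, while the PID of the other user's job should land in the first.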

I've got a "live" example at the moment if anyone has any debugging suggestions.

This is 6.2u4 as supplied with Rocks 5.3, for what it's worth.

Rob

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=274212
