[GE users] Jobs still shown as running after process has died

reuti reuti at staff.uni-marburg.de
Fri Aug 13 11:09:44 BST 2010


On 13.08.2010 at 10:09, robhorton wrote:

> Hi,
> ----- "reuti" <reuti at staff.uni-marburg.de> wrote:
>> was this a serial or parallel job? Parallel jobs are known to have a
>> delay after they finished.
> It's a serial job. In the current example 3 (out of 100) tasks from an array job have "disappeared" on one node. The remaining slot is occupied by a job from a different user which is apparently running normally.
>> What do you mean by "the actual process ... has died"? Does it hang, or
>> has it disappeared from the process list (hence the shepherd is hanging
>> around there alone)?
> The process and shepherd have disappeared from the process list. When I run strace -p on the execd process, it regularly looks at the /proc entries for the working processes but doesn't seem to look for the other processes. So it looks as though the execd knows the processes have exited, but that information isn't getting back to the qmaster for whatever reason.
> I've got a "live" example at the moment if anyone has any debugging suggestions.

- Was the $TMPDIR on the node already removed?
- Was the job's spool directory under $SGE_ROOT/default/spool/<exechost>/active_jobs removed (or is the spool local, like /var/spool/<exechost>/active_jobs, which would be better)?
- Does the messages file of the qmaster have no entry either? (loglevel info)
- Was the email sent at the end of the job?
- Does the node's "messages" file contain a note about the email?
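The file-system checks above could be gathered into a small helper, roughly as sketched here. This is only an illustration: check_job_traces is a hypothetical function, not part of Grid Engine, and it assumes that the scratch and active_jobs directories are named starting with "<jobid>." (which may differ on your installation). All paths are passed in by the caller, and the match in the messages files is a plain substring search on the job ID.

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine.
# Usage: check_job_traces JOB_ID TMP_BASE ACTIVE_JOBS_DIR [MESSAGES_FILE...]
check_job_traces() {
    job_id=$1
    tmp_base=$2
    active_jobs=$3
    shift 3

    # Scratch directories: assumed to be named "<job_id>.<something>" under TMP_BASE.
    for d in "$tmp_base/$job_id".*; do
        if [ -d "$d" ]; then
            echo "tmpdir present: $d"
        fi
    done

    # Spool directories of the job: assumed to be named "<job_id>.<task>" under active_jobs.
    for d in "$active_jobs/$job_id".*; do
        if [ -d "$d" ]; then
            echo "active_jobs present: $d"
        fi
    done

    # Count lines mentioning the job ID in each messages file (qmaster and/or node).
    # This is a crude substring match, so longer numbers containing the ID also count.
    for f in "$@"; do
        if [ -r "$f" ]; then
            n=$(grep -F -c -- "$job_id" "$f" || true)
            echo "messages $f: $n line(s) mention $job_id"
        else
            echo "messages $f: not readable"
        fi
    done
}
```

On the affected node, one would pass the job ID, the base directory under which $TMPDIR is created, the active_jobs directory of that exec host, and the messages files of the node and of the qmaster. Leftover directories for a job that has no running process, together with missing entries in the qmaster's messages file, would narrow down where the exit information got lost.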

-- Reuti

> This is 6.2u4 as supplied with Rocks 5.3 for what it's worth.
> Rob
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=274212
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


