[GE users] execd doesn't know this job (disappering jobs, 't' problem)
Stephan Grell - Sun Germany - SSG - Software Engineer
stephan.grell at sun.com
Mon Jan 24 09:01:07 GMT 2005
I just worked on a similar problem and was able to narrow it down
to a spooling issue on the execd side. The execd spool dir got
changed from NFS spooling to local spooling. This generated an
incomplete set of dirs in the spool dir.
I do not know whether your problem is triggered by the same issue, or
whether I just got lucky and did not run into the t-state problem again.
The spool dir issue is number 103.
Is your problem related to 103?
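
One quick way to check whether an exec host has an incomplete spool dir is to look for the subdirectories execd normally keeps there. The following is only a rough sketch: the names in EXPECTED_SUBDIRS (active_jobs, jobs, job_scripts) are my assumption of the usual layout, and the helper missing_spool_dirs is made up for illustration. Compare against a known-good host before trusting the list.

```python
import os
import tempfile

# Assumption: subdirectories usually present in an execd spool directory.
# Adjust this tuple to match a known-good exec host at your site.
EXPECTED_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def missing_spool_dirs(spool_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that are absent from spool_dir."""
    return [
        name
        for name in expected
        if not os.path.isdir(os.path.join(spool_dir, name))
    ]


if __name__ == "__main__":
    # Demo on a throwaway directory that mimics an incomplete spool dir.
    demo = tempfile.mkdtemp()
    os.mkdir(os.path.join(demo, "jobs"))
    print(missing_spool_dirs(demo))
```

Run on each exec host with the real spool path in place of the demo directory; any name it returns is a directory that is missing on that host.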
We also found a bug in file staging. If the file does not exist, the job
will disappear and an email will be sent. Do you use file staging for
Christian Bolliger wrote:
>Sorry for bringing up a problem again. Using SGE 6.0u3, I previously
>thought that the problem was linked to the filehandle problem in 6.0u2.
>Jobs in our Myrinet section tend to disappear in the starting phase
>(it seems that gbit mpich jobs are also affected). They are taken into
>'t' state and then quit without any output (users call it the 't' problem).
>Jobs using more CPUs are more likely to disappear.
>It is not limited to specific exec hosts. It seems to be a kind of
>This problem really hinders production, some users are demanding PBS :( .
>Many thanks for helping
>PS: I will also open an issue, but there might be other users with this
>messages:01/21/2005 18:32:11|qmaster|master1|W|job 42972.1 failed on
>host node0072a.mbit.mh.hpc.unizh.ch in recognising job because: execd
>doesn't know this job
>node0072a.mbit.mh.hpc.unizh.ch reports running state for job
>(42972.1/master) in queue "long-myri.q at node0072a.mbit.mh.hpc.unizh.ch"
>while job is in state 65536
>18:34:14|qmaster|master1|E|execd at node0072a.mbit.mh.hpc.unizh.ch reports
>running job (42972.1/master) in queue
>"long-myri.q at node0072a.mbit.mh.hpc.unizh.ch" that was not supposed to be
>there - killing
>qsub_time Thu Jan 1 01:00:00 1970
>failed 21 : in recognising job
>qacct -j of an identical job which ran:
>chribo at master1:~/mpich-test> qacct -j 42971
>qsub_time Fri Jan 21 18:26:21 2005
>start_time Fri Jan 21 18:30:00 2005
>end_time Fri Jan 21 18:30:09 2005
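
To see how many jobs and which hosts are affected, the qmaster messages file can be scanned for the "execd doesn't know this job" warning. This is a minimal sketch, assuming each entry is a single line with the pipe-separated layout seen in the quoted log (timestamp|daemon|host|level|text). The function names and the regular expression are my own and are not part of SGE.

```python
import re

# Matches the job id, task id and host in the warning text quoted above.
JOB_RE = re.compile(r"job (\d+)\.(\d+) failed on host (\S+)")


def parse_message_line(line):
    """Split one messages entry into its five pipe-separated fields."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the assumed format, skip it
    keys = ("timestamp", "daemon", "host", "level", "text")
    return dict(zip(keys, parts))


def unknown_job_failures(lines):
    """Collect (timestamp, job_id, task_id, exec_host) for each warning."""
    hits = []
    for line in lines:
        rec = parse_message_line(line)
        if rec is None or "execd doesn't know this job" not in rec["text"]:
            continue
        match = JOB_RE.search(rec["text"])
        if match:
            hits.append(
                (rec["timestamp"], int(match.group(1)),
                 int(match.group(2)), match.group(3))
            )
    return hits


if __name__ == "__main__":
    sample = [
        "01/21/2005 18:32:11|qmaster|master1|W|job 42972.1 failed on host "
        "node0072a.mbit.mh.hpc.unizh.ch in recognising job because: "
        "execd doesn't know this job",
    ]
    for hit in unknown_job_failures(sample):
        print(hit)
```

Feeding it the open messages file instead of the sample list gives one tuple per lost job, which makes it easy to check whether the failures really are spread over all exec hosts.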