[GE users] execd doesn't know this job (disappearing jobs, 't' problem)
christian.bolliger at id.unizh.ch
Mon Jan 24 10:51:43 GMT 2005
Could be the same problem. The execds are still on 6.0u2; I will
update them to 6.0u3 and test again.
Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>I just worked on a similar problem and was able to narrow it down
>to a spooling issue on the execd side. The execd spool dir got
>changed from NFS spooling to local spooling. This generated an
>incomplete set of dirs in the spool dir.
>I do not know whether your problem is triggered by the same issue or
>I just got lucky not to run into the t-state problem again. The spool dir
>issue is number 103.
>Is your problem related to 103?
>We also found a bug in file staging. If the file does not exist, the job
>will disappear and an email will be sent. Do you use file staging for
>Christian Bolliger wrote:
>>Sorry for bringing this problem up again. We are using SGE 6.0u3; I previously
>>thought that the problem was linked to the filehandle problem in 6.0u2.
>>Jobs in our Myrinet section tend to disappear in the starting phase
>>(it seems that gbit mpich jobs are also affected). They are put into
>>'t' state and then quit without any output (users call it the 't' problem).
>>Jobs using more CPUs are more likely to disappear.
>>It is not limited to specific exec hosts. It seems to be a kind of
>>This problem really hinders production; some users are demanding PBS :( .
>>Many thanks for helping
IT Services | http://www.id.unizh.ch/
Central Systems / HPC | http://www.matterhorn.unizh.ch/
University of Zuerich | E-Mail: christian.bolliger at id.unizh.ch
Winterthurerstr. 190 | Tel: +41 (0)1 63 56775
CH-8057 Zuerich; Switzerland | Fax: +41 (0)1 63 54505
Mime/S CA: https://www.ca.unizh.ch/client/