[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Christian Bolliger christian.bolliger at id.unizh.ch
Mon Jan 24 10:51:43 GMT 2005



Hello Stephan
This could be the same problem. My execds are still on 6.0u2; I will 
update them to 6.0u3 and test again.
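
A rough way to look for the incomplete spool dir described in the quoted
message is sketched here in Python. The subdirectory names (active_jobs,
jobs, job_scripts) are an assumption about what an execd normally keeps in
its per-host spool dir, and the function name is made up for illustration;
adjust both to the local installation.

```python
import os
import sys

# Assumption: subdirectories an execd normally keeps in its per-host
# spool directory. Adjust this tuple to match the local installation.
EXPECTED_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def missing_spool_dirs(spool_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that are absent from spool_dir."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(spool_dir, name))]


if __name__ == "__main__":
    # Usage: pass one or more execd spool directories as arguments.
    for path in sys.argv[1:]:
        missing = missing_spool_dirs(path)
        if missing:
            print(path + ": missing " + ", ".join(missing))
        else:
            print(path + ": complete")
```

Run on each exec host against its local spool dir, this only reports what
is missing; it does not create or repair anything.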

Thanks
Christian

Stephan Grell - Sun Germany - SSG - Software Engineer wrote:

>Hello,
>
>I just worked on a similar problem and was able to bring it down
>to a spooling issue on the execd side. The execd spool dir got
>changed from NFS spooling to local spooling. This generated an
>incomplete set of dirs in the spool dir.
>
>I do not know whether your problem is triggered by the same issue or
>whether I just got lucky and did not run into the t-state problem again.
>The spool dir issue is number 103.
>
>Is your problem related to 103?
>
>We also found a bug in file staging. If the file does not exist, the job
>will disappear and an email will be sent. Do you use file staging for
>your jobs?
>
>Cheers,
>Stephan
>
>Christian Bolliger wrote:
>
>  
>
>>Hello
>>Sorry for bringing this problem up again. We are using SGE 6.0u3; I previously 
>>thought that the problem was linked to the filehandle problem in 6.0u2.
>>
>>Jobs in our Myrinet section tend to disappear in the starting phase 
>>(it seems that gbit mpich jobs are also affected). They are put into the 
>>'t' state and then quit without any output (users call it the 't' problem). 
>>Jobs using more CPUs are more likely to disappear.
>>The problem is not limited to specific exec hosts. It seems to be a kind of 
>>race condition.
>>
>>This problem really hinders production; some users are demanding PBS :( .
>>
>>Many thanks for helping
>>Christian
>>
>>
>>    
>>

-- 
=============================================================================
Christian Bolliger                 
IT Services                      | http://www.id.unizh.ch/
Central Systems / HPC   	 | http://www.matterhorn.unizh.ch/
University of  Zuerich           | E-Mail: christian.bolliger at id.unizh.ch
Winterthurerstr. 190             | Tel: +41 (0)1 63 56775
CH-8057 Zuerich; Switzerland     | Fax: +41 (0)1 63 54505
Mime/S CA:                https://www.ca.unizh.ch/client/



