[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Mon Jan 24 11:35:24 GMT 2005



Christian Bolliger wrote:

>Hello Stephan
>It could be the same problem. My execds are still on 6.0u2; I will
>update them to 6.0u3 and test again.
>

This is not fixed in u3. Simply checking the execd spool dirs and
creating any missing dirs will work around it.
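
A minimal sketch of such a check, in Python. The list of expected
subdirectories (active_jobs, jobs, job_scripts) is an assumption on my
side; compare it against the spool dir of a host that works before
relying on it. The function name and the command-line handling are
made up for illustration only.

```python
import os
import sys

# ASSUMPTION: subdirectories an execd spool dir is expected to contain.
# Verify this list against a known-good exec host before using it.
EXPECTED_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def repair_spool_dir(spool_dir, expected=EXPECTED_SUBDIRS, create=True):
    """Return the names of missing subdirectories of spool_dir.

    If create is True, each missing subdirectory is also created.
    """
    missing = []
    for name in expected:
        path = os.path.join(spool_dir, name)
        if not os.path.isdir(path):
            missing.append(name)
            if create:
                os.makedirs(path, exist_ok=True)
    return missing


if __name__ == "__main__":
    # Hypothetical usage: pass one or more execd spool dirs as arguments.
    for spool_dir in sys.argv[1:]:
        for name in repair_spool_dir(spool_dir):
            print("created missing dir: " + os.path.join(spool_dir, name))
```

Run it as the account that owns the spool dir, so that the new dirs get
the same ownership as the existing ones. Calling repair_spool_dir with
create=False only reports what is missing.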

Cheers,
Stephan

>
>Thanks
>Christian
>
>Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>
>>Hello,
>>
>>I just worked on a similar problem and was able to narrow it down
>>to a spooling issue on the execd side. The execd spool dir had been
>>changed from NFS spooling to local spooling, which left an
>>incomplete set of dirs in the spool dir.
>>
>>I do not know whether your problem is triggered by the same issue, or
>>whether I was just lucky not to run into the t-state problem again.
>>The spool dir issue is number 103.
>>
>>Is your problem related to 103?
>>
>>We also found a bug in file staging. If the file does not exist, the
>>job will disappear and an email will be sent. Do you use file staging
>>for your jobs?
>>
>>Cheers,
>>Stephan
>>
>>Christian Bolliger wrote:
>>
>>>Hello
>>>Sorry for bringing up a problem again. We are now using SGE 6.0u3; I
>>>previously thought that the problem was linked to the filehandle
>>>problem in 6.0u2.
>>>
>>>Jobs in our Myrinet section tend to disappear in the starting phase
>>>(it seems that gbit mpich jobs are also affected). They are put into
>>>the 't' state and then quit without any output (users call it the 't'
>>>problem). Jobs using more CPUs are more likely to disappear.
>>>It is not limited to specific exec hosts. It seems to be some kind of
>>>race condition.
>>>
>>>This problem really hinders production, some users are demanding PBS :( .
>>>
>>>Many thanks for helping
>>>Christian


