[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Christian Bolliger christian.bolliger at id.unizh.ch
Mon Jan 24 11:46:20 GMT 2005


In that case it is not the same problem as issue 103 (I had overlooked
something). Our problem is intermittent: it appears only occasionally
and is spread more or less evenly over all hosts.
Creating $SGE_ROOT/$SGE_CELL/spool/$hostname is already handled by our
installation procedure.
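
As an illustration of the check Stephan suggests below (verify the spool
dirs, create the missing ones), here is a minimal Python sketch. The
subdirectory names in EXPECTED_SUBDIRS and the helper names
ensure_spool_dirs and host_spool_dir are assumptions made for
illustration; they are not taken from this thread, so compare them
against a known-good exec host before relying on them.

```python
from pathlib import Path

# ASSUMPTION: subdirectories expected inside an execd spool directory.
# These names are not confirmed in this thread; verify on a working host.
EXPECTED_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def host_spool_dir(sge_root, sge_cell, hostname):
    """Build the per-host spool path $SGE_ROOT/$SGE_CELL/spool/$hostname."""
    return Path(sge_root) / sge_cell / "spool" / hostname


def ensure_spool_dirs(spool_dir, subdirs=EXPECTED_SUBDIRS):
    """Create any missing subdirectories and return the names created."""
    spool_dir = Path(spool_dir)
    created = []
    for name in subdirs:
        path = spool_dir / name
        if not path.is_dir():
            # parents=True also creates the host spool dir itself if absent.
            path.mkdir(parents=True, exist_ok=True)
            created.append(name)
    return created
```

On an exec host this could be called as
ensure_spool_dirs(host_spool_dir(sge_root, sge_cell, hostname)) with the
values of $SGE_ROOT, $SGE_CELL and the host name. Ownership and
permissions of the created directories are not handled here and would
have to match what execd expects.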

Thanks
Christian

Stephan Grell - Sun Germany - SSG - Software Engineer wrote:

>
>
> Christian Bolliger wrote:
>
>>Hello Stephan
>>It could be the same problem. The execds are still on 6.0u2; I will
>>update them to 6.0u3 and test again.
>>
>
> This is not fixed in u3. Simply checking the spool dirs and creating
> any missing dirs will take care of it.
>
> Cheers,
> Stephan
>
>>Thanks
>>Christian
>>
>>Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>
>>>Hello,
>>>
>>>I just worked on a similar problem and was able to narrow it down
>>>to a spooling issue on the execd side. The execd spool dir had been
>>>changed from NFS spooling to local spooling, which left an
>>>incomplete set of dirs in the spool dir.
>>>
>>>I do not know whether your problem is triggered by the same issue, or
>>>whether I just got lucky and did not run into the t-state problem
>>>again. The spool dir issue is number 103.
>>>
>>>Is your problem related to 103?
>>>
>>>We also found a bug in file staging. If the file does not exist, the
>>>job will disappear and an email will be sent. Do you use file staging
>>>for your jobs?
>>>
>>>Cheers,
>>>Stephan
>>>
>>>Christian Bolliger wrote:
>>>
>>>>Hello
>>>>Sorry for bringing up this problem again. We are using SGE 6.0u3; I
>>>>previously thought the problem was linked to the filehandle problem
>>>>in 6.0u2.
>>>>
>>>>Jobs in our Myrinet section tend to disappear in the starting phase
>>>>(it seems that gbit mpich jobs are also affected). They are put into
>>>>'t' state and then quit without any output (users call it the 't'
>>>>problem). Jobs using more CPUs are more likely to disappear.
>>>>It is not limited to specific exec hosts. It seems to be some kind of
>>>>race condition.
>>>>
>>>>This problem really hinders production; some users are demanding PBS :( .
>>>>
>>>>Many thanks for helping
>>>>Christian

-- 
=============================================================================
Christian Bolliger                 
IT Services                      | http://www.id.unizh.ch/
Central Systems / HPC            | http://www.matterhorn.unizh.ch/
University of Zuerich            | E-Mail: christian.bolliger at id.unizh.ch
Winterthurerstr. 190             | Tel: +41 (0)1 63 56775
CH-8057 Zuerich; Switzerland     | Fax: +41 (0)1 63 54505
Mime/S CA:                https://www.ca.unizh.ch/client/



