[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Mon Jan 24 13:32:44 GMT 2005



Christian Bolliger wrote:

> In that case it is not the same problem as 103 (I overlooked 
> something). It is a statistical problem, it just appears sometimes, 
> more or less equally distributed over all hosts.
> The problem of creating $SGE_ROOT/$SGE_CELL/spool/$hostname has been 
> solved by our installation procedure.
>
That means this bug is still in there. I was hoping that it was 
just the spool dirs, but it looks as if it is not only that.
Do you have some more information for me? How is your grid set up? Which 
operating systems are you using? Is this problem operating-system specific?

Can you send me:
- your acct file
- qmaster messages file
- one or two execd messages files from hosts on which the problem happened?

Thank you very much.
Stephan
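
For reference, the "simple check of the spool dirs" mentioned in the quoted
message below could be sketched roughly as follows. This is only a minimal
sketch, not the installation procedure used at either site. The subdirectory
names (active_jobs, jobs, job_scripts), the use of the short hostname, and
the helper name ensure_spool_dirs are assumptions that do not come from this
thread; compare them against a known-good execd spool dir before relying on
them.

```python
import os
import socket

# ASSUMPTION: these subdirectory names are not confirmed anywhere in this
# thread. Compare against a working execd spool directory first.
EXPECTED_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def ensure_spool_dirs(spool_root, hostname, subdirs=EXPECTED_SUBDIRS):
    """Create any missing directories under spool_root/hostname.

    Returns the list of paths that had to be created, so an empty list
    means the spool directory was already complete.
    """
    created = []
    host_dir = os.path.join(spool_root, hostname)
    wanted = [host_dir] + [os.path.join(host_dir, d) for d in subdirs]
    for path in wanted:
        if not os.path.isdir(path):
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created


if __name__ == "__main__":
    sge_root = os.environ.get("SGE_ROOT")
    sge_cell = os.environ.get("SGE_CELL", "default")
    if sge_root is None:
        raise SystemExit("SGE_ROOT is not set")
    # Mirrors the $SGE_ROOT/$SGE_CELL/spool/$hostname layout quoted above.
    spool_root = os.path.join(sge_root, sge_cell, "spool")
    # ASSUMPTION: the spool dir is named after the short hostname.
    host = socket.gethostname().split(".")[0]
    for path in ensure_spool_dirs(spool_root, host):
        print("created", path)
```

The script would have to run on each exec host as the user that owns the
spool directories, since it does not adjust ownership or permissions. With
local spooling the spool root is different, so the path would need to be
changed accordingly.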

> Thanks
> Christian
> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>
>>
>>
>> Christian Bolliger wrote:
>>
>>> Hello Stephan
>>> Could be the same problem. I have the execd still on 6.0u2, I will 
>>> update them to 6.0u3 and test again.
>>>
>>
>> This is not fixed in u3. A simple check of the spool dirs and creating
>> the missing dirs will do it.
>>
>> Cheers,
>> Stephan
>>
>>> Thanks
>>> Christian
>>>
>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>
>>>  
>>>
>>>> Hello,
>>>>
>>>> I just worked on a similar problem and was able to narrow it down
>>>> to a spooling issue on the execd side. The execd spool dir got
>>>> changed from NFS spooling to local spooling. This generated an
>>>> incomplete set of dirs in the spool dir.
>>>>
>>>> I do not know if your problem is triggered by the same issue or if I
>>>> just got lucky not to run into the t-state problem again. The spool dir
>>>> issue is number 103.
>>>>
>>>> Is your problem related to 103?
>>>>
>>>> We also found a bug in file staging. If the file does not exist, the job
>>>> will disappear and an email will be sent. Do you use file staging for
>>>> your jobs?
>>>>
>>>> Cheers,
>>>> Stephan
>>>>
>>>> Christian Bolliger wrote:
>>>>
>>>>> Hello
>>>>> Sorry for bringing up a problem again. Using SGE 6.0u3, I 
>>>>> previously thought that the problem was linked to the filehandle 
>>>>> problem in 6.0u2.
>>>>>
>>>>> Jobs in our Myrinet section tend to disappear in the starting 
>>>>> phase (it seems that gbit mpich jobs are also affected). They are 
>>>>> put into 't' state and then quit without any output (users call 
>>>>> it the 't' problem). Jobs using more CPUs are more likely to disappear.
>>>>> It is not limited to specific exec hosts. It seems to be a kind of 
>>>>> race condition.
>>>>>
>>>>> This problem really hinders production, some users are demanding 
>>>>> PBS :( .
>>>>>
>>>>> Many thanks for helping
>>>>> Christian

