[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Mon Jan 24 14:01:14 GMT 2005



Christian Bolliger wrote:

> Hello Stephan
>
> We have a cluster with 256 nodes: 128 Myrinet nodes and 128 gbit nodes.
> The bug is only relevant on Myrinet; it also occurs in the gbit queue,
> but there we have mostly sequential jobs.
> SGE setup:
> - NFS-reduced setup with local executables and local spool, but
> distributed configuration (for shadow masters).
> - Using Berkeley DB as the spool mechanism on a dedicated server
> (single point of failure).

Which operating systems are you using?
- We saw the problem only on lx24-amd64, but that might be because only
those machines were used for the large parallel jobs.

>
> The files will be sent directly; they are too big for the list.

Thank you very much. I hope I will find something.

Stephan

>
> Christian
>
> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>
>> Christian Bolliger wrote:
>>
>>> In that case it is not the same problem as 103 (I overlooked 
>>> something). It is a statistical problem; it just appears sometimes, 
>>> more or less equally distributed over all hosts.
>>> The problem of creating $SGE_ROOT/$SGE_CELL/spool/$hostname has been 
>>> solved by our installation procedure.
>>>
>> That means that this bug is still in there. I was hoping that it
>> was just the spool dirs; it looks as if it is not only that.
>> Do you have some more information for me? How is your grid set up?
>> What operating systems are you using? Is the problem operating
>> system specific?
>>
>> Can you send me:
>> - your acct file
>> - the qmaster messages file
>> - one or two execd messages files from hosts on which the problem happened?
>>
>> Thank you very much.
>> Stephan
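The files requested above live in well-known places in a default 6.0 installation. A minimal sketch for collecting them on one host follows; the SGE_ROOT/SGE_CELL defaults, the exact paths, and the output locations are assumptions (not taken from the thread) and need adjusting for local or BDB spooling setups:

```shell
#!/bin/sh
# Hedged sketch: gather the accounting file plus the qmaster and local
# execd messages files into one tarball for a support request. All paths
# below assume a default SGE 6.0 layout -- adjust SGE_ROOT, SGE_CELL and
# the spool locations to match your cell.
SGE_ROOT="${SGE_ROOT:-/opt/sge}"
SGE_CELL="${SGE_CELL:-default}"
OUT="${OUT:-/tmp/sge_diag.tar}"
MANIFEST="${MANIFEST:-/tmp/sge_diag_manifest.txt}"

: > "$MANIFEST"                      # record which files were actually found
for f in "$SGE_ROOT/$SGE_CELL/common/accounting" \
         "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages" \
         "$SGE_ROOT/$SGE_CELL/spool/$(hostname)/messages"; do
    [ -f "$f" ] && echo "$f" >> "$MANIFEST"
done

if [ -s "$MANIFEST" ]; then
    # archive only the files that exist on this host
    tar -cf "$OUT" $(cat "$MANIFEST")
    echo "collected $(wc -l < "$MANIFEST") file(s) into $OUT"
else
    echo "no accounting/messages files found under $SGE_ROOT/$SGE_CELL"
fi
```

Run it on the qmaster host and on one or two affected exec hosts, then send the resulting tarballs.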
>>
>>> Thanks
>>> Christian
>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>
>>>>
>>>>
>>>> Christian Bolliger wrote:
>>>>
>>>>> Hello Stephan
>>>>> Could be the same problem. I have the execd still on 6.0u2, I will 
>>>>> update them to 6.0u3 and test again.
>>>>>
>>>>
>>>> This is not fixed in u3. A simple check of the spool dirs and creating
>>>> the missing dirs will do it.
>>>>
>>>> Cheers,
>>>> Stephan
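The workaround Stephan describes ("check the spool dirs and create the missing dirs") can be sketched roughly as below. The subdirectory names (active_jobs, jobs, job_scripts) and the /tmp demo default are assumptions for illustration; compare against a healthy execd host and point SPOOL_BASE at $SGE_ROOT/$SGE_CELL/spool before using it for real:

```shell
#!/bin/sh
# Hedged sketch: verify and recreate the per-host execd spool layout.
# SPOOL_BASE normally points at $SGE_ROOT/$SGE_CELL/spool; the /tmp
# default and the subdirectory list below are assumptions -- verify both
# against a working execd host first.
SPOOL_BASE="${SPOOL_BASE:-/tmp/sge_spool_demo}"
HOST="${HOST:-$(hostname)}"

for sub in active_jobs jobs job_scripts; do
    dir="$SPOOL_BASE/$HOST/$sub"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir -- creating"
        mkdir -p "$dir"
    fi
done
echo "spool layout for $HOST checked under $SPOOL_BASE"
```

Running this once per exec host (with the execd stopped) before restart should rule out the incomplete-spool-dir cause.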
>>>>
>>>>> Thanks
>>>>> Christian
>>>>>
>>>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>>
>>>>>  
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I just worked on a similar problem and was able to track it down
>>>>>> to a spooling issue on the execd side. The execd spool dir got
>>>>>> changed from NFS spooling to local spooling, which generated an
>>>>>> incomplete set of dirs in the spool dir.
>>>>>>
>>>>>> I do not know if your problem is triggered by the same issue or
>>>>>> if I just got lucky not to run into the t-state problem again.
>>>>>> The spool dir issue is number 103.
>>>>>>
>>>>>> Is your problem related to 103?
>>>>>>
>>>>>> We also found a bug in file staging. If the file does not exist,
>>>>>> the job will disappear and an email will be sent. Do you use file
>>>>>> staging for your jobs?
>>>>>>
>>>>>> Cheers,
>>>>>> Stephan
>>>>>>
>>>>>> Christian Bolliger wrote:
>>>>>>
>>>>>>
>>>>>>> Hello
>>>>>>> Sorry for bringing up a problem again. Using SGE 6.0u3, I
>>>>>>> previously thought that the problem was linked to the file
>>>>>>> handle problem in 6.0u2.
>>>>>>>
>>>>>>> Jobs in our Myrinet section tend to disappear in the starting
>>>>>>> phase (it seems that gbit mpich jobs are also affected). They
>>>>>>> are taken into 't' state and then quit without any output (users
>>>>>>> call it the 't' problem). Jobs using more CPUs are more likely
>>>>>>> to disappear.
>>>>>>> It is not limited to specific exec hosts. It seems to be a kind
>>>>>>> of race condition.
>>>>>>>
>>>>>>> This problem really hinders production; some users are demanding
>>>>>>> PBS :( .
>>>>>>>
>>>>>>> Many thanks for helping
>>>>>>> Christian
>>>>>>>
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
>





