[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Mon Jan 24 14:40:20 GMT 2005



Did you have the problems right from the beginning, or did they start
with u2? Did you use u1?


Have you configured the admin email? Whenever something important
goes wrong in the execd, it tries to send an email.
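
If you are not sure, a quick way to check is "qconf -sconf", which
prints the global configuration including administrator_mail (the
default value "none" means no mail is sent). Below is a minimal sketch
of such a check; it assumes qconf is in the PATH and the SGE
environment is sourced, and the parsing is just an assumption about
the plain key/value output format:

    #!/usr/bin/env python
    # Minimal sketch: report the administrator_mail setting from the
    # global SGE configuration ("qconf -sconf"). Assumes qconf is in
    # PATH and SGE_ROOT etc. are already set in the environment.
    import subprocess

    output = subprocess.check_output(["qconf", "-sconf"]).decode()
    for line in output.splitlines():
        if line.startswith("administrator_mail"):
            parts = line.split(None, 1)
            value = parts[1] if len(parts) > 1 else ""
            # "none" is the default and means no admin mail is sent
            print("administrator_mail =", value or "(empty)")
            break
    else:
        print("administrator_mail not found in qconf -sconf output")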

Stephan

Christian Bolliger wrote:

> Sorry, I forgot that: it is lx26-amd64 (SuSE 9.2).
> The binaries were compiled by us.

Hm, now I know of two grids using lx2?-amd64 with this problem...

>
> Christian Bolliger
>
> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>
>> Christian Bolliger wrote:
>>
>>> Hello Stephan
>>>
>>> We have a cluster with 256 nodes: 128 nodes on Myrinet, 128 nodes
>>> on GBit. The bug really only matters on Myrinet; it also occurs in
>>> the GBit queue, but there we have mostly sequential jobs.
>>> SGE setup:
>>> - NFS-reduced setup with local executables and local spooling, but a
>>> distributed configuration (for the shadow masters).
>>> - Berkeley DB as the spooling mechanism on a dedicated server
>>> (a single point of failure).
>>
>> Which operating systems are you using?
>> - We only saw the problem on lx24-amd64, but that might be because
>> only those machines were used for the large parallel jobs.
>>
>>>
>>> The files will be sent directly; they are too big for the list.
>>
>> Thank you very much. I hope I will find something.
>>
>> Stephan
>>
>>>
>>> Christian
>>>
>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>
>>>> Christian Bolliger wrote:
>>>>
>>>>> In that case it is not the same problem as 103 (I overlooked 
>>>>> something). It is a statistical problem; it just appears
>>>>> sometimes, more or less equally distributed over all hosts.
>>>>> The problem of creating $SGE_ROOT/$SGE_CELL/spool/$hostname has 
>>>>> been solved by our installation procedure.
>>>>>
>>>> That means that this bug is still in there. I was hoping that it
>>>> was just the spool dirs; it looks as if it is not only that.
>>>> Do you have some more information for me? How is your grid set up?
>>>> Which operating systems are you using? Is this problem operating
>>>> system specific?
>>>>
>>>> Can you send me:
>>>> - your acct file
>>>> - qmaster messages file
>>>> - one or two execd messages files from hosts on which the problem happened?
>>>>
>>>> Thank you very much.
>>>> Stephan
>>>>
>>>>> Thanks
>>>>> Christian
>>>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> Christian Bolliger wrote:
>>>>>>
>>>>>>> Hello Stephan
>>>>>>> It could be the same problem. The execds are still on 6.0u2; I
>>>>>>> will update them to 6.0u3 and test again.
>>>>>>>
>>>>>>
>>>>>> This is not fixed in u3. A simple check of the spool dirs and 
>>>>>> creating
>>>>>> the missing dirs will do it.
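
[For the archives: a rough sketch of such a spool dir check, in case
someone wants to script it. The list of expected subdirectories below
is an assumption; compare it against the spool dir of a healthy
execution host before relying on it.]

    #!/usr/bin/env python
    # Rough sketch: check an execd spool directory for expected
    # subdirectories and create the ones that are missing.
    # NOTE: the subdirectory list is an assumption; verify it against
    # a working execution host first.
    import os
    import sys

    spool_dir = sys.argv[1]   # e.g. $SGE_ROOT/$SGE_CELL/spool/<hostname>
    expected = ["active_jobs", "jobs", "job_scripts"]

    for name in expected:
        path = os.path.join(spool_dir, name)
        if os.path.isdir(path):
            print("ok:     ", path)
        else:
            print("missing:", path, "-> creating it")
            os.makedirs(path)
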
>>>>>>
>>>>>> Cheers,
>>>>>> Stephan
>>>>>>
>>>>>>> Thanks
>>>>>>> Christian
>>>>>>>
>>>>>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I just worked on a similar problem and was able to narrow it down
>>>>>>>> to a spooling issue on the execd side. The execd spool dir got
>>>>>>>> changed from NFS spooling to local spooling, which left an
>>>>>>>> incomplete set of dirs in the spool dir.
>>>>>>>>
>>>>>>>> I do not know if your problem is triggered by the same issue, or
>>>>>>>> whether I just got lucky and did not run into the t-state problem
>>>>>>>> again. The spool dir issue is number 103.
>>>>>>>>
>>>>>>>> Is your problem related to 103?
>>>>>>>>
>>>>>>>> We also found a bug in file staging. If the file does not exist,
>>>>>>>> the job will disappear and an email will be sent. Do you use file
>>>>>>>> staging for your jobs?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Stephan
>>>>>>>>
>>>>>>>> Christian Bolliger wrote:
>>>>>>>>
>>>>>>>>> Hello
>>>>>>>>> Sorry for bringing up a problem again. We are using SGE 6.0u3;
>>>>>>>>> I previously thought that the problem was linked to the
>>>>>>>>> file-handle problem in 6.0u2.
>>>>>>>>>
>>>>>>>>> Jobs in our Myrinet section tend to disappear in the starting
>>>>>>>>> phase (it seems that GBit MPICH jobs are also affected). They are
>>>>>>>>> taken into the 't' state and then quit without any output (users
>>>>>>>>> call it the 't' problem). Jobs using more CPUs are more likely to
>>>>>>>>> disappear.
>>>>>>>>> It is not limited to specific exec hosts. It seems to be a kind
>>>>>>>>> of race condition.
>>>>>>>>>
>>>>>>>>> This problem really hinders production; some users are demanding
>>>>>>>>> PBS :( .
>>>>>>>>>
>>>>>>>>> Many thanks for helping
>>>>>>>>> Christian
>>>>>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



