[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Christian Bolliger christian.bolliger at id.unizh.ch
Mon Jan 24 14:43:10 GMT 2005


It seems it came with u2, but u1 was only in test use on a limited
number of nodes.

Stephan Grell - Sun Germany - SSG - Software Engineer wrote:

> Did you have the problems right from the beginning or did it come with
> u2? Did you use u1?
>
>
> Have you configured the admin email? Whenever something important
> goes wrong in the execd, it tries to send an email.
>
> Stephan
>
> Christian Bolliger wrote:
>
>> Sorry, I forgot that: it is lx26-amd64 (SuSE 9.2).
>> The binaries were compiled by us.
>
>
> Hm.. now I know of 2 grids using lx2?-amd64, with this problem...
>
>>
>> Christian Bolliger
>>
>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>
>>> Christian Bolliger wrote:
>>>
>>>> Hello Stephan
>>>>
>>>> We have a cluster with 256 nodes: 128 nodes on Myrinet, 128 nodes on
>>>> Gbit. The bug mainly matters on Myrinet; it also occurs on the Gbit
>>>> queue, but there we run mostly sequential jobs.
>>>> SGE setup:
>>>> - NFS reduced setup with local executables and local spool but
>>>> distributed configuration (for shadow masters).
>>>> - Using Berkeley DB as spool mechanism on a dedicated server
>>>> (single point of failure).
>>>
>>>
>>>
>>>
>>> Which operating systems are you using?
>>> - We saw the problem only on lx24-amd64, but that might be because
>>> only those machines were used for the large parallel jobs.
>>>
>>>>
>>>> The files will be sent directly; they are too big for the list.
>>>
>>>
>>>
>>>
>>> Thank you very much. I hope I will find something.
>>>
>>> Stephan
>>>
>>>>
>>>> Christian
>>>>
>>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>
>>>>> Christian Bolliger wrote:
>>>>>
>>>>>> In that case it is not the same problem as 103 (I overlooked 
>>>>>> something). It is a statistical problem, it just appears 
>>>>>> sometimes, more or less equally distributed over all hosts.
>>>>>> The problem of creating $SGE_ROOT/$SGE_CELL/spool/$hostname has 
>>>>>> been solved by our installation procedure.
>>>>>>
>>>>> That means that this bug is still in there. I was hoping that
>>>>> it was just the spool dirs, but it looks as if it is not only that.
>>>>> Do you have some more information for me? How is your grid set
>>>>> up? Which operating systems are you using? Is this problem
>>>>> operating system specific?
>>>>>
>>>>> Can you send me:
>>>>> - your acct file
>>>>> - qmaster messages file
>>>>> - one or two execd messages files from hosts on which the problem
>>>>> happened?
>>>>>
>>>>> Thank you very much.
>>>>> Stephan
>>>>>
>>>>>> Thanks
>>>>>> Christian
>>>>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Christian Bolliger wrote:
>>>>>>>
>>>>>>>> Hello Stephan
>>>>>>>> Could be the same problem. I have the execd still on 6.0u2, I 
>>>>>>>> will update them to 6.0u3 and test again.
>>>>>>>>
>>>>>>>
>>>>>>> This is not fixed in u3. A simple check of the spool dirs and 
>>>>>>> creating
>>>>>>> the missing dirs will do it.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stephan
>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Christian
>>>>>>>>
>>>>>>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>>>>>
>>>>>>>>  
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I just worked on a similar problem and was able to narrow it down
>>>>>>>>> to a spooling issue on the execd side. The execd spool dir had been
>>>>>>>>> changed from NFS spooling to local spooling, which left an
>>>>>>>>> incomplete set of dirs in the spool dir.
>>>>>>>>>
>>>>>>>>> I do not know whether your problem is triggered by the same issue
>>>>>>>>> or whether I just got lucky and did not run into the t-state
>>>>>>>>> problem again. The spool dir issue is number 103.
>>>>>>>>>
>>>>>>>>> Is your problem related to 103?
>>>>>>>>>
>>>>>>>>> We also found a bug in file staging: if the file does not exist,
>>>>>>>>> the job will disappear and an email will be sent. Do you use file
>>>>>>>>> staging for your jobs?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Stephan
>>>>>>>>>
>>>>>>>>> Christian Bolliger wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  
>>>>>>>>>
>>>>>>>>>> Hello
>>>>>>>>>> Sorry for bringing up a problem again. Using SGE 6.0u3, I
>>>>>>>>>> previously thought that the problem was linked to the
>>>>>>>>>> filehandle problem in 6.0u2.
>>>>>>>>>>
>>>>>>>>>> Jobs in our Myrinet section tend to disappear in the starting
>>>>>>>>>> phase (it seems that Gbit mpich jobs are affected too). They
>>>>>>>>>> go into 't' state and then quit without any output (users call
>>>>>>>>>> it the 't' problem). Jobs using more CPUs are more likely to
>>>>>>>>>> disappear.
>>>>>>>>>> It is not limited to specific exec hosts. It seems to be a
>>>>>>>>>> kind of race condition.
>>>>>>>>>>
>>>>>>>>>> This problem really hinders production, some users are 
>>>>>>>>>> demanding PBS :( .
>>>>>>>>>>
>>>>>>>>>> Many thanks for helping
>>>>>>>>>> Christian
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net

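A minimal sketch of the spool dir check suggested in the thread above
("a simple check of the spool dirs and creating the missing dirs"),
written in Python. The subdirectory names (active_jobs, jobs,
job_scripts) and the helper names are assumptions made for illustration
and are not taken from the thread; compare them with the spool dir of a
healthy execd host before relying on them. The path layout
$SGE_ROOT/$SGE_CELL/spool/$hostname is the one mentioned above.

```python
from pathlib import Path

# ASSUMPTION: subdirectories an execd expects inside its spool dir.
# Illustrative only; verify against a working exec host.
EXPECTED_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def spool_dir_for_host(sge_root, sge_cell, hostname):
    """Build $SGE_ROOT/$SGE_CELL/spool/$hostname as a Path."""
    return Path(sge_root) / sge_cell / "spool" / hostname


def missing_spool_dirs(spool_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that do not exist under spool_dir."""
    spool_dir = Path(spool_dir)
    return [name for name in expected if not (spool_dir / name).is_dir()]


def repair_spool_dir(spool_dir, expected=EXPECTED_SUBDIRS):
    """Create the missing subdirectories and return the names created."""
    spool_dir = Path(spool_dir)
    created = missing_spool_dirs(spool_dir, expected)
    for name in created:
        # parents=True also creates the host spool dir itself if absent.
        (spool_dir / name).mkdir(parents=True, exist_ok=True)
    return created
```

It could be run on each exec host as the user that owns the spool dir,
for example repair_spool_dir(spool_dir_for_host(root, cell, host)), so
that the new directories end up with the same ownership as the existing
ones.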

-- 
=============================================================================
Christian Bolliger                 
IT Services                      | http://www.id.unizh.ch/
Central Systems / HPC   	 | http://www.matterhorn.unizh.ch/
University of  Zuerich           | E-Mail: christian.bolliger at id.unizh.ch
Winterthurerstr. 190             | Tel: +41 (0)1 63 56775
CH-8057 Zuerich; Switzerland     | Fax: +41 (0)1 63 54505
Mime/S CA:                https://www.ca.unizh.ch/client/



