[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Christian Bolliger christian.bolliger at id.unizh.ch
Mon Jan 24 14:05:15 GMT 2005



Sorry, I forgot that: it is lx26-amd64 (SuSE 9.2).
The binaries were compiled by us.

Christian Bolliger

Stephan Grell - Sun Germany - SSG - Software Engineer wrote:

> Christian Bolliger wrote:
>
>> Hello Stephan
>>
>> We have a cluster with 256 nodes: 128 nodes on Myrinet, 128 on GBit.
>> The bug shows up mainly on Myrinet; it also occurs in the GBit queue,
>> but there we mostly run sequential jobs.
>> SGE setup:
>> - NFS-reduced setup with local executables and local spool, but a
>> distributed configuration (for the shadow masters).
>> - Berkeley DB as the spooling mechanism on a dedicated server
>> (single point of failure).
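>>
>> For reference, a rough way to see which spool dir each exec host is
>> actually configured to use (a minimal sketch; the host list is a
>> placeholder and it assumes qconf is in the PATH):
>>
>> #!/usr/bin/env python
>> # Sketch: print the effective execd_spool_dir per exec host, falling
>> # back to the global value when no host-local configuration exists.
>> # HOSTS is a placeholder for your exec hosts; assumes qconf in PATH.
>> import subprocess
>>
>> HOSTS = ["node001", "node002"]
>>
>> def spool_dir(target):
>>     out = subprocess.check_output(["qconf", "-sconf", target])
>>     for line in out.decode().splitlines():
>>         if line.startswith("execd_spool_dir"):
>>             return line.split()[1]
>>     return None
>>
>> global_dir = spool_dir("global")
>> for host in HOSTS:
>>     try:
>>         local = spool_dir(host)
>>     except subprocess.CalledProcessError:
>>         local = None  # no host-local configuration defined
>>     print("%s: %s" % (host, local or global_dir))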
>
>
> Which operating systems are you using?
> - We saw the problem only on lx24-amd64. But that might be because
> only those machines were used for the large parallel jobs.
>
>>
>> The files will be sent directly; they are too big for the list.
>
>
> Thank you very much. I hope I will find something.
>
> Stephan
>
>>
>> Christian
>>
>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>
>>> Christian Bolliger wrote:
>>>
>>>> In that case it is not the same problem as 103 (I had overlooked
>>>> something). It is a statistical problem: it just appears sometimes,
>>>> more or less equally distributed over all hosts.
>>>> The problem of creating $SGE_ROOT/$SGE_CELL/spool/$hostname has
>>>> been solved by our installation procedure.
>>>>
>>> That means this bug is still in there. I was hoping it was just the
>>> spool dirs; it looks as if it is not only that.
>>> Do you have some more information for me? How is your grid set up?
>>> Which operating systems are you using? Is this problem operating
>>> system specific?
>>>
>>> Can you send me:
>>> - your acct file
>>> - the qmaster messages file
>>> - one or two execd messages files from hosts on which the problem
>>> happened?
>>>
>>> Thank you very much.
>>> Stephan
>>>
>>>> Thanks
>>>> Christian
>>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>
>>>>>
>>>>>
>>>>> Christian Bolliger wrote:
>>>>>
>>>>>> Hello Stephan
>>>>>> It could be the same problem. I still have the execds on 6.0u2;
>>>>>> I will update them to 6.0u3 and test again.
>>>>>>
>>>>>
>>>>> This is not fixed in u3. Simply checking the spool dirs and
>>>>> creating the missing ones will do it.
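>>>>>
>>>>> A minimal sketch of such a check, run on each exec host (the spool
>>>>> base path is a placeholder, and the expected subdirectories are an
>>>>> assumption based on a default execd spool layout):
>>>>>
>>>>> #!/usr/bin/env python
>>>>> # Sketch: create any missing execd spool subdirectories on this
>>>>> # host. SPOOL_BASE is a site-specific placeholder; the EXPECTED
>>>>> # list assumes the usual active_jobs/jobs/job_scripts layout.
>>>>> import os
>>>>> import socket
>>>>>
>>>>> SPOOL_BASE = "/var/spool/sge"
>>>>> EXPECTED = ("active_jobs", "jobs", "job_scripts")
>>>>>
>>>>> host_dir = os.path.join(SPOOL_BASE,
>>>>>                         socket.gethostname().split(".")[0])
>>>>> for sub in EXPECTED:
>>>>>     path = os.path.join(host_dir, sub)
>>>>>     if not os.path.isdir(path):
>>>>>         print("creating missing spool dir: " + path)
>>>>>         os.makedirs(path)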
>>>>>
>>>>> Cheers,
>>>>> Stephan
>>>>>
>>>>>> Thanks
>>>>>> Christian
>>>>>>
>>>>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I just worked on a similar problem and was able to track it
>>>>>>> down to a spooling issue on the execd side. The execd spool dir
>>>>>>> got changed from NFS spooling to local spooling, which produced
>>>>>>> an incomplete set of dirs in the spool dir.
>>>>>>>
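>>>>>>> If it helps to verify which kind of spooling a host actually
>>>>>>> ended up with, here is a rough Linux-only sketch (the spool
>>>>>>> path is a placeholder for your execd spool dir):
>>>>>>>
>>>>>>> #!/usr/bin/env python
>>>>>>> # Sketch: report whether SPOOL_DIR sits on an NFS mount or a
>>>>>>> # local filesystem, by finding its longest matching mount point
>>>>>>> # in /proc/mounts (Linux-specific). SPOOL_DIR is a placeholder.
>>>>>>> SPOOL_DIR = "/var/spool/sge"
>>>>>>>
>>>>>>> best = ("", "")
>>>>>>> with open("/proc/mounts") as mounts:
>>>>>>>     for line in mounts:
>>>>>>>         dev, mnt, fstype = line.split()[:3]
>>>>>>>         if SPOOL_DIR.startswith(mnt) and len(mnt) > len(best[0]):
>>>>>>>             best = (mnt, fstype)
>>>>>>> kind = "NFS" if best[1].startswith("nfs") else "local"
>>>>>>> print("%s is on %s (%s)" % (SPOOL_DIR, kind, best[1] or "unknown"))
>>>>>>>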
>>>>>>> I do not know whether your problem is triggered by the same
>>>>>>> issue, or whether I just got lucky and did not run into the
>>>>>>> t-state problem again. The spool dir issue is number 103.
>>>>>>>
>>>>>>> Is your problem related to 103?
>>>>>>>
>>>>>>> We also found a bug in file staging: if a staged file does not
>>>>>>> exist, the job will disappear and an email will be sent. Do you
>>>>>>> use file staging for your jobs?
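>>>>>>>
>>>>>>> Until that is fixed, a defensive pre-submit check might help; a
>>>>>>> trivial sketch (the input list and job script names are
>>>>>>> placeholders for your own files):
>>>>>>>
>>>>>>> #!/usr/bin/env python
>>>>>>> # Sketch: refuse to qsub when a staging input file is missing,
>>>>>>> # so the job cannot silently disappear. INPUTS and JOB_SCRIPT
>>>>>>> # are placeholders.
>>>>>>> import os
>>>>>>> import subprocess
>>>>>>> import sys
>>>>>>>
>>>>>>> INPUTS = ["input.dat", "params.cfg"]
>>>>>>> JOB_SCRIPT = "job.sh"
>>>>>>>
>>>>>>> missing = [f for f in INPUTS if not os.path.exists(f)]
>>>>>>> if missing:
>>>>>>>     sys.exit("not submitting; missing files: " + ", ".join(missing))
>>>>>>> subprocess.call(["qsub", JOB_SCRIPT])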
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stephan
>>>>>>>
>>>>>>> Christian Bolliger wrote:
>>>>>>>
>>>>>>>> Hello
>>>>>>>> Sorry for bringing up a problem again. Using SGE 6.0u3, I
>>>>>>>> previously thought that the problem was linked to the
>>>>>>>> file-handle problem in 6.0u2.
>>>>>>>>
>>>>>>>> Jobs in our Myrinet section tend to disappear in the starting
>>>>>>>> phase (GBit MPICH jobs also seem to be affected). They are
>>>>>>>> taken into the 't' state and then quit without any output
>>>>>>>> (users call it the 't' problem). Jobs using more CPUs are more
>>>>>>>> likely to disappear.
>>>>>>>> It is not limited to specific exec hosts. It seems to be a kind
>>>>>>>> of race condition.
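>>>>>>>>
>>>>>>>> One way to get numbers on how often it happens would be to log
>>>>>>>> jobs that show up in the 't' state; a rough sketch (it assumes
>>>>>>>> the default qstat column layout, with the job id first and the
>>>>>>>> state in the fifth column):
>>>>>>>>
>>>>>>>> #!/usr/bin/env python
>>>>>>>> # Sketch: poll qstat and record job ids seen in the
>>>>>>>> # transferring ('t') state, to spot jobs that vanish right
>>>>>>>> # afterwards. Assumes default qstat output columns.
>>>>>>>> import subprocess
>>>>>>>> import time
>>>>>>>>
>>>>>>>> seen = set()
>>>>>>>> while True:
>>>>>>>>     out = subprocess.check_output(["qstat", "-u", "*"]).decode()
>>>>>>>>     for line in out.splitlines()[2:]:  # skip the two header lines
>>>>>>>>         fields = line.split()
>>>>>>>>         if len(fields) >= 5 and fields[4] == "t" \
>>>>>>>>                 and fields[0] not in seen:
>>>>>>>>             seen.add(fields[0])
>>>>>>>>             print("job %s entered 't' state" % fields[0])
>>>>>>>>     time.sleep(10)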
>>>>>>>>
>>>>>>>> This problem really hinders production; some users are
>>>>>>>> demanding PBS :(.
>>>>>>>>
>>>>>>>> Many thanks for helping
>>>>>>>> Christian


-- 
=============================================================================
Christian Bolliger                 
IT Services                      | http://www.id.unizh.ch/
Central Systems / HPC            | http://www.matterhorn.unizh.ch/
University of Zuerich            | E-Mail: christian.bolliger at id.unizh.ch
Winterthurerstr. 190             | Tel: +41 (0)1 63 56775
CH-8057 Zuerich; Switzerland     | Fax: +41 (0)1 63 54505
Mime/S CA:                https://www.ca.unizh.ch/client/



