[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Christian Bolliger christian.bolliger at id.unizh.ch
Mon Jan 24 13:54:32 GMT 2005



Hello Stephan

We have a cluster with 256 nodes: 128 nodes on Myrinet and 128 nodes on
Gbit. The bug mainly matters on Myrinet; it also occurs on the Gbit queue,
but there we run mostly sequential jobs.
SGE setup:
- NFS-reduced setup with local executables and local spool, but
distributed configuration (for shadow masters).
- Berkeley DB as the spool mechanism, on a dedicated server (single
point of failure).

The files will be sent to you directly; they are too big for the list.

Christian
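
For reference, the workaround suggested in the quoted thread below ("a simple check of the spool dirs and creating the missing dirs") could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `ensure_spool_dirs` is hypothetical, and the subdirectory names in `EXPECTED_SUBDIRS` are an assumption about what an execd spool dir normally contains; they are not confirmed anywhere in this thread and should be compared against a known-good host first.

```python
from pathlib import Path

# ASSUMPTION: these subdirectory names are illustrative only; verify them
# against a correctly working execd spool directory before relying on them.
EXPECTED_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def ensure_spool_dirs(spool_root, hostname, subdirs=EXPECTED_SUBDIRS):
    """Create any missing subdirectories below <spool_root>/<hostname>.

    Returns the list of directories that had to be created, so an empty
    list means the spool directory was already complete.
    """
    host_dir = Path(spool_root) / hostname
    created = []
    for name in subdirs:
        path = host_dir / name
        if not path.is_dir():
            # parents=True also creates <spool_root>/<hostname> if needed
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

With local spooling, `spool_root` would be the local execd spool directory on each node; with the default layout mentioned in the thread it would be `$SGE_ROOT/$SGE_CELL/spool`. The check would have to run on every exec host, as the user that owns the spool directory.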

Stephan Grell - Sun Germany - SSG - Software Engineer wrote:

> Christian Bolliger wrote:
>
>> In that case it is not the same problem as 103 (I overlooked 
>> something). It is a statistical problem, it just appears sometimes, 
>> more or less equally distributed over all hosts.
>> The problem of creating $SGE_ROOT/$SGE_CELL/spool/$hostname has been 
>> solved by our installation procedure.
>>
> That means that this bug is still in there. I was hoping that it
> was just the spool dirs. It looks as if it is not only that.
> Do you have some more information for me? How is your grid set up?
> Which operating systems are you using? Is this problem operating
> system specific?
>
> Can you send me:
> - your acct file
> - qmaster messages file
> - one or two execd messages files from hosts on which the problem happened?
>
> Thank you very much.
> Stephan
>
>> Thanks
>> Christian
>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>
>>>
>>>
>>> Christian Bolliger wrote:
>>>
>>>> Hello Stephan
>>>> Could be the same problem. I have the execd still on 6.0u2, I will 
>>>> update them to 6.0u3 and test again.
>>>>
>>>
>>> This is not fixed in u3. A simple check of the spool dirs and creating
>>> the missing dirs will do it.
>>>
>>> Cheers,
>>> Stephan
>>>
>>>> Thanks
>>>> Christian
>>>>
>>>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>
>>>>  
>>>>
>>>>> Hello,
>>>>>
>>>>> I just worked on a similar problem and was able to narrow it down
>>>>> to a spooling issue on the execd side. The execd spool dir got
>>>>> changed from NFS spooling to local spooling. This generated an
>>>>> incomplete set of dirs in the spool dir.
>>>>>
>>>>> I do not know whether your problem is triggered by the same issue
>>>>> or I just got lucky not to run into the t-state problem again. The
>>>>> spool dir issue is number 103.
>>>>>
>>>>> Is your problem related to 103?
>>>>>
>>>>> We also found a bug in file staging. If the file does not exist, the
>>>>> job will disappear and an email will be sent. Do you use file
>>>>> staging for your jobs?
>>>>>
>>>>> Cheers,
>>>>> Stephan
>>>>>
>>>>> Christian Bolliger wrote:
>>>>>
>>>>>
>>>>>
>>>>>  
>>>>>
>>>>>> Hello
>>>>>> Sorry for bringing up a problem again. We are using SGE 6.0u3; I
>>>>>> previously thought that the problem was linked to the file handle
>>>>>> problem in 6.0u2.
>>>>>>
>>>>>> Jobs in our Myrinet section tend to disappear in the starting
>>>>>> phase (it seems that Gbit MPICH jobs are also affected). They are
>>>>>> put into 't' state and then quit without any output (users
>>>>>> call it the 't' problem). Jobs using more CPUs are more likely to
>>>>>> disappear.
>>>>>> It is not limited to specific exec hosts. It seems to be a kind
>>>>>> of race condition.
>>>>>>
>>>>>> This problem really hinders production, some users are demanding 
>>>>>> PBS :( .
>>>>>>
>>>>>> Many thanks for helping
>>>>>> Christian
>>>>>>
>>>>>>
>>>>>>  
>>>>>>     
>>>>>
>>>>>
>>>>
>>>>  
>>>>
>>
>
>


-- 
=============================================================================
Christian Bolliger                 
IT Services                      | http://www.id.unizh.ch/
Central Systems / HPC   	 | http://www.matterhorn.unizh.ch/
University of  Zuerich           | E-Mail: christian.bolliger at id.unizh.ch
Winterthurerstr. 190             | Tel: +41 (0)1 63 56775
CH-8057 Zuerich; Switzerland     | Fax: +41 (0)1 63 54505
Mime/S CA:                https://www.ca.unizh.ch/client/



