[GE users] failed receiving gdi request

geno ithildin at teomech.ugent.be
Tue Jun 5 13:34:25 BST 2007




The gdi error turned out to be caused by a non-SGE issue.
One user was creating huge files, which made the RAID controller unstable.
The problem was solved by installing new firmware for the RAID controller.

geno

> Daniel,
>
> Thank you for explaining what gdi does.
> We don't have different versions of SGE running, although the master
> is installed on a 32-bit Xeon host and the nodes are 64-bit AMD.
>
> Shortly after my mails I noticed a job filling up all disk space and 
> probably causing the master (or one disk or the whole RAID config) to 
> fail.
> So there might be a link between the gdi error and failing
> communication between the disks.
>
> I'll keep you informed when the master is up again.
>
> Geno
>
>
>
> Daniel Templeton wrote:
>> Geno,
>>
>> GDI is the protocol that the qmaster speaks.  The error that you're 
>> seeing says that the client received a message from the qmaster that 
>> it could not decipher.  When the Grid Engine communications library 
>> sends data out, it first has to translate the data into an 
>> on-the-wire format.  That process is called "packing."  "Unpacking" 
>> is the opposite.  The error says that the data in the message was 
>> garbled in such a way that it could not be translated from its 
>> on-the-wire format.  Such problems most often occur with mismatched 
>> versions.  (I've personally never seen it in any other case.)
>>
>> Daniel
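
To make the packing and unpacking described above concrete, here is a
minimal sketch in Python. It is not Grid Engine's real wire format: the
"GDI0" marker, the header layout, and the function names are all made up
for illustration. It only shows why a receiver rejects a message whose
bytes do not match the layout it expects, for example after a version
mismatch or when the data was damaged on the way.

```python
import struct

# Hypothetical header: 4-byte marker, 2-byte version, 4-byte payload length.
# "!" selects network byte order with fixed field sizes, so the layout does
# not depend on whether the host is 32-bit or 64-bit.
HEADER_FORMAT = "!4sHI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
MAGIC = b"GDI0"  # invented marker, not taken from Grid Engine


def pack_request(version, command):
    """Translate a request into its on-the-wire form ("packing")."""
    payload = command.encode("utf-8")
    header = struct.pack(HEADER_FORMAT, MAGIC, version, len(payload))
    return header + payload


def unpack_request(data, expected_version):
    """Translate wire bytes back into a request ("unpacking")."""
    if len(data) < HEADER_SIZE:
        raise ValueError("can't unpack request: truncated header")
    magic, version, length = struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE])
    if magic != MAGIC:
        raise ValueError("can't unpack request: bad marker")
    if version != expected_version:
        raise ValueError("can't unpack request: version mismatch")
    payload = data[HEADER_SIZE:]
    if len(payload) != length:
        raise ValueError("can't unpack request: payload length mismatch")
    return payload.decode("utf-8")


if __name__ == "__main__":
    wire = pack_request(1, "submit job")
    print(unpack_request(wire, 1))
    try:
        # A receiver that expects a different version refuses the message.
        unpack_request(wire, 2)
    except ValueError as err:
        print(err)
```

A message that arrives truncated or corrupted fails the same checks as
one from a mismatched version, so the receiver cannot tell the two
causes apart from the unpack error alone.
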
>>
>> geno wrote:
>>> hi,
>>>
>>> Reuti wrote:
>>>> Hi,
>>>>
>>>> On 25.05.2007 at 11:12, geno wrote:
>>>>
>>>>> We have freshly set up Grid Engine, version N1GE 6.0u9:
>>>>> qmaster on a Xeon, with 2.6.9-42.ELsmp i686
>>>>> sge_execd on Opteron nodes, with 2.6.9-42.ELsmp x86_64
>>>>>
>>>>> Our first jobs seemed to run fine.
>>>>> Parallel jobs did not run because MPI wasn't (and maybe isn't) set 
>>>>> up properly.
>>>>> So we got errors like "cannot run in PE "mpi" because it only 
>>>>> offers 0 slots"
>>>>
>>>> Did you set the number of slots in the PE definition to a sensible 
>>>> value, and also attach the PE to a cluster queue of your choice?
>>> The slots value corresponds to the total number of slots.
>>> Qmon shows mpi as a referenced PE for both of my queues.
>>>
>>> # qconf -sp mpi
>>> pe_name           mpi
>>> slots             140
>>> user_lists        astro1 maphy1
>>> xuser_lists       NONE
>>> start_proc_args   /nfsshare/sge-root/mpi/startmpi.sh $pe_hostfile
>>> stop_proc_args    /nfsshare/sge-root/mpi/stopmpi.sh
>>> allocation_rule   $fill_up
>>> control_slaves    FALSE
>>> job_is_first_task FALSE
>>> urgency_slots     min
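
As a rough way to sanity-check a PE definition like the one above, the
following Python sketch parses the text printed by "qconf -sp" and
reports settings that could leave a job without slots. The function
names are hypothetical, and the check on user_lists reflects an
assumption about how access lists restrict a PE; it does not check
whether the PE is attached to a queue.

```python
# Sample text copied from the "qconf -sp mpi" output shown in this thread.
PE_TEXT = """\
pe_name           mpi
slots             140
user_lists        astro1 maphy1
xuser_lists       NONE
start_proc_args   /nfsshare/sge-root/mpi/startmpi.sh $pe_hostfile
stop_proc_args    /nfsshare/sge-root/mpi/stopmpi.sh
allocation_rule   $fill_up
control_slaves    FALSE
job_is_first_task FALSE
urgency_slots     min
"""


def parse_pe_definition(text):
    """Turn 'key   value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and the shell prompt line
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def find_slot_problems(settings, access_list):
    """Return reasons why this PE might offer no slots to a job."""
    problems = []
    slots = settings.get("slots", "")
    if not slots.isdigit() or int(slots) == 0:
        problems.append("slots is missing, zero, or not a number")
    # Assumption: a user_lists value other than NONE restricts the PE
    # to members of the named access lists.
    allowed = settings.get("user_lists", "NONE").split()
    if allowed != ["NONE"] and access_list not in allowed:
        problems.append("access list %r is not in user_lists" % access_list)
    return problems


if __name__ == "__main__":
    pe = parse_pe_definition(PE_TEXT)
    print(find_slot_problems(pe, "astro1"))
```
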
>>>>
>>>>> By adding lamboot and lamhalt to the script, and making some 
>>>>> changes to the PE environment, these PE-related errors disappeared.
>>>>> Now we get a new error:
>>>>>    error: can't unpack gdi request
>>>>>    error: error unpacking gdi request: bad argument
>>>>>    failed receiving gdi request
>>>>
>>>> For a proper LAM/MPI integration, this might help:
>>>>
>>>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html 
>>>>
>>> Thanks. I'll have a closer look at this.
>>>
>>>>
>>>>> In your mailing list archive, this error was related to:
>>>>> - having different GE versions. We don't.
>>>>> - having too many messages in the read buffer. We don't (0).
>>>>>
>>>>> The gdi error now prevents us from starting new jobs, parallel or 
>>>>> not.
>>>>> I have no idea what gdi is. Does anyone know what is happening?
>>>>> geno
>>>>
>>>> Can you please check, whether any queues are in status E (error) 
>>>> and clear it by using qmod?
>>> One node had status E; I cleared it.
>>>
>>> The gdi error persists.
>>>
>>> geno.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>



