[GE users] failed receiving gdi request

Daniel Templeton Dan.Templeton at Sun.COM
Tue May 29 16:28:53 BST 2007



Geno,

GDI is the protocol that the qmaster speaks.  The error that you're 
seeing says that the client received a message from the qmaster that it 
could not decipher.  When the Grid Engine communications library sends 
data out, it first has to translate the data into an on-the-wire 
format.  That process is called "packing."  "Unpacking" is the 
opposite.  The error says that the data in the message was garbled in 
such a way that it could not be translated from its on-the-wire format.  
Such problems most often occur with mismatched versions.  (I've 
personally never seen it in any other case.)
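
To make "packing" and "unpacking" concrete, here is a minimal sketch in Python. It is not the real GDI wire format: the header layout (version, opcode, payload length) and the names pack_request, unpack_request and UnpackError are assumptions made up for illustration. It only shows how a receiver that expects a different version ends up rejecting a message it cannot translate.

```python
import struct

# Hypothetical header, NOT the real GDI layout: version (uint16),
# opcode (uint16), payload length (uint32), all in network byte order.
HEADER = struct.Struct("!HHI")


class UnpackError(ValueError):
    """Raised when a message cannot be translated back from wire format."""


def pack_request(version: int, opcode: int, payload: bytes) -> bytes:
    """Translate a request into its on-the-wire form ("packing")."""
    return HEADER.pack(version, opcode, len(payload)) + payload


def unpack_request(data: bytes, expected_version: int):
    """Translate wire data back into (opcode, payload) ("unpacking")."""
    if len(data) < HEADER.size:
        raise UnpackError("message shorter than header")
    version, opcode, length = HEADER.unpack_from(data)
    if version != expected_version:
        raise UnpackError(
            f"version mismatch: got {version}, expected {expected_version}"
        )
    payload = data[HEADER.size:]
    if len(payload) != length:
        raise UnpackError("payload length does not match header")
    return opcode, payload


if __name__ == "__main__":
    wire = pack_request(version=2, opcode=1, payload=b"job list")
    print(unpack_request(wire, expected_version=2))
    try:
        # A receiver built for another version cannot decipher the message.
        unpack_request(wire, expected_version=1)
    except UnpackError as exc:
        print("error unpacking request:", exc)
```

In this sketch the sender and receiver agree only if both were built with the same header definition, which is why a version mismatch is the usual suspect for this kind of error.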

Daniel

geno wrote:
> hi,
>
> Reuti wrote:
>> Hi,
>>
>> On 25.05.2007 at 11:12, geno wrote:
>>
>>> We freshly set up a GE, version N1GE 6.0u9
>>> qmaster on a Xeon, with 2.6.9-42.ELsmp i686
>>> sgeexecd  on Opteron nodes, with 2.6.9-42.ELsmp x86_64
>>>
>>> Our first jobs seemed to run fine.
>>> Parallel jobs did not run because MPI wasn't (and maybe isn't) set 
>>> up properly.
>>> So we got errors like "cannot run in PE "mpi" because it only offers 
>>> 0 slots"
>>
>> Did you set the number of slots in the PE definition to a sensible value 
>> and attach the PE to a cluster queue of your choice?
> The PE slot count corresponds to the total number of slots.
> Qmon shows mpi as the referenced PE for both of my queues.
>
> # qconf -sp mpi
> pe_name           mpi
> slots             140
> user_lists        astro1 maphy1
> xuser_lists       NONE
> start_proc_args   /nfsshare/sge-root/mpi/startmpi.sh $pe_hostfile
> stop_proc_args    /nfsshare/sge-root/mpi/stopmpi.sh
> allocation_rule   $fill_up
> control_slaves    FALSE
> job_is_first_task FALSE
> urgency_slots     min
>>
>>> By adding lamboot and lamhalt in the script, and adding some changes 
>>> to the PE environment, these PE related errors disappeared.
>>> Now we got a new error :
>>>    error: can't unpack gdi request
>>>    error: error unpacking gdi request: bad argument
>>>    failed receiving gdi request
>>
>> For a proper LAM/MPI integration, this might help:
>>
>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html 
>>
> Thanks. I'll have a closer look at this.
>
>>
>>> In your mailing list archive, this error was related to:
>>> - having different GE versions. We don't.
>>> - having too many messages in the read buffer. We don't (0).
>>>
>>> The gdi error prevents us now from starting new jobs, parallel or not.
>>> I have no idea what gdi is. Does anyone know what is happening?
>>> geno
>>
>> Can you please check whether any queues are in state E (error) and 
>> clear that state using qmod?
> One node had status E; I cleared it.
>
> The gdi error persists.
>
> geno.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
