[GE users] failed receiving gdi request

geno ithildin at teomech.ugent.be
Fri Jun 1 15:14:36 BST 2007


Daniel,

Thank you for explaining what gdi does.
We don't have different versions of SGE running, except that the master 
is installed on a 32-bit Xeon host and the nodes are 64-bit AMD.

Shortly after my mails I noticed a job filling up all disk space and 
probably causing the master (or one disk or the whole RAID config) to fail.
So there might be a link between the gdi error and the failing 
communication between disks.

I'll keep you informed when the master is up again.

Geno



Daniel Templeton wrote:
> Geno,
>
> GDI is the protocol that the qmaster speaks.  The error that you're 
> seeing says that the client received a message from the qmaster that 
> it could not decipher.  When the Grid Engine communications library 
> sends data out, it first has to translate the data into an on-the-wire 
> format.  That process is called "packing."  "Unpacking" is the 
> opposite.  The error says that the data in the message was garbled in 
> such a way that it could not be translated from its on-the-wire 
> format.  Such problems most often occur with mismatched versions.  
> (I've personally never seen it in any other case.)
>
> Daniel
>
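As an illustration of the packing and unpacking that Daniel describes, here is a minimal Python sketch. The message layout (a version field, a request id and a length-prefixed payload) and the names `pack_request` and `unpack_request` are hypothetical, chosen only for this example; this is not Grid Engine's actual GDI wire format.

```python
import struct

# Hypothetical header layout (an assumption, NOT the real GDI format):
# 2-byte version, 4-byte request id, 2-byte payload length.
# "!" selects network byte order with fixed sizes and no padding.
HEADER = struct.Struct("!HIH")


def pack_request(version, request_id, payload):
    """Translate a request into its on-the-wire bytes ("packing")."""
    return HEADER.pack(version, request_id, len(payload)) + payload


def unpack_request(data, expected_version):
    """Translate on-the-wire bytes back into a request ("unpacking").

    Raises ValueError if the data cannot be deciphered.
    """
    if len(data) < HEADER.size:
        raise ValueError("message shorter than header")
    version, request_id, length = HEADER.unpack_from(data)
    if version != expected_version:
        raise ValueError(
            "version mismatch: got %d, expected %d" % (version, expected_version)
        )
    payload = data[HEADER.size:]
    if len(payload) != length:
        raise ValueError("payload length does not match header")
    return request_id, payload


if __name__ == "__main__":
    wire = pack_request(1, 42, b"qsub")
    print(unpack_request(wire, expected_version=1))
    try:
        # A receiver built for a different version rejects the message.
        unpack_request(wire, expected_version=2)
    except ValueError as err:
        print("unpack failed:", err)
```

In this sketch, a receiver expecting a different version rejects the message before interpreting it, which is loosely analogous to the mismatched-version case mentioned above. Because the header uses fixed-width fields in network byte order, the bytes are the same whether the sender is a 32-bit or a 64-bit host.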
> geno wrote:
>> hi,
>>
>> Reuti wrote:
>>> Hi,
>>>
>>> On 25.05.2007 at 11:12, geno wrote:
>>>
>>>> We freshly set up a GE, version N1GE 6.0u9
>>>> qmaster on a Xeon, with 2.6.9-42.ELsmp i686
>>>> sge_execd on Opteron nodes, with 2.6.9-42.ELsmp x86_64
>>>>
>>>> Our first jobs seemed to run fine.
>>>> Parallel jobs did not run because MPI wasn't (and maybe isn't) set 
>>>> up properly.
>>>> So we got errors like "cannot run in PE "mpi" because it only 
>>>> offers 0 slots"
>>>
>>> you set the number of slots in the PE definition to a sensible 
>>> value, and attached the PE also to a cluster queue of your choice?
>> The slots value corresponds to the total number of slots.
>> Qmon shows mpi as the referenced PE for both of my queues.
>>
>> # qconf -sp mpi
>> pe_name           mpi
>> slots             140
>> user_lists        astro1 maphy1
>> xuser_lists       NONE
>> start_proc_args   /nfsshare/sge-root/mpi/startmpi.sh $pe_hostfile
>> stop_proc_args    /nfsshare/sge-root/mpi/stopmpi.sh
>> allocation_rule   $fill_up
>> control_slaves    FALSE
>> job_is_first_task FALSE
>> urgency_slots     min
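The PE listing above can also be checked programmatically. The following is a minimal Python sketch that parses text in the `key value` layout printed by `qconf -sp` and reports whether the PE defines a non-zero slot count. The helper names `parse_pe_config` and `pe_offers_slots` are hypothetical, and the sample text is simply the listing from this thread; the sketch does not call Grid Engine itself and does not check whether the PE is attached to a queue.

```python
def parse_pe_config(text):
    """Parse 'key value' lines (as printed by `qconf -sp`) into a dict."""
    config = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue  # skip blank lines
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        config[key] = value
    return config


def pe_offers_slots(config):
    """Return True if the PE definition has a slot count above zero."""
    return int(config.get("slots", "0")) > 0


# Sample text copied from the listing in this thread.
SAMPLE = """\
pe_name           mpi
slots             140
user_lists        astro1 maphy1
xuser_lists       NONE
start_proc_args   /nfsshare/sge-root/mpi/startmpi.sh $pe_hostfile
stop_proc_args    /nfsshare/sge-root/mpi/stopmpi.sh
allocation_rule   $fill_up
control_slaves    FALSE
job_is_first_task FALSE
urgency_slots     min
"""

if __name__ == "__main__":
    pe = parse_pe_config(SAMPLE)
    print(pe["pe_name"], "offers slots:", pe_offers_slots(pe))
```

A PE with a sensible slot count can still be reported as offering 0 slots if it is not attached to any queue, so this check covers only the first half of Reuti's question.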
>>>
>>>> By adding lamboot and lamhalt in the script, and adding some 
>>>> changes to the PE environment, these PE related errors disappeared.
>>>> Now we got a new error :
>>>>    error: can't unpack gdi request
>>>>    error: error unpacking gdi request: bad argument
>>>>    failed receiving gdi request
>>>
>>> For a proper LAM/MPI integration, this might help:
>>>
>>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html 
>>>
>> Thanks. I'll have a closer look at this.
>>
>>>
>>>> In your mailing list archive, this error was related to:
>>>> - having different GE versions. We don't.
>>>> - having too many messages in the read buffer. We don't (0).
>>>>
>>>> The gdi error prevents us now from starting new jobs, parallel or not.
>>>> I have no idea what gdi is. Does anyone know what is happening?
>>>> geno
>>>
>>> Can you please check whether any queues are in state E (error) and 
>>> clear them by using qmod?
>> One node had status E; I cleared it.
>>
>> The gdi error persists.
>>
>> geno.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>



