[GE users] grid engine problem

Ravi Chandra Nallan Ravichandra.Nallan at Sun.COM
Tue Nov 13 12:13:16 GMT 2007



Hi Sandeep,
I assume you have gone through the setup guide, especially the SFU part.
Please check whether the firewall is blocking the telnet port (23).
Were you able to install the execd on the Windows host? The installation
requires that the host can contact the qmaster (or at least read the
config files).
Can you also check whether sge_execd is already running?
regards,
~Ravi
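
For the firewall question, a scripted version of the telnet test to the qmaster port discussed further down may be handy. The following is only a minimal sketch: the function name can_connect and the timeout value are illustrative assumptions, and it uses nothing from Grid Engine itself, only a plain TCP connect.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it only tests reachability
    and does not speak the Grid Engine protocol.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Called as can_connect("gridserver.sunnonegrid-bangalore.com", 536) from the execution host, a True result only shows that the port is reachable through any firewall. It does not show that the two daemons can understand each other.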

Harald Pollinger wrote:
> Can somebody help who is more experienced in GDI problems?
>
> Thanks!
> Harald
>
> Sandeep, Patel(IE10) wrote:
>> Hi, both Windows and Linux are showing SGE 6.1u2.
>> Then what may be the issue?
>> Thanks, Sandeep
>>
>> -----Original Message-----
>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>> Sent: Tuesday, November 13, 2007 4:58 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] grid engine problem
>>
>> Sandeep, Patel(IE10) wrote:
>>> Hi
>>> I am getting this output:
>>>
>>> $ telnet gridserver.sunnonegrid-bangalore.com 536
>>> Trying 199.63.61.100...
>>> Connected to gridserver.sunnonegrid-bangalore.com.
>>> Escape character is '^]'.
>>> Connection closed by foreign host.
>>> $
>>>
>>> Then what is the issue?
>>>
>>> And in the /tmp folder on Windows I am getting messages like these:
>>>
>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|can't unpack gdi request
>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|error unpacking gdi request: bad argument
>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|getting configuration: failed receiving gdi request
>>> 11/13/2007 15:08:05|execd|ie10dtdc3zl1s|E|can't unpack gdi request
>>>
>>> What are these? How can I resolve this?
>>
>> I think this means that your Windows execution daemon and your master 
>> daemon are of different versions.
>>
>> Run on the Windows host:
>> # $SGE_ROOT/bin/win32-x86/sge_execd -help
>>
>> and on the QMaster host:
>> # $SGE_ROOT/bin/lx24-x86/sge_qmaster -help
>> (I'm not sure about the lx24; it could also be lx26, and the x86 could be
>> amd64 or ia64, depending on your architecture and RHEL version)
>>
>> to get their version numbers.
>>
>> Regards,
>> Harald
>>
>>
>>
>>>
>>>
>>> Thanks
>>> sandeep
>>>
>>> -----Original Message-----
>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>> Sent: Tuesday, November 13, 2007 4:20 PM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] grid engine problem
>>>
>>> Sandeep, Patel(IE10) wrote:
>>>> Hi
>>>> Actually, I have one Windows system with VMware installed on it, and
>>>> in that I am running two RHEL virtual machines. One of the virtual
>>>> RHEL machines is my master host and the other is an execution host.
>>>> The Windows host OS is also an execution host. Is that the problem?
>>> This should not be a problem. Maybe you will have to change some
>>> settings.
>>>
>>>
>>>> Using PuTTY, I am able to connect from the Windows execution host to
>>>> the RHEL master host through SSH. But telnet shows a network error.
>>>> How can I fix this?
>>> I think you got me wrong. Try to connect to the qmaster itself, not to
>>> the telnetd of the qmaster host. Use
>>> # telnet gridserver.sunnonegrid-bangalore.com 536
>>>
>>> It should print
>>>
>>> Trying [IP address of gridserver.sunnonegrid-bangalore.com]...
>>> Connected to gridserver.sunnonegrid-bangalore.com.
>>> Escape character is '^]'.
>>>
>>>
>>> Regards,
>>> Harald
>>>
>>>
>>>> Thanks Sandeep
>>>>
>>>> -----Original Message-----
>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>>> Sent: Tuesday, November 13, 2007 2:50 PM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] grid engine problem
>>>>
>>>> Sandeep, Patel(IE10) wrote:
>>>>> Hi
>>>>> I checked the messages file and found something like this:
>>>>>
>>>>> 11/13/2007 12:38:16|execd|ie10dtdc3zl1s|E|commlib error: endpoint is not unique error (endpoint "ie10dtdc3zl1s.global.ds.honeywell.com/execd/1" is already connected)
>>>> Is there more than one "sge_execd" instance running on that host?
>>>> If so, please kill them all and start only one again.
>>>>
>>>>
>>>>> 11/13/2007 12:38:16|execd|ie10dtdc3zl1s|E|getting configuration: unable to contact qmaster using port 536 on host "gridserver.sunnonegrid-bangalore.com"
>>>> Is there a firewall running somewhere on or between the execution host
>>>> and the master host?
>>>> Is it possible to connect from the execution host to the qmaster using
>>>> telnet?
>>>>
>>>>
>>>> Regards,
>>>> Harald
>>>>
>>>>
>>>>> 11/13/2007 12:38:19|execd|ie10dtdc3zl1s|E|can't get configuration from qmaster -- backgrounding
>>>>>
>>>>> How can I solve this problem?
>>>>>
>>>>> Thanks, Sandeep
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ravichandra.Nallan at Sun.COM [mailto:Ravichandra.Nallan at Sun.COM]
>>>>> Sent: Tuesday, November 13, 2007 12:25 PM
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] grid engine problem
>>>>>
>>>>> Hi Sandeep,
>>>>> From the qstat output it is evident (state "au") that the execd on host
>>>>> ie10dtdc3z11s.<something....> is not up. Check whether anything is
>>>>> preventing the execd from coming up (see
>>>>> $SGE_ROOT/$SGE_CELL/spool/<hostname>/messages).
>>>>> This is why the jobs are not scheduled to this host.
>>>>>
>>>>> (For information on queue states, check the qstat(1) man page. You can
>>>>> also see in qstat that the load_avg/arch is -NA-.)
>>>>>
>>>>> Hope this helps.
>>>>> regards,
>>>>> ~Ravi
>>>>>
>>>>> Sandeep, Patel(IE10) wrote:
>>>>>> Hi
>>>>>>
>>>>>> 1. I have my master host on RHEL.
>>>>>>
>>>>>> 2. I have two execution hosts:
>>>>>>
>>>>>> A. one is on Windows
>>>>>>
>>>>>> B. the other is on RHEL
>>>>>>
>>>>>> 3. When I submit the job simple.sh four times and then run
>>>>>> qstat -f, the jobs always go to the RHEL execution host:
>>>>>> "used/total" is 2/2 for RHEL but 0/2 for Windows. The remaining
>>>>>> jobs are pending for some time and are later taken by the RHEL
>>>>>> execution host.
>>>>>>
>>>>>> 4. It means the jobs are not distributed among the hosts.
>>>>>>
>>>>>> 5. How can I solve this?
>>>>>>
>>>>>> 6. I have attached some screenshots showing this. Could you
>>>>>> please check them?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> sandeep
>
>
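
The execd messages quoted in this thread appear to use a pipe-separated layout of timestamp, daemon, host, level and text. To see which errors repeat, the file can be summarised with a small script. This is a rough sketch under that assumption: the field names, the LogEntry type and both function names are made up for illustration, and treating level "E" as "error" is inferred from the quoted lines, not taken from the Grid Engine documentation.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, NamedTuple, Optional


class LogEntry(NamedTuple):
    """One line of a messages file; field names are assumptions."""
    timestamp: datetime
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse 'MM/DD/YYYY HH:MM:SS|daemon|host|level|text', else None."""
    # Split at most four times so that '|' inside the text is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    try:
        stamp = datetime.strptime(parts[0].strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogEntry(stamp, parts[1], parts[2], parts[3], parts[4])


def count_errors(lines: Iterable[str]) -> Counter:
    """Count how often each level-'E' message text occurs."""
    counts: Counter = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level == "E":
            counts[entry.message] += 1
    return counts
```

Passing an open messages file to count_errors gives a per-message tally. Lines that do not match the assumed layout are skipped, so an entry wrapped over several lines would be counted only by its first line.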
