[GE users] grid engine problem

Sandeep, Patel(IE10) Sandeep.Patel2 at Honeywell.com
Wed Nov 14 05:53:58 GMT 2007


Hi
   I think this problem is due to an improper installation. I have written
my steps below; can you check whether they are correct?

     1. I downloaded SGE 6.1u2.
     2. I unzipped and untarred it.
     3. I installed the master host on RHEL.
     4. I NFS-mounted $SGE_ROOT on the Windows system, where it appears as
the Z drive.
     5. Then, in the SFU Korn shell, I set $SGE_ROOT to /dev/fs/Z and
exported it.
     6. Then I started the installation of the execution host on Windows.
     7. During the installation I was asked about the spool directory, so
I created a spool directory on one drive of my Windows system.

      Are these steps correct, or am I missing something or doing something
wrong? If they are correct, why are the jobs not being executed on the
Windows execution host? All jobs stay pending. What should I do? This issue
is driving me mad. Can anybody help me out here?

Thanks 
Sandeep


-----Original Message-----
From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
Sent: Tuesday, November 13, 2007 7:05 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] grid engine problem

Let's summarize this problem:

The execd messages file tells us that the execd is unable to connect to
the qmaster. But the temporary messages files in the /tmp folder tell us
that the execd can establish a TCP connection but can't understand the
GDI packets it receives via this connection.
The execd writes its messages to the /tmp folder until it has received
its execd_spool_dir from the qmaster.
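
A minimal sketch of how the TCP level can be checked on its own, separately from the GDI level. It assumes Python is available on the execution host, which is not part of Grid Engine, and `can_connect` is a hypothetical helper name. A successful connect only shows that the port is reachable; it says nothing about whether the GDI packets are understood.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This checks reachability only; it does not speak the GDI protocol.
    """
    try:
        # create_connection resolves the name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False
```

With the host and port from the messages quoted further down, the call would be `can_connect("gridserver.sunnonegrid-bangalore.com", 536)`.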

This means: at some point in time, the execd received its
execd_spool_dir from the qmaster and wrote its messages to this spool
dir!

And now somehow the GDI packets are invalid. One reason I know for this 
is a version conflict between qmaster and execd.
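
To see which errors dominate, the messages files can be tallied. The sketch below assumes the pipe-separated layout visible in the lines quoted later in this thread (timestamp|component|host|level|message). The names `MessageLine`, `parse_line` and `count_errors` are hypothetical; this is not a Grid Engine tool.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class MessageLine(NamedTuple):
    timestamp: str
    component: str
    host: str
    level: str
    text: str


def parse_line(line: str) -> Optional[MessageLine]:
    """Split one messages-file line into its five pipe-separated fields.

    Returns None for lines that do not have that shape (e.g. wrapped text).
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return MessageLine(*(p.strip() for p in parts))


def count_errors(lines: Iterable[str]) -> Counter:
    """Count error-level ("E") messages by their text."""
    counts: Counter = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level == "E":
            counts[entry.text] += 1
    return counts
```

Passing the lines of whichever messages file the execd wrote (in /tmp or in the spool dir) to `count_errors` gives a count per distinct error text.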


What happens if you run e.g. "qstat" from the Windows host? Does it also
fail with a GDI error?


Regards,
Harald


Sandeep, Patel(IE10) wrote:
> Hi
>  In the SFU 3.5 Korn shell, when I type ps ax I get output like this:
> 
>     1221 - 0:00:21 sge_execd
> So it means the execution daemon is running. So what is the issue now?
> 
> Thanks 
> sandeep
> 
> -----Original Message-----
> From: Ravichandra.Nallan at Sun.COM [mailto:Ravichandra.Nallan at Sun.COM] 
> Sent: Tuesday, November 13, 2007 5:43 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] grid engine problem
> 
> Hi Sandeep,
>  I assume you have gone through the setup guide, especially the SFU part.
> Please check if the firewall is blocking the telnet (23) port.
> Were you able to install the execd on the Windows host, as this requires
> that it is able to contact the qmaster (at least read the config files)?
> Can you also check if sge_execd is already running?
> regards,
> ~Ravi
> 
> Harald Pollinger wrote:
>> Can somebody help who is more experienced in GDI problems?
>>
>> Thanks!
>> Harald
>>
>> Sandeep, Patel(IE10) wrote:
>>> Hi, both Windows and Linux are showing SGE 6.1u2.
>>> Then what may be the issue?
>>> Thanks sandeep
>>>
>>> -----Original Message-----
>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>> Sent: Tuesday, November 13, 2007 4:58 PM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] grid engine problem
>>>
>>> Sandeep, Patel(IE10) wrote:
>>>> Hi
>>>>   I m getting the  output:
>>>>
>>>> $ telnet gridserver.sunnonegrid-bangalore.com 536
>>>> Trying 199.63.61.100...
>>>> Connected to gridserver.sunnonegrid-bangalore.com.
>>>> Escape character is '^]'.
>>>> Connection closed by foreign host.
>>>> $
>>>>
>>>> Then what is the issue?
>>>>
>>>> And in the /tmp folder on Windows I am getting messages like:
>>>>
>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|can't unpack gdi request
>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|error unpacking gdi request: bad argument
>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|getting configuration: failed receiving gdi request
>>>> 11/13/2007 15:08:05|execd|ie10dtdc3zl1s|E|can't unpack gdi request
>>>>
>>>> What are these? How can I resolve them?
>>> I think this means that your Windows execution daemon and your master
>>> daemon are of different versions.
>>>
>>> Start on the Windows host:
>>> # $SGE_ROOT/bin/win32-x86/sge_execd -help
>>>
>>> and on the QMaster host:
>>> # $SGE_ROOT/bin/lx24-x86/sge_qmaster -help
>>> (I'm not sure about the lx24, it could also be lx26, and the x86 could be
>>> amd64 or ia64, depending on your architecture and RHEL version)
>>>
>>> to get their version numbers.
>>>
>>> Regards,
>>> Harald
>>>
>>>
>>>
>>>>
>>>> Thanks
>>>> sandeep
>>>>
>>>> -----Original Message-----
>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>>> Sent: Tuesday, November 13, 2007 4:20 PM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] grid engine problem
>>>>
>>>> Sandeep, Patel(IE10) wrote:
>>>>> Hi
>>>>>      Actually I have one Windows system, and inside that I have installed
>>>>> VMware. In it I am running two RHEL virtual machines. One of the
>>>>> virtual RHEL machines is my master host, the other is an execution host.
>>>>> And the host Windows OS is one execution host. Is that the problem?
>>>> This should not be a problem. Maybe you will have to change some
>>>> settings.
>>>>
>>>>
>>>>> With PuTTY I am able to connect from the Windows execution host to the
>>>>> RHEL master host through SSH. But telnet shows a network error.
>>>>> How can I fix this?
>>>> I think you got me wrong. Try to connect to the qmaster itself, not to
>>>> the telnetd of the qmaster host. Use
>>>> # telnet gridserver.sunnonegrid-bangalore.com 536
>>>>
>>>> It should print
>>>>
>>>> Trying [IP address of gridserver.sunnonegrid-bangalore.com]...
>>>> Connected to gridserver.sunnonegrid-bangalore.com.
>>>> Escape character is '^]'.
>>>>
>>>>
>>>> Regards,
>>>> Harald
>>>>
>>>>
>>>>> Thanks Sandeep
>>>>>
>>>>> -----Original Message-----
>>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>>>> Sent: Tuesday, November 13, 2007 2:50 PM
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] grid engine problem
>>>>>
>>>>> Sandeep, Patel(IE10) wrote:
>>>>>> Hi
>>>>>>     I checked messages and I got something like this:
>>>>>> 11/13/2007 12:38:16|execd|ie10dtdc3zl1s|E|commlib error: endpoint is not unique error (endpoint "ie10dtdc3zl1s.global.ds.honeywell.com/execd/1" is already connected)
>>>>> Is there more than one "sge_execd" instance running on that host?
>>>>> If yes, please kill all and start only one of them again.
>>>>>
>>>>>
>>>>>> 11/13/2007 12:38:16|execd|ie10dtdc3zl1s|E|getting configuration: unable to contact qmaster using port 536 on host "gridserver.sunnonegrid-bangalore.com"
>>>>> Is there a firewall running somewhere on or between the execution host
>>>>> and the master host?
>>>>> Is it possible to connect from the execution host to the qmaster using
>>>>> telnet?
>>>>>
>>>>>
>>>>> Regards,
>>>>> Harald
>>>>>
>>>>>
>>>>>> 11/13/2007 12:38:19|execd|ie10dtdc3zl1s|E|can't get configuration from qmaster -- backgrounding
>>>>>>
>>>>>> How can I solve this problem?
>>>>>>
>>>>>> Thanks sandeep
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ravichandra.Nallan at Sun.COM
> [mailto:Ravichandra.Nallan at Sun.COM]
>>>>>> Sent: Tuesday, November 13, 2007 12:25 PM
>>>>>> To: users at gridengine.sunsource.net
>>>>>> Subject: Re: [GE users] grid engine problem
>>>>>>
>>>>>> Hi Sandeep,
>>>>>>  From the qstat output it is evident (state "au") that the execd on host
>>>>>> ie10dtdc3z11s.<something....> is not up. Check whether there are any
>>>>>> problems preventing the execd from coming up (check
>>>>>> $SGE_ROOT/$SGE_CELL/spool/<hostname>/messages).
>>>>>> This is the reason why the jobs are not scheduled to this host.
>>>>>>
>>>>>> (For info on queue states, check the qstat(1) man page; you can also see
>>>>>> in qstat that the load_avg/arch is -NA-!)
>>>>>>
>>>>>> Hope this helps.
>>>>>> regards,
>>>>>> ~Ravi
>>>>>>
>>>>>> Sandeep, Patel(IE10) wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> 1. I have my *master* host on RHEL.
>>>>>>>
>>>>>>> 2. I have two *execution* hosts:
>>>>>>>
>>>>>>> A. one is on *Windows*
>>>>>>>
>>>>>>> B. the other one is on *RHEL*
>>>>>>>
>>>>>>> 3. When I submit the job *simple.sh* (4 times) and then type the
>>>>>>> command *qstat -f*, the jobs always go to the RHEL execution host
>>>>>>> for execution, because used/total *is 2/2* for RHEL but *0/2* for
>>>>>>> Windows. The jobs are *pending* for some time and are *later taken
>>>>>>> by* the RHEL execution host.
>>>>>>>
>>>>>>> 4. It means the job is not distributed among the hosts *!!!!*
>>>>>>>
>>>>>>> 5. How can I solve this?
>>>>>>>
>>>>>>> 6. In this connection I have *attached* some *screen shots*. Can
>>>>>>> you please check them out?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> sandeep
>>
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         N1 Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1,
D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
