[GE users] grid engine problem

Sandeep, Patel(IE10) Sandeep.Patel2 at Honeywell.com
Wed Nov 14 11:20:29 GMT 2007


Hi,
   Actually, I don't know much about SFU. Can you please tell me how
to automount my $SGE_ROOT to /opt/sge in Windows SFU?

Thanks
sandeep

-----Original Message-----
From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
Sent: Wednesday, November 14, 2007 4:14 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] grid engine problem

Sandeep, Patel(IE10) wrote:
> Hi
>    I think this problem is due to an improper installation. Below I have
> written my steps. Can you check whether they are correct?
> 
>      1. I have downloaded the SGE6.1u2
>      2. Untar and unzipped it
>      3. I installed the master host in RHEL
>      4. I nfs mounted the $SGE_ROOT to windows system, it appears as Z
> drive in windows.
>      5. Then in SFU KORN shell I set the $SGE_ROOT=/dev/fs/Z, then I
> export it.

I would use the SFU automounter, which mounts shares automatically at
boot time. Windows drive mappings are risky: a mapping may exist only
while you are logged in, but not when the execd starts.
All available network shares are automounted to
"/net/<hostname>/<share>".
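
As a rough sketch of the above, SGE_ROOT could be built from the automount
path instead of the drive letter. The helper name automount_path and the
host and share names in the example are made up for illustration; only the
/net/<hostname>/<share> layout comes from the description above.

```shell
# Hypothetical helper: build the SFU automount path for a network share.
# Only the /net/<hostname>/<share> layout is taken from this thread.
automount_path() {
    # $1 = host that exports the share, $2 = name of the share
    printf '/net/%s/%s\n' "$1" "$2"
}

# Example with placeholder names (replace with your own host and share):
# SGE_ROOT=$(automount_path mymaster sge)
# export SGE_ROOT
```

Whether the path actually exists on your machine still has to be checked
by hand, e.g. with ls, before you start the execd.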


>      6. Then I started the installation of execution host in windows.
>      7. During the installation I was asked about the spool directory.
> I have created a spool directory on one drive of my windows system.

That's correct, the spool directory has to be on a local drive.


>       Are these steps correct? Am I missing something or doing something
> wrong?
>       If these steps are correct, why are the jobs not getting executed
> on the windows execution host? All jobs are pending. What should I do?
> This issue is driving me mad. Can anybody help me out here?

As we already discovered, the problem is that the sge_execd doesn't 
understand the data packages it receives from the sge_qmaster.
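
For reference, a minimal sketch for pulling only the error entries out of
an execd messages file. It assumes the pipe-separated layout of the
messages quoted in this thread (date time|daemon|host|level|text, with
level E for errors). The function name sge_errors is made up for
illustration and is not part of Grid Engine.

```shell
# Hypothetical helper: print only error-level lines of a messages file.
# Assumed line layout: date time|daemon|host|level|text
sge_errors() {
    # $1 = path to the messages file; field 4 holds the level, "E" = error
    awk -F'|' '$4 == "E"' "$1"
}

# Example (the path is a placeholder, use your own spool directory):
# sge_errors /path/to/spool/messages
```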

To dig deeper into this, you could enable tracing. Kill the running 
execd and enter these commands in the ksh:

# su ie10dtdc3zl1s+Administrator
# . /<your SGE_ROOT_PATH>/<your SGE_CELL>/common/settings.sh
# . $SGE_ROOT/util/dl.sh
# dl 4
# $SGE_ROOT/bin/win32-x86/sge_execd

Now the execd should print a lot of information to the terminal window.
Stop it by pressing Ctrl+C. Now start it again but redirect all output 
to a file:
# $SGE_ROOT/bin/win32-x86/sge_execd 2> output.txt

And then, please send me this file.

Regards,
Harald


> 
> Thanks 
> Sandeep
> 
> 
> -----Original Message-----
> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
> Sent: Tuesday, November 13, 2007 7:05 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] grid engine problem
> 
> Let's summarize this problem:
> 
> The execd messages file tells us that the execd is unable to connect to
> the qmaster. But the temporary messages files in the /tmp folder tell us
> that the execd can establish a TCP connection but can't understand the
> GDI packets it receives via this connection.
> The execd writes its messages to the /tmp folder until it has received
> its execd_spool_dir from the qmaster.
> 
> This means: At one point in time, the execd received its
> execd_spool_dir from the qmaster and wrote its messages to this spool
> dir!
> 
> And now somehow the GDI packets are invalid. One reason I know for this
> is a version conflict between qmaster and execd.
> 
> 
> What happens if you run e.g. "qstat" from the Windows host? Does it also
> fail with a GDI error?
> 
> 
> Regards,
> Harald
> 
> 
> Sandeep, Patel(IE10) wrote:
>> Hi
>>  In the SFU 3.5 Korn shell, when I type ps ax I get output
>> like:
>>
>>     1221 - 0:00:21 sge_execd
>> So it means the execution daemon is running. So now what is the issue?
>>
>> Thanks 
>> sandeep
>>
>> -----Original Message-----
>> From: Ravichandra.Nallan at Sun.COM [mailto:Ravichandra.Nallan at Sun.COM] 
>> Sent: Tuesday, November 13, 2007 5:43 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] grid engine problem
>>
>> Hi Sandeep,
>>  I assume you have gone through the setup guide, especially the SFU part.
>> Please check if the firewall is blocking the telnet (23) port.
>> Were you able to install the execd on the windows host, as this requires
>> that it is able to contact qmaster (at least read the config files)?
>> Can you also check if sge_execd is already running?
>> regards,
>> ~Ravi
>>
>> Harald Pollinger wrote:
>>> Can somebody help who is more experienced in GDI problems?
>>>
>>> Thanks!
>>> Harald
>>>
>>> Sandeep, Patel(IE10) wrote:
>>>> Hi, both win and lin are showing SGE 6.1u2.
>>>> Then what may be the issue?
>>>> Thanks sandeep
>>>>
>>>> -----Original Message-----
>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>>> Sent: Tuesday, November 13, 2007 4:58 PM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] grid engine problem
>>>>
>>>> Sandeep, Patel(IE10) wrote:
>>>>> Hi
>>>>>   I m getting the  output:
>>>>>
>>>>> $ telnet gridserver.sunnonegrid-bangalore.com 536
>>>>> Trying 199.63.61.100...
>>>>> Connected to gridserver.sunnonegrid-bangalore.com.
>>>>> Escape character is '^]'.
>>>>> Connection closed by foreign host.
>>>>> $
>>>>>
>>>>> Then what is the issue?
>>>>>
>>>>> And in the /tmp folder of windows I am getting messages like:
>>>>>
>>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|can't unpack gdi request
>>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|error unpacking gdi request: bad argument
>>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|getting configuration: failed receiving gdi request
>>>>> 11/13/2007 15:08:05|execd|ie10dtdc3zl1s|E|can't unpack gdi request
>>>>>
>>>>> What are these? How can I resolve it?
>>>> I think this means that your Windows execution daemon and your master
>>>> daemon are of different versions.
>>>>
>>>> Start on the Windows host:
>>>> # $SGE_ROOT/bin/win32-x86/sge_execd -help
>>>>
>>>> and on the QMaster host:
>>>> # $SGE_ROOT/bin/lx24-x86/sge_qmaster -help
>>>> (I'm not sure about the lx24, could also be lx26, and the x86 could be
>>>> amd64 or ia64, depending on your architecture and RHEL version)
>>>>
>>>> to get their version numbers.
>>>>
>>>> Regards,
>>>> Harald
>>>>
>>>>
>>>>
>>>>> Thanks
>>>>> sandeep
>>>>>
>>>>> -----Original Message-----
>>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>>>> Sent: Tuesday, November 13, 2007 4:20 PM
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] grid engine problem
>>>>>
>>>>> Sandeep, Patel(IE10) wrote:
>>>>>> Hi
>>>>>>      Actually I have one windows system and inside that I have installed
>>>>>> vmware. In that I am running two RHEL virtual machines. One of the
>>>>>> virtual RHEL is my master host, the other is an execution host. And the
>>>>>> host windows OS is one execution host. Is that the problem?
>>>>> This should not be a problem. Maybe you will have to change some
>>>>> settings.
>>>>>
>>>>>
>>>>>> With PuTTY I am able to connect from the windows execution host to
>>>>>> the RHEL master host through SSH. But telnet shows some network
>>>>>> error. How can I fix this?
>>>>> I think you got me wrong. Try to connect to the qmaster itself, not to
>>>>> the telnetd of the qmaster host. Use
>>>>> # telnet gridserver.sunnonegrid-bangalore.com 536
>>>>>
>>>>> It should print
>>>>>
>>>>> Trying [IP address of gridserver.sunnonegrid-bangalore.com]...
>>>>> Connected to gridserver.sunnonegrid-bangalore.com.
>>>>> Escape character is '^]'.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Harald
>>>>>
>>>>>
>>>>>> Thanks Sandeep
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>>>>> Sent: Tuesday, November 13, 2007 2:50 PM
>>>>>> To: users at gridengine.sunsource.net
>>>>>> Subject: Re: [GE users] grid engine problem
>>>>>>
>>>>>> Sandeep, Patel(IE10) wrote:
>>>>>>> Hi
>>>>>>>     I checked messages and I got something like this:
>>>>>>> 11/13/2007 12:38:16|execd|ie10dtdc3zl1s|E|commlib error: endpoint is not unique error (endpoint "ie10dtdc3zl1s.global.ds.honeywell.com/execd/1" is already connected)
>>>>>> Is there more than one "sge_execd" instance running on that host?
>>>>>> If yes, please kill all and start only one of them again.
>>>>>>
>>>>>>
>>>>>>> 11/13/2007 12:38:16|execd|ie10dtdc3zl1s|E|getting configuration: unable to contact qmaster using port 536 on host "gridserver.sunnonegrid-bangalore.com"
>>>>>> Is there a firewall running somewhere on or between the execution host
>>>>>> and the master host?
>>>>>> Is it possible to connect from the execution host to the qmaster using
>>>>>> telnet?
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Harald
>>>>>>
>>>>>>
>>>>>>> 11/13/2007 12:38:19|execd|ie10dtdc3zl1s|E|can't get configuration from qmaster -- backgrounding
>>>>>>>
>>>>>>> How to solve this problem
>>>>>>>
>>>>>>> Thanks sandeep
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Ravichandra.Nallan at Sun.COM [mailto:Ravichandra.Nallan at Sun.COM]
>>>>>>> Sent: Tuesday, November 13, 2007 12:25 PM
>>>>>>> To: users at gridengine.sunsource.net
>>>>>>> Subject: Re: [GE users] grid engine problem
>>>>>>>
>>>>>>> Hi Sandeep,
>>>>>>>  From the qstat output it is evident (states au) that the execd on host
>>>>>>> ie10dtdc3z11s.<something....> is not up. Check if there are any problems
>>>>>>> preventing the execd from coming up (check
>>>>>>> $SGE_ROOT/$SGE_CELL/spool/<hostname>/messages).
>>>>>>> This is the reason why the jobs are not scheduled to this host.
>>>>>>>
>>>>>>> (For info on queue states check the qstat(1) man page; you could also see
>>>>>>> in qstat that the load_avg/arch is -NA-!)
>>>>>>>
>>>>>>> Hope this helps.
>>>>>>> regards,
>>>>>>> ~Ravi
>>>>>>>
>>>>>>> Sandeep, Patel(IE10) wrote:
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> 1. I have my *master *host in RHEL.
>>>>>>>>
>>>>>>>> 2. I have two *execution* host
>>>>>>>>
>>>>>>>> A. one is on *windows *
>>>>>>>>
>>>>>>>> B. other one is on *RHEL*
>>>>>>>>
>>>>>>>> 3. When I submit the job *simple.sh* (4 times) and type
>>>>>>>> the command *qstat -f*, the job always goes to the
>>>>>>>> RHEL execution host for execution, because
>>>>>>>>
>>>>>>>> used/total *is 2/2* for RHEL, but for *windows 0/2*. The jobs are
>>>>>>>> *pending* for some time and *later taken by* the RHEL execution host.
>>>>>>>>
>>>>>>>> 4. It means the job is not distributed among the hosts *!!!!*
>>>>>>>>
>>>>>>>> 5. How can I solve this?
>>>>>>>>
>>>>>>>> 6. In this connection I have *attached* some *screen shots*. Can
>>>>>>>> you please check them?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> sandeep
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
> 
> 


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         N1 Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1,
D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering
