[GE users] grid engine problem

Sandeep, Patel(IE10) Sandeep.Patel2 at Honeywell.com
Wed Nov 14 13:54:40 GMT 2007


Hi,
 I have set up SFU and no longer have any problems with it. I installed the
execution daemon on Windows successfully. But when I type
$ ps ax
I am not able to see the execution daemon in the process list. What is the
issue here?

Thanks 
sandeep
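
A minimal sketch of the process check described above, assuming a POSIX
shell in which "ps ax" prints one process per line with the command name
last (as in the "1221 - 0:00:21 sge_execd" line quoted further down). The
function name execd_running is made up for illustration.

```shell
#!/bin/sh
# Succeeds if a process listing read from stdin contains sge_execd.
# Reading stdin lets the same check run on live or saved "ps ax" output.
execd_running() {
    # Match sge_execd as a whole command name, with or without a leading path.
    grep -Eq '(^|[ /])sge_execd( |$)'
}

if ps ax 2>/dev/null | execd_running; then
    echo "sge_execd is running"
else
    echo "sge_execd is not in the process list"
fi
```

If the daemon is reported as missing, the execd messages file is the next
place to look.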

-----Original Message-----
From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
Sent: Wednesday, November 14, 2007 6:33 PM
To: Sandeep, Patel(IE10)
Subject: Re: [GE users] grid engine problem

Sandeep,

I'm sorry, but I won't teach you the very basics of Unix. Please ask a
Unix administrator from your company.

I don't have access to your computers, so I can't simply have a look to
see what's wrong, and I won't play the game of guessing what's wrong and
getting your "no, that didn't help" answer for days or even weeks.

Please find someone who can help you set up SFU and your environment
like on the Linux host (this needs much, much more Unix knowledge than
Windows knowledge) and read the SGE installation guide
(http://docs.sun.com/app/docs/doc/820-0697?l=en&q=Grid+Engine+)
- there is some additional information about setting up SFU in the
Appendix.

If you have done this and come back with some errors that are really 
related to SGE, I will be happy to help you then.

Regards,
Harald


Sandeep, Patel(IE10) wrote:
> Hi
>    But I have done the ln -s. Then why is it not finding the file?
> 
> Thanks
> sandeep
> -----Original Message-----
> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
> Sent: Wednesday, November 14, 2007 6:06 PM
> To: Sandeep, Patel(IE10)
> Subject: Re: [GE users] grid engine problem
> 
> 
> 
> Sandeep, Patel(IE10) wrote:
>> Hi
>>  I executed the command, then I started the installation:
>>
>> cat: cannot open file /opt/sge/default/common/bootstrap : No such file
>> or directory
> 
> Well, I guess this is the cause of the error... Grid Engine's execution host
> installation needs this file.
> 
> 
>> ./inst_sge: [: none: unexpected operator/operand
>> The admin user >IE10DTDC3ZL1S+< doesn't match the admin username
>> >IE10DTDC3ZL1S+E402335< in the global cluster configuration
> 
> This is just a consequence of the error above.
> 
> Regards,
> Harald
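
A minimal sketch of a pre-flight check for the bootstrap file mentioned
above, assuming the paths from the error message (/opt/sge as SGE_ROOT and
"default" as SGE_CELL). The function name check_bootstrap is hypothetical;
it only tests that the file is readable and does not validate its content.

```shell
#!/bin/sh
# Check that <SGE_ROOT>/<SGE_CELL>/common/bootstrap is readable.
# Arguments override the environment; the defaults are the values seen in
# the error message (assumptions, adjust to your installation).
check_bootstrap() {
    root=${1:-${SGE_ROOT:-/opt/sge}}
    cell=${2:-${SGE_CELL:-default}}
    file="$root/$cell/common/bootstrap"
    if [ -r "$file" ]; then
        echo "ok: $file"
    else
        echo "missing: $file" >&2
        return 1
    fi
}
```

Running "check_bootstrap /opt/sge default" before starting the execution
host installation shows right away whether the link to the shared
directory resolves.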
> 
> 
> 
>> -----Original Message-----
>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>> Sent: Wednesday, November 14, 2007 5:20 PM
>> To: Sandeep, Patel(IE10)
>> Subject: Re: [GE users] grid engine problem
>>
>> Just create a link to it.
>>
>> # ln -s /net/...../opt/sge /opt/sge
>>
>> Regards,
>> Harald
>>
>> Sandeep, Patel(IE10) wrote:
>>> Hi
>>>   I got the result:
>>>
>>>   $ls
>>>    App home opt/sge.
>>>
>>> Then how do I mount these things?
>>>
>>> Thanks 
>>> Sandeep
>>>
>>> -----Original Message-----
>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>> Sent: Wednesday, November 14, 2007 4:58 PM
>>> To: Sandeep, Patel(IE10)
>>> Cc: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] grid engine problem
>>>
>>> Automounts are - as the name already says - automatically mounted. You
>>> just have to use them. All available shares are automatically mounted to
>>> the "/net" directory. The content of the "/net" directory may not be
>>> listable, but it is still there.
>>> Just try
>>> # cd /net/gridserver.sunnonegrid-bangalore.com
>>> # ls
>>>
>>> or
>>>
>>> # cd /net/gridserver
>>> # ls
>>>
>>> (depending on whether you have to use fully qualified names or short names).
>>>
>>> Regards,
>>> Harald
>>>
>>> Sandeep, Patel(IE10) wrote:
>>>> Hi
>>>>    Actually I don't have much idea about SFU. Can you please tell me how
>>>> to automount my $SGE_ROOT to Windows SFU /opt/sge?
>>>>
>>>> Thanks
>>>> sandeep
>>>>
>>>> -----Original Message-----
>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>>> Sent: Wednesday, November 14, 2007 4:14 PM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] grid engine problem
>>>>
>>>> Sandeep, Patel(IE10) wrote:
>>>>> Hi
>>>>>    I think this problem is due to an improper installation. Below I have
>>>>> written my steps. Can you check whether these are correct or not:
>>>>>      1. I downloaded SGE 6.1u2
>>>>>      2. Untarred and unzipped it
>>>>>      3. I installed the master host on RHEL
>>>>>      4. I NFS-mounted the $SGE_ROOT to the Windows system; it appears as
>>>>> the Z drive in Windows.
>>>>>      5. Then in the SFU Korn shell I set $SGE_ROOT to /dev/fs/Z, then I
>>>>> exported it.
>>>> I would use the automounter of SFU, it's automatically mounted during
>>>> boot time. The drive mappings of Windows are dangerous - it could be
>>>> that it only exists when you log in, but not when the execd starts.
>>>> All available network shares are automounted to
>>>> "/net/<hostname>/<share>".
>>>>
>>>>
>>>>>      6. Then I started the installation of the execution host on
>>>>> Windows.
>>>>>      7. During the installation I was asked about the spool directory.
>>>>> I created a spool directory on one drive of my Windows system.
>>>> That's correct, the spool directory has to be on a local drive.
>>>>
>>>>
>>>>>       Are these steps correct? Am I missing something or doing
>>>>> something wrong?
>>>>>       If these steps are correct, why are the jobs not getting executed
>>>>> on the Windows execution host? All jobs are pending. What should I do?
>>>>> I am going mad over this issue. Can anybody help me out here?
>>>> As we already discovered, the problem is that the sge_execd doesn't
>>>> understand the data packages it receives from the sge_qmaster.
>>>>
>>>> To dig deeper into this, you could enable tracing. Kill the running
>>>> execd and enter these commands in the ksh:
>>>>
>>>> # su ie10dtdc3zl1s+Administrator
>>>> # . /<your SGE_ROOT_PATH>/<your SGE_CELL>/common/settings.sh
>>>> # . $SGE_ROOT/util/dl.sh
>>>> # dl 4
>>>> # $SGE_ROOT/bin/win32-x86/sge_execd
>>>>
>>>> Now the execd should print a lot of information to the terminal window.
>>>> Stop it by pressing Ctrl+C. Now start it again but redirect all output
>>>> to a file:
>>>> # $SGE_ROOT/bin/win32-x86/sge_execd 2> output.txt
>>>>
>>>> And then, please send me this file.
>>>>
>>>> Regards,
>>>> Harald
>>>>
>>>>
>>>>> Thanks 
>>>>> Sandeep
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>>>> Sent: Tuesday, November 13, 2007 7:05 PM
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] grid engine problem
>>>>>
>>>>> Let's summarize this problem:
>>>>>
>>>>> The execd messages file tells us that the execd is unable to connect to
>>>>> the qmaster. But the temporary messages files in the /tmp folder tell us
>>>>> that the execd can establish a TCP connection but can't understand the
>>>>> GDI packets it receives via this connection.
>>>>> The execd writes its messages to the /tmp folder until it has received
>>>>> its execd_spool_dir from the qmaster.
>>>>>
>>>>> This means: At one point in time, the execd received its
>>>>> execd_spool_dir from the qmaster and wrote its messages to this spool
>>>>> dir!
>>>>>
>>>>> And now somehow the GDI packets are invalid. One reason I know for this
>>>>> is a version conflict between qmaster and execd.
>>>>>
>>>>>
>>>>> What happens if you run e.g. "qstat" from the Windows host? Does it also
>>>>> fail with a GDI error?
>>>>>
>>>>>
>>>>> Regards,
>>>>> Harald
>>>>>
>>>>>
>>>>> Sandeep, Patel(IE10) wrote:
>>>>>> Hi
>>>>>>  In the SFU 3.5 Korn shell, I type ps ax and get output
>>>>>> like:
>>>>>>
>>>>>>     1221 - 0:00:21 sge_execd
>>>>>> So it means the execution daemon is running. So now what is the issue?
>>>>>> Thanks 
>>>>>> sandeep
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ravichandra.Nallan at Sun.COM [mailto:Ravichandra.Nallan at Sun.COM]
>>>>>> Sent: Tuesday, November 13, 2007 5:43 PM
>>>>>> To: users at gridengine.sunsource.net
>>>>>> Subject: Re: [GE users] grid engine problem
>>>>>>
>>>>>> Hi Sandeep,
>>>>>>  I assume you have gone through the setup guide, especially the SFU part.
>>>>>> Please check if the firewall is blocking the telnet (23) port.
>>>>>> Were you able to install the execd on the Windows host, as this requires
>>>>>> that it is able to contact the qmaster (at least read the config files)?
>>>>>> Can you also check if sge_execd is already running?
>>>>>> regards,
>>>>>> ~Ravi
>>>>>>
>>>>>> Harald Pollinger wrote:
>>>>>>> Can somebody help who is more experienced in GDI problems?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Harald
>>>>>>>
>>>>>>> Sandeep, Patel(IE10) wrote:
>>>>>>>> Hi, both Windows and Linux are showing SGE 6.1u2.
>>>>>>>> Then what may be the issue?
>>>>>>>> Thanks sandeep
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM]
>>>>>>>> Sent: Tuesday, November 13, 2007 4:58 PM
>>>>>>>> To: users at gridengine.sunsource.net
>>>>>>>> Subject: Re: [GE users] grid engine problem
>>>>>>>>
>>>>>>>> Sandeep, Patel(IE10) wrote:
>>>>>>>>> Hi
>>>>>>>>>   I am getting this output:
>>>>>>>>>
>>>>>>>>> $ telnet gridserver.sunnonegrid-bangalore.com 536
>>>>>>>>> Trying 199.63.61.100...
>>>>>>>>> Connected to gridserver.sunnonegrid-bangalore.com.
>>>>>>>>> Escape character is '^]'.
>>>>>>>>> Connection closed by foreign host.
>>>>>>>>> $
>>>>>>>>>
>>>>>>>>> Then what is the issue?
>>>>>>>>>
>>>>>>>>> And in the /tmp folder on Windows I am getting messages like:
>>>>>>>>>
>>>>>>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|can't unpack gdi request
>>>>>>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|error unpacking gdi request: bad argument
>>>>>>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|getting configuration: failed receiving gdi request
>>>>>>>>> 11/13/2007 15:08:05|execd|ie10dtdc3zl1s|E|can't unpack gdi request
>>>>>>>>> What are these? How can I resolve them?
>>>>>>>> I think this means that your Windows execution daemon and your master
>>>>>>>> daemon are of different versions.
>>>>>>>>
>>>>>>>> Start on the Windows host:
>>>>>>>> # $SGE_ROOT/bin/win32-x86/sge_execd -help
>>>>>>>>
>>>>>>>> and on the QMaster host:
>>>>>>>> # $SGE_ROOT/bin/lx24-x86/sge_qmaster -help
>>>>>>>> (I'm not sure about the lx24, could also be lx26, and the x86 could be
>>>>>>>> amd64 or ia64, depending on your architecture and RHEL version)
>>>>>>>>
>>>>>>>> to get their version numbers.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Harald
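
A minimal sketch of the version comparison suggested above, assuming the
"-help" output of each daemon has been saved to a file on its own host and
that the version string is on the first line (an assumption; adjust if your
build prints it elsewhere). The function name same_version and the file
names are hypothetical.

```shell
#!/bin/sh
# Compare the first line of two saved "-help" outputs.
# Carriage returns are stripped because one file may come from Windows.
same_version() {
    v1=$(head -n 1 "$1" | tr -d '\r')
    v2=$(head -n 1 "$2" | tr -d '\r')
    if [ "$v1" = "$v2" ]; then
        echo "match: $v1"
    else
        echo "mismatch: '$v1' vs '$v2'"
        return 1
    fi
}
```

For example, after saving the two outputs as execd_help.txt and
qmaster_help.txt, "same_version execd_help.txt qmaster_help.txt" reports
whether the first lines agree.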
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> sandeep
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM]
>>>>>>>>> Sent: Tuesday, November 13, 2007 4:20 PM
>>>>>>>>> To: users at gridengine.sunsource.net
>>>>>>>>> Subject: Re: [GE users] grid engine problem
>>>>>>>>>
>>>>>>>>> Sandeep, Patel(IE10) wrote:
>>>>>>>>>> Hi
>>>>>>>>>>      Actually I have one Windows system and inside that I have
>>>>>>>>>> installed VMware. In that I am running two RHEL virtual machines.
>>>>>>>>>> One of the RHEL virtual machines is my master host, the other is an
>>>>>>>>>> execution host. And the host Windows OS is another execution host.
>>>>>>>>>> Is that the problem?
>>>>>>>>> This should not be a problem. Maybe you will have to change some
>>>>>>>>> settings.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Using PuTTY I am able to connect from the Windows execution host to
>>>>>>>>>> the RHEL master host through SSH. But telnet is showing some network
>>>>>>>>>> error. How can I fix this?
>>>>>>>>> I think you got me wrong. Try to connect to the qmaster itself, not to
>>>>>>>>> the telnetd of the qmaster host. Use
>>>>>>>>> # telnet gridserver.sunnonegrid-bangalore.com 536
>>>>>>>>>
>>>>>>>>> It should print
>>>>>>>>>
>>>>>>>>> Trying [IP address of gridserver.sunnonegrid-bangalore.com]...
>>>>>>>>> Connected to gridserver.sunnonegrid-bangalore.com.
>>>>>>>>> Escape character is '^]'.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Harald
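
The expected telnet transcript above can also be checked mechanically. A
minimal sketch, assuming the transcript has been saved to a file or is
piped in; the function name port_reachable is hypothetical.

```shell
#!/bin/sh
# Classify a telnet transcript read from stdin.
# A "Connected to ..." line means something accepted the TCP connection
# on the qmaster port; it says nothing about the protocol spoken on it.
port_reachable() {
    if grep -q '^Connected to '; then
        echo "reachable"
    else
        echo "unreachable"
        return 1
    fi
}
```

A "reachable" result only rules out a blocked port. As the later messages
in this thread show, the execd can still fail to understand what it
receives over that connection.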
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks Sandeep
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM]
>>>>>>>>>> Sent: Tuesday, November 13, 2007 2:50 PM
>>>>>>>>>> To: users at gridengine.sunsource.net
>>>>>>>>>> Subject: Re: [GE users] grid engine problem
>>>>>>>>>>
>>>>>>>>>> Sandeep, Patel(IE10) wrote:
>>>>>>>>>>> Hi
>>>>>>>>>>>     I checked messages and I got something like this:
>>>>>>>>>>> 11/13/2007 12:38:16|execd|ie10dtdc3zl1s|E|commlib error: endpoint is not unique error (endpoint "ie10dtdc3zl1s.global.ds.honeywell.com/execd/1" is already connected)
>>>>>>>>>> Is there more than one "sge_execd" instance running on that host?
>>>>>>>>>> If yes, please kill all and start only one of them again.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 11/13/2007 12:38:16|execd|ie10dtdc3zl1s|E|getting configuration: unable to contact qmaster using port 536 on host "gridserver.sunnonegrid-bangalore.com"
>>>>>>>>>> Is there a firewall running somewhere on or between the execution host
>>>>>>>>>> and the master host?
>>>>>>>>>> Is it possible to connect from the execution host to the qmaster using
>>>>>>>>>> telnet?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Harald
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 11/13/2007 12:38:19|execd|ie10dtdc3zl1s|E|can't get configuration from qmaster -- backgrounding
>>>>>>>>>>>
>>>>>>>>>>> How can I solve this problem?
>>>>>>>>>>>
>>>>>>>>>>> Thanks sandeep
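
The messages lines quoted in this thread follow the pattern
"date time|daemon|host|level|text". A minimal sketch for summarizing the
error entries, assuming that pattern holds for the whole file and that
level "E" marks errors; the function name summarize_errors is hypothetical.

```shell
#!/bin/sh
# Count error-level entries in an SGE messages file, most frequent first.
# Reads the named files, or stdin when no file is given.
# Note: only the text up to the next '|' is used as the message key.
summarize_errors() {
    awk -F'|' '$4 == "E" { count[$5]++ }
               END { for (m in count) printf "%d %s\n", count[m], m }' "$@" |
        sort -rn
}
```

Running it on both the spool directory's messages file and the temporary
files in /tmp makes it easier to see which error dominates.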
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Ravichandra.Nallan at Sun.COM [mailto:Ravichandra.Nallan at Sun.COM]
>>>>>>>>>>> Sent: Tuesday, November 13, 2007 12:25 PM
>>>>>>>>>>> To: users at gridengine.sunsource.net
>>>>>>>>>>> Subject: Re: [GE users] grid engine problem
>>>>>>>>>>>
>>>>>>>>>>> Hi Sandeep,
>>>>>>>>>>>  From the qstat output it is evident (states au) that the execd on host
>>>>>>>>>>> ie10dtdc3zl1s.<something....> is not up. Check if there are any problems
>>>>>>>>>>> preventing the execd from coming up (check
>>>>>>>>>>> $SGE_ROOT/$SGE_CELL/spool/<hostname>/messages).
>>>>>>>>>>> This is the reason why the jobs are not scheduled to this host.
>>>>>>>>>>> (For info on queue states check the qstat(1) man page; you can also see
>>>>>>>>>>> in qstat that the load_avg/arch is -NA-!)
>>>>>>>>>>>
>>>>>>>>>>> Hope this helps.
>>>>>>>>>>> regards,
>>>>>>>>>>> ~Ravi
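
A minimal sketch for picking out queue instances in an unreachable state
from "qstat -f" output. It assumes the six-column layout "queuename qtype
used/tot. load_avg arch states" and that a "u" in the states column marks
an execd that cannot be reached; check both against qstat(1) for your
version. The function name unreachable_queues is hypothetical.

```shell
#!/bin/sh
# Print the names of queue instances whose states column contains "u".
# Reads the named files, or stdin when no file is given.
unreachable_queues() {
    # Queue lines have a used/total pair such as 0/2 in the third column;
    # this skips the header, separators and job lines.
    awk 'NF >= 6 && $3 ~ /^[0-9]+\/[0-9]+$/ && $6 ~ /u/ { print $1 }' "$@"
}
```

Typical use would be "qstat -f | unreachable_queues" on the master host.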
>>>>>>>>>>>
>>>>>>>>>>> Sandeep, Patel(IE10) wrote:
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> 1. I have my *master* host on RHEL.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. I have two *execution* hosts:
>>>>>>>>>>>>
>>>>>>>>>>>> A. one is on *Windows*
>>>>>>>>>>>>
>>>>>>>>>>>> B. the other one is on *RHEL*
>>>>>>>>>>>>
>>>>>>>>>>>> 3. When I submit the job *simple.sh* (4 times) and then type
>>>>>>>>>>>> the command *qstat -f*, the job always goes to the
>>>>>>>>>>>> RHEL execution host for execution, because
>>>>>>>>>>>>
>>>>>>>>>>>> used/total is *2/2* for RHEL but *0/2* for *Windows*. The jobs are
>>>>>>>>>>>> *pending* for some time and are *later taken by* the RHEL execution
>>>>>>>>>>>> host.
>>>>>>>>>>>> 4. It means the jobs are not distributed among the hosts *!!!!*
>>>>>>>>>>>> 5. How can I solve this?
>>>>>>>>>>>>
>>>>>>>>>>>> 6. In this connection I have *attached* some *screenshots*. Can
>>>>>>>>>>>> you please check them?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> sandeep


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         N1 Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1,
D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



