[GE users] grid engine problem

Harald Pollinger Harald.Pollinger at Sun.COM
Wed Nov 14 11:27:58 GMT 2007



Automounts are, as the name says, mounted automatically. You just have
to use them. All available shares are automatically mounted under the
"/net" directory. The contents of the "/net" directory may not be
listable, but the shares are still there.
Just try
# cd /net/gridserver.sunnonegrid-bangalore.com
# ls

or

# cd /net/gridserver
# ls

(Which one works depends on whether you have to use fully qualified
names or short names.)
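
For the $SGE_ROOT question, here is a minimal sketch of how the shell
setup could pick whichever of the two automount paths exists. The share
path "/opt/sge" and the helper name first_existing_dir are assumptions
for illustration only; use whatever "ls" shows under /net/<hostname> on
your host.

```shell
#!/bin/sh
# Print the first argument that exists as a directory; fail if none does.
# Testing a path with -d accesses it, which should be enough for an
# automounter to mount the share on demand.
first_existing_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# ASSUMPTION: the master host exports its SGE_ROOT as /opt/sge.
# Try the fully qualified host name first, then the short one.
if found=$(first_existing_dir \
        /net/gridserver.sunnonegrid-bangalore.com/opt/sge \
        /net/gridserver/opt/sge); then
    SGE_ROOT=$found
    export SGE_ROOT
    echo "SGE_ROOT=$SGE_ROOT"
else
    echo "no automounted SGE_ROOT found under /net" >&2
fi
```

If neither path exists, SGE_ROOT is left untouched and only a message
is printed.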

Regards,
Harald

Sandeep, Patel(IE10) wrote:
> Hi
>    Actually, I don't know much about SFU. Can you please tell me how
> to automount my $SGE_ROOT to Windows SFU /opt/sge?
> 
> Thanks
> sandeep
> 
> -----Original Message-----
> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
> Sent: Wednesday, November 14, 2007 4:14 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] grid engine problem
> 
> Sandeep, Patel(IE10) wrote:
>> Hi
>>    I think this problem is due to an improper installation. Below I have
>> written down my steps. Can you check whether they are correct?
>>      1. I downloaded SGE 6.1u2.
>>      2. I untarred and unzipped it.
>>      3. I installed the master host on RHEL.
>>      4. I NFS-mounted the $SGE_ROOT on the Windows system; it appears as
>> the Z drive in Windows.
>>      5. Then, in the SFU Korn shell, I set SGE_ROOT=/dev/fs/Z and
>> exported it.
> 
> I would use the automounter of SFU; its shares are mounted automatically
> at boot time. The drive mappings of Windows are dangerous: a mapping may
> exist only while you are logged in, but not when the execd starts.
> All available network shares are automounted to "/net/<hostname>/<share>".
> 
> 
>>      6. Then I started the installation of the execution host on Windows.
>>      7. During the installation I was asked about the spool directory.
>> I created a spool directory on one drive of my Windows system.
> 
> That's correct, the spool directory has to be on a local drive.
> 
> 
>>       Are these steps correct? Am I missing something or doing something
>> wrong?
>>       If these steps are correct, why are the jobs not executed on the
>> Windows execution host? All jobs are pending. What should I do? This
>> issue is driving me mad. Can anybody help me out here?
> 
> As we already discovered, the problem is that the sge_execd doesn't 
> understand the data packages it receives from the sge_qmaster.
> 
> To dig deeper into this, you could enable tracing. Kill the running 
> execd and enter these commands in the ksh:
> 
> # su ie10dtdc3zl1s+Administrator
> # . /<your SGE_ROOT_PATH>/<your SGE_CELL>/common/settings.sh
> # . $SGE_ROOT/util/dl.sh
> # dl 4
> # $SGE_ROOT/bin/win32-x86/sge_execd
> 
> Now the execd should print a lot of information to the terminal window.
> Stop it by pressing Ctrl+C. Now start it again but redirect all output 
> to a file:
> # $SGE_ROOT/bin/win32-x86/sge_execd 2> output.txt
> 
> And then, please send me this file.
> 
> Regards,
> Harald
> 
> 
>> Thanks 
>> Sandeep
>>
>>
>> -----Original Message-----
>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>> Sent: Tuesday, November 13, 2007 7:05 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] grid engine problem
>>
>> Let's summarize this problem:
>>
>> The execd messages file tells us that the execd is unable to connect to
>> the qmaster. But the temporary messages files in the /tmp folder tell us
>> that the execd can establish a TCP connection but can't understand the
>> GDI packets it receives via this connection.
>> The execd writes its messages to the /tmp folder until it has received
>> its execd_spool_dir from the qmaster.
>>
>> This means: at one point in time, the execd received its
>> execd_spool_dir from the qmaster and wrote its messages to this spool
>> dir!
>>
>> And now somehow the GDI packets are invalid. One reason I know of for
>> this is a version conflict between qmaster and execd.
>>
>>
>> What happens if you run e.g. "qstat" from the Windows host? Does it also
>> fail with a GDI error?
>>
>>
>> Regards,
>> Harald
>>
>>
>> Sandeep, Patel(IE10) wrote:
>>> Hi
>>>  In the SFU 3.5 Korn shell, when I type "ps ax" I get output
>>> like:
>>>
>>>     1221 - 0:00:21 sge_execd
>>> So the execution daemon is running. What is the issue, then?
>>>
>>> Thanks 
>>> sandeep
>>>
>>> -----Original Message-----
>>> From: Ravichandra.Nallan at Sun.COM [mailto:Ravichandra.Nallan at Sun.COM] 
>>> Sent: Tuesday, November 13, 2007 5:43 PM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] grid engine problem
>>>
>>> Hi Sandeep,
>>>  I assume you have gone through the setup guide, especially the SFU part.
>>> Please check if the firewall is blocking the telnet (23) port.
>>> Were you able to install the execd on the Windows host? This requires
>>> that it is able to contact the qmaster (or at least read the config files).
>>> Can you also check if sge_execd is already running?
>>> regards,
>>> ~Ravi
>>>
>>> Harald Pollinger wrote:
>>>> Can somebody help who is more experienced in GDI problems?
>>>>
>>>> Thanks!
>>>> Harald
>>>>
>>>> Sandeep, Patel(IE10) wrote:
>>>>> Hi, both Windows and Linux are showing SGE 6.1u2.
>>>>> Then what may be the issue?
>>>>> Thanks sandeep
>>>>>
>>>>> -----Original Message-----
>>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>>>> Sent: Tuesday, November 13, 2007 4:58 PM
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] grid engine problem
>>>>>
>>>>> Sandeep, Patel(IE10) wrote:
>>>>>> Hi
>>>>>>   I am getting this output:
>>>>>>
>>>>>> $ telnet gridserver.sunnonegrid-bangalore.com 536
>>>>>> Trying 199.63.61.100...
>>>>>> Connected to gridserver.sunnonegrid-bangalore.com.
>>>>>> Escape character is '^]'.
>>>>>> Connection closed by foreign host.
>>>>>> $
>>>>>>
>>>>>> Then what is the issue?
>>>>>>
>>>>>> And in the /tmp folder on Windows I am getting messages like:
>>>>>>
>>>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|can't unpack gdi request
>>>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|error unpacking gdi request: bad argument
>>>>>> 11/13/2007 15:08:01|execd|ie10dtdc3zl1s|E|getting configuration: failed receiving gdi request
>>>>>> 11/13/2007 15:08:05|execd|ie10dtdc3zl1s|E|can't unpack gdi request
>>>>>>
>>>>>> What are these? How can I resolve them?
>>>>> I think this means that your Windows execution daemon and your master
>>>>> daemon are of different versions.
>>>>>
>>>>> Start on the Windows host:
>>>>> # $SGE_ROOT/bin/win32-x86/sge_execd -help
>>>>>
>>>>> and on the QMaster host:
>>>>> # $SGE_ROOT/bin/lx24-x86/sge_qmaster -help
>>>>> (I'm not sure about the lx24, it could also be lx26, and the x86 could
>>>>> be amd64 or ia64, depending on your architecture and RHEL version)
>>>>>
>>>>> to get their version numbers.
>>>>>
>>>>> Regards,
>>>>> Harald
>>>>>
>>>>>
>>>>>
>>>>>> Thanks
>>>>>> sandeep
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>>>>> Sent: Tuesday, November 13, 2007 4:20 PM
>>>>>> To: users at gridengine.sunsource.net
>>>>>> Subject: Re: [GE users] grid engine problem
>>>>>>
>>>>>> Sandeep, Patel(IE10) wrote:
>>>>>>> Hi
>>>>>>>      Actually, I have one Windows system and inside it I have installed
>>>>>>> VMware. In that I am running two RHEL virtual machines. One of the
>>>>>>> virtual RHEL machines is my master host, the other is an execution
>>>>>>> host. And the host Windows OS is one execution host. Is that the problem?
>>>>>> This should not be a problem. Maybe you will have to change some
>>>>>> settings.
>>>>>>
>>>>>>
>>>>>>> With PuTTY I am able to connect from the Windows execution host to the
>>>>>>> RHEL master host through SSH. But telnet shows some network
>>>>>>> error. How can I fix this?
>>>>>> I think you got me wrong. Try to connect to the qmaster itself, not to
>>>>>> the telnetd of the qmaster host. Use
>>>>>> # telnet gridserver.sunnonegrid-bangalore.com 536
>>>>>>
>>>>>> It should print
>>>>>>
>>>>>> Trying [IP address of gridserver.sunnonegrid-bangalore.com]...
>>>>>> Connected to gridserver.sunnonegrid-bangalore.com.
>>>>>> Escape character is '^]'.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Harald
>>>>>>
>>>>>>
>>>>>>> Thanks Sandeep
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Harald.Pollinger at Sun.COM [mailto:Harald.Pollinger at Sun.COM] 
>>>>>>> Sent: Tuesday, November 13, 2007 2:50 PM
>>>>>>> To: users at gridengine.sunsource.net
>>>>>>> Subject: Re: [GE users] grid engine problem
>>>>>>>
>>>>>>> Sandeep, Patel(IE10) wrote:
>>>>>>>> Hi
>>>>>>>>     I checked the messages file and got something like this:
>>>>>>>> 11/13/2007 12:38:16|execd|ie10dtdc3zl1s|E|commlib error: endpoint is not unique error (endpoint "ie10dtdc3zl1s.global.ds.honeywell.com/execd/1" is already connected)
>>>>>>> Is more than one "sge_execd" instance running on that host?
>>>>>>> If yes, please kill all of them and start only one again.
>>>>>>>
>>>>>>>
>>>>>>>> 11/13/2007 12:38:16|execd|ie10dtdc3zl1s|E|getting configuration: unable to contact qmaster using port 536 on host "gridserver.sunnonegrid-bangalore.com"
>>>>>>> Is there a firewall running somewhere on or between the execution host
>>>>>>> and the master host?
>>>>>>> Is it possible to connect from the execution host to the qmaster using
>>>>>>> telnet?
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Harald
>>>>>>>
>>>>>>>
>>>>>>>> 11/13/2007 12:38:19|execd|ie10dtdc3zl1s|E|can't get configuration from qmaster -- backgrounding
>>>>>>>>
>>>>>>>> How can I solve this problem?
>>>>>>>>
>>>>>>>> Thanks sandeep
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Ravichandra.Nallan at Sun.COM [mailto:Ravichandra.Nallan at Sun.COM]
>>>>>>>> Sent: Tuesday, November 13, 2007 12:25 PM
>>>>>>>> To: users at gridengine.sunsource.net
>>>>>>>> Subject: Re: [GE users] grid engine problem
>>>>>>>>
>>>>>>>> Hi Sandeep,
>>>>>>>>  From the qstat output it is evident (states "au") that the execd on host
>>>>>>>> ie10dtdc3zl1s.<something....> is not up. Check whether there are any
>>>>>>>> problems preventing the execd from coming up (check
>>>>>>>> $SGE_ROOT/$SGE_CELL/spool/<hostname>/messages).
>>>>>>>> This is the reason why the jobs are not scheduled to this host.
>>>>>>>>
>>>>>>>> (For info on queue states check the qstat(1) man page. You can also see
>>>>>>>> in qstat that the load_avg/arch is -NA-!)
>>>>>>>>
>>>>>>>> Hope this helps.
>>>>>>>> regards,
>>>>>>>> ~Ravi
>>>>>>>>
>>>>>>>> Sandeep, Patel(IE10) wrote:
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> 1. I have my *master *host in RHEL.
>>>>>>>>>
>>>>>>>>> 2. I have two *execution* host
>>>>>>>>>
>>>>>>>>> A. one is on *windows *
>>>>>>>>>
>>>>>>>>> B. other one is on *RHEL*
>>>>>>>>>
>>>>>>>>> 3. When I submit the job *simple.sh* (4 times) and then run
>>>>>>>>> the command *qstat -f*, the jobs always go to the
>>>>>>>>> RHEL execution host for execution, because
>>>>>>>>>
>>>>>>>>> used/total is *2/2* for RHEL but *0/2* for Windows. The jobs are
>>>>>>>>> *pending* for some time and are *later taken by* the RHEL execution host.
>>>>>>>>> 4. It means the jobs are not distributed among the hosts *!!!!*
>>>>>>>>>
>>>>>>>>> 5. How can I solve this?
>>>>>>>>>
>>>>>>>>> 6. I have *attached* some *screenshots* of this.
>>>>>>>>> Can you please check them?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> sandeep
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>>
>>
> 
> 


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         N1 Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1,
D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering
