[GE users] sge v6.0u3 new installation issue with more than 1021 hosts.

Andy Schwierskott andy.schwierskott at sun.com
Fri Mar 11 08:21:06 GMT 2005


Mac,

yes, please open an issue and attach your qping dump.

Andy

>> Ron Chen wrote:
>> OK, so if you stop the execd on one of the 1021 hosts,
>> would the 1022nd host work?
>
> I was using the line count from "netstat -a |grep qmast|wc" as the number,
> but it turns out there is 1 more line than hosts, so there are actually
> 1020 hosts connected OK. Then, when the 1021st tries to connect, it appears
> that qmaster stops accepting new connections. The answer to your question
> is, in principle, yes. If I go to one of the connected hosts and kill the
> execd (I can't shut it down nicely because that appears to need a new
> connection, and hangs), then in a few minutes qhost etc. commands will
> work and I can start an execd from another host. Then commands stop
> working again.
>
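
The off-by-one in the netstat count can be avoided by counting only
established connections. A minimal sketch, assuming the extra line is the
qmaster's own listening socket (the thread does not confirm this) and that
the connection state is the last column of each netstat line. The sample
text and the host names hostA and hostB are hypothetical.

```python
def count_peer_connections(netstat_text: str, needle: str = "qmast") -> int:
    """Count netstat lines that mention `needle` and are ESTABLISHED."""
    count = 0
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip unrelated lines and anything not in state ESTABLISHED,
        # which excludes the listening socket (assumed extra line).
        if needle in line and fields and fields[-1] == "ESTABLISHED":
            count += 1
    return count


# Hypothetical netstat excerpt: one listening socket plus two connected hosts.
SAMPLE = """\
tcp        0      0 *:sge_qmaster          *:*                    LISTEN
tcp        0      0 beo297:sge_qmaster     hostA:32768            ESTABLISHED
tcp        0      0 beo297:sge_qmaster     hostB:32769            ESTABLISHED
"""

if __name__ == "__main__":
    print(count_peer_connections(SAMPLE))
```

To use it on a live system, the output of "netstat -a" would be passed in
as the text argument.
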
>> Any error messages in the qmaster log file? And does
>> strace tell you anything?
>
> No error messages in the qmaster log file other than the ones previously
> mentioned, which I believe are the same as documented in issue 1431 and
> of no consequence.
>
> strace of an execd starting on a new host (number 1021) did not show me
> anything except lots of
>               gettimeofday(....)=0
>               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
>               gettimeofday (....)
>               gettimeofday (....)
>               rt_sigprocmask(SIG_BLOCK,......)
>               gettimeofday (....)
>               gettimeofday (....)
>               gettimeofday (....)
>               gettimeofday (....)
>               rt_sigprocmask(SIG_SETMASK,........)
>               getpid() = 24197
>               write  some trace stuff since i had dl set to 1
>               write some more trace info
>               gettimeofday (....)
>                 several more times
>               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
>               repeat sequence
>
>
> Perhaps of more interest are the results of running qping on the new host
> and on the qmaster at the same time.
> Host b00501 is the new host. The command is
> "qping -info beo297 538 qmaster 1"; it terminates after some time with
> the message
> endpoint beo297.ihess.com/qmaster/1 at port 538: can't find connection
>
> qmaster is running on host beo297. The command there is
> "qping -dump beo297 538 qmaster 1".
> The qping -dump command was running before qping on b00501 was executed.
> The time-of-day difference from b00501 to beo297 is +5m30s. I have
> attached a gzip'd file of the qping -dump output (about 56k in size).
>
> Should I open this as an issue?
>
> regards...mac mccalla
>
>
>> -Ron
>
>
> -----Original Message-----
> From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
> Sent: Thursday, March 10, 2005 11:37 AM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] sge v6.0u3 new installation issue with more than
> 1021 hosts.
>
> OK, so if you stop the execd on one of the 1021 hosts,
> would the 1022nd host work?
>
> Any error messages in the qmaster log file? And does
> strace tell you anything?
>
> -Ron
>
> --- "McCalla, Mac" <macmccalla at hess.com> wrote:
>> qmaster stopped responding on port 538 to any further requests from
>> additional execd's or commands (qstat, qhost, etc). The ulimit for fd's
>> is set at 4096 at qmaster startup (the info message at qmaster startup
>> says qmaster will use 4076 file descriptors for communication). Has
>> anyone else seen this problem, or does anyone have a 6.0u3 installation
>> with more hosts?
>>
>> thanks in advance,
>> Mac McCalla
>>
>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
>
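
The numbers quoted above can be put side by side. A rough sketch, assuming
one connection per host (as the netstat count suggests), that compares the
1020 connected hosts with the 4076 descriptors qmaster reports for
communication. The function names are made up for illustration and are not
part of Grid Engine.

```python
import resource  # Unix-only standard library module


def reserved_descriptors(fd_limit: int, comm_fds: int) -> int:
    """Descriptors not handed to communication (limit minus reported pool)."""
    return fd_limit - comm_fds


def limit_reached(connected_hosts: int, comm_fds: int) -> bool:
    """True if the connection count has used up the communication pool.

    Assumes one descriptor per connected host, which is an assumption.
    """
    return connected_hosts >= comm_fds


if __name__ == "__main__":
    # Values reported in this thread.
    print("reserved:", reserved_descriptors(4096, 4076))
    print("limit reached:", limit_reached(1020, 4076))

    # Limits of the current process, for comparison on the qmaster host.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
```

With the reported values the pool is nowhere near exhausted at 1020 hosts,
so under these assumptions the ulimit alone would not account for the
stall.
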


Andy

--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Andy Schwierskott           Tel: +49 (0)941 3075-200 (x60200)
N1 Grid Engine Engineering  Fax: +49 (0)941 3075-222 (x60222)
Sun Microsystems GmbH
Dr.-Leo-Ritter-Str. 7       mailto:andy.schwierskott at sun.com
D-93049 Regensburg          http://www.sun.com/gridware




