[GE users] sge v6.0u3 new installation issue with more than 1021 hosts.

Ron Chen ron_chen_123 at yahoo.com
Fri Mar 11 13:58:20 GMT 2005


Hi Mac,

Please also open an issue to track the "SGE not
freeing up client endpoints" problem in SGE 6.

 -Ron

P.S. I was out of the office today, and when I came back
it seems the 1020-host problem had been resolved by Andy
(well done!) :)


--- Andy Schwierskott wrote:
> Mac,
> 
> yes, please open an issue and attach your qping dump.
> 
> Andy
> 
> >> Ron Chen wrote:
> >> OK, so if you stop the execd on one of the 1021 hosts,
> >> would the 1022nd host work?
> >
> > I was using the line count from "netstat -a |grep qmast|wc" as the number,
> > but it turns out there is 1 more line than hosts, so there are actually
> > 1020 hosts connected OK. Then, when the 1021st tries to connect, it appears
> > that qmaster stops accepting new connections. The answer to your question
> > in principle is yes. If I go to one of the connected hosts and kill the
> > execd (I can't shut it down nicely because it appears that takes a new
> > connection and hangs), then in a few minutes qhost etc. commands will work
> > and I can start an execd from another host. Then commands will stop
> > working.
> >
> >> Any error messages in the qmaster log file? And does
> >> strace tell you anything?
> >
> > No error messages in the qmaster log file other than the ones previously
> > mentioned, which I believe are the same as documented in issue 1431 and
> > of no consequence.
> >
> > strace of an execd starting on a new host (number 1021) did not show me
> > anything except lots of
> >               gettimeofday(....)=0
> >               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
> >               gettimeofday (....)
> >               gettimeofday (....)
> >               rt_sigprocmask(SIG_BLOCK,......)
> >               gettimeofday (....)
> >               gettimeofday (....)
> >               gettimeofday (....)
> >               gettimeofday (....)
> >               rt_sigprocmask(SIG_SETMASK,........)
> >               getpid() = 24197
> >               write  some trace stuff since i had dl set to 1
> >               write some more trace info
> >               gettimeofday (....)
> >                 several more times
> >               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
> >               repeat sequence
> >
> >
> > Perhaps of more interest are the results of running qping on the new host
> > and on the qmaster host at the same time.
> > Host b00501 is the new host. The command is
> > "qping -info beo297 538 qmaster 1". It terminates after some time with
> > the message
> > endpoint beo297.ihess.com/qmaster/1 at port 538: can't find connection
> >
> > qmaster is running on host beo297. The command is
> > "qping -dump beo297 538 qmaster 1". The qping -dump command was already
> > running before qping on b00501 was executed. There is a time-of-day
> > difference of +5m30s from b00501 to beo297. I have attached a gzip'd file
> > of the qping -dump output (about 56k in size).
> >
> > Should I open this as an issue?
> >
> > regards...mac mccalla
> >
> >
> >> -Ron
> >
> >
> > -----Original Message-----
> > From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
> > Sent: Thursday, March 10, 2005 11:37 AM
> > To: users at gridengine.sunsource.net
> > Subject: Re: [GE users] sge v6.0u3 new installation issue with more than 1021 hosts.
> >
> > OK, so if you stop the execd on one of the 1021 hosts,
> > would the 1022nd host work?
> >
> > Any error messages in the qmaster log file? And does
> > strace tell you anything?
> >
> > -Ron
> >
> > --- "McCalla, Mac" <macmccalla at hess.com> wrote:
> >> qmaster stopped responding on port 538 to any further
> >> requests from additional execd's or commands (qstat, qhost, etc).
> >> The ulimit for fd's is set at 4096 at qmaster startup (the info
> >> message at qmaster startup says qmaster will use 4076 file
> >> descriptors for communication). Has anyone else seen this problem, or
> >> does anyone have a 6.0u3 installation with more hosts?
> >>
> >> thanks in advance,
> >> Mac McCalla
> >>
> >>
> >
> >
> >
> > __________________________________
> > Do you Yahoo!?
> > Yahoo! Small Business - Try our new resources site!
> > http://smallbusiness.yahoo.com/resources/
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >
> >
> 
> 
> Andy
> 
> --
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Andy Schwierskott           Tel: +49 (0)941 3075-200 (x60200)
> N1 Grid Engine Engineering  Fax: +49 (0)941 3075-222 (x60222)
> Sun Microsystems GmbH
> Dr.-Leo-Ritter-Str. 7       mailto:andy.schwierskott at sun.com
> D-93049 Regensburg          http://www.sun.com/gridware
> 
> 
> 
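
For anyone repeating the host count from the quoted thread: Mac notes that
"netstat -a |grep qmast|wc" reported one more line than there were connected
hosts. The thread does not say what the extra line was; one plausible
reading, and only an assumption here, is that the qmaster's own listening
socket also matches the grep. A minimal Python sketch of counting by TCP
state instead of by raw line count follows. The sample text, the
"sge_qmaster" service name, and the b0000x host names are hypothetical and
only imitate the usual Linux netstat layout; they are not taken from Mac's
system.

```python
from collections import Counter


def count_by_state(netstat_text, needle="qmast"):
    """Count netstat lines containing `needle`, grouped by the last column (TCP state)."""
    states = Counter()
    for line in netstat_text.splitlines():
        if needle in line:
            fields = line.split()
            states[fields[-1]] += 1
    return states


# Hypothetical sample output (assumption: typical Linux `netstat -a` layout).
SAMPLE = """\
tcp        0      0 *:sge_qmaster           *:*                     LISTEN
tcp        0      0 beo297:sge_qmaster      b00001:32771            ESTABLISHED
tcp        0      0 beo297:sge_qmaster      b00002:32768            ESTABLISHED
tcp        0      0 beo297:sge_qmaster      b00003:32770            ESTABLISHED
"""

counts = count_by_state(SAMPLE)
total_lines = sum(counts.values())   # what the grep-and-wc line count would report
connected = counts["ESTABLISHED"]    # connections that are actually established
print(total_lines, connected)
```

With this sample the raw line count is one higher than the number of
established connections, which matches the off-by-one Mac describes. On a
real qmaster host the text would come from netstat itself rather than from a
string constant.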



		



