[GE users] sge v6.0u3 new installation issue with more than 1021 hosts.

McCalla, Mac macmccalla at hess.com
Fri Mar 11 17:00:31 GMT 2005


Hi Ron,

Thanks for the info. You have helped me make up my mind not to risk
going "production" with a reduced set of nodes quite yet. I have
temporarily lost my window of opportunity to cut over, so production
will stay with v53p6 for the next few days while we decide what to do.
I will let you know what happens with v60u3.

Best Regards,

mac mccalla 

-----Original Message-----
From: Ron Chen [mailto:ron_chen_123 at yahoo.com] 
Sent: Friday, March 11, 2005 10:40 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] RE: [GE dev] Re: [GE users] sge v6.0u3 new
installation issue with more than 1021 hosts.

--- "McCalla, Mac" <macmccalla at hess.com> wrote:
> I have been trying to use the qhost and qping commands to figure out
> which nodes are successfully running execd and which aren't, so I can
> go restart them (sgeexecd). Response from the qmaster running
> ulimit -n1000 is sporadic at best. Sometimes the command works
> immediately,

Mac,

One thing you can try with the SGE6 cluster is to
compile SGE yourself and set a higher limit.

I googled a bit and found that many software packages
(Apache, OpenLDAP) break on Linux when they hit this
problem.

One can set FD_SETSIZE at compile time to get a
higher limit, but there is a bug in the header file
that causes the user's setting to be ignored. And the
glibc maintainers say that they won't fix this and ask
people to use poll instead :(
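
For anyone following along, here is a minimal sketch of what "use poll
instead" looks like. It is written in Python for brevity (the SGE commlib
is C, and this is not its code); Python's select.poll() wraps the same
poll() system call. The helper name wait_readable and the socketpair demo
are made up for illustration. The point is that poll() takes a list of
descriptors rather than a fixed-size bit mask, so it is not bound by
FD_SETSIZE.

```python
import select
import socket


def wait_readable(socks, timeout_ms=1000):
    """Return the sockets in `socks` that become readable within timeout_ms.

    Hypothetical helper for illustration only. It uses poll(), which is
    handed an explicit list of descriptors and therefore has no
    FD_SETSIZE ceiling the way select()'s fd_set does.
    """
    poller = select.poll()
    by_fd = {}
    for s in socks:
        poller.register(s.fileno(), select.POLLIN)
        by_fd[s.fileno()] = s

    ready = []
    # poll() returns (fd, event_mask) pairs for descriptors with activity.
    for fd, events in poller.poll(timeout_ms):
        if events & select.POLLIN:
            ready.append(by_fd[fd])
    return ready


if __name__ == "__main__":
    a, b = socket.socketpair()
    c, d = socket.socketpair()
    b.sendall(b"ping")                      # makes `a` readable; `c` stays idle
    ready = wait_readable([a, c], timeout_ms=100)
    print(len(ready), ready[0] is a)
    for s in (a, b, c, d):
        s.close()
```

select.poll() is available on Linux and most Unix systems but not on
Windows, so treat this as a Unix-only sketch.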

You can follow this to modify the system include file
and then recompile SGE:
http://five2one.org/stdio/retrospect/fd_setsize_error_on_red_hat_enterprise.html

Seems like you don't need to recompile other stuff for
Apache, so it should be the same for SGE. But please
only do this with your test cluster, or otherwise your
angry users will not let you leave for the weekend :)

Or you can move the master machine to another OS; you
can still leave the exec hosts on Linux.

 -Ron
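
P.S. The workaround discussed in the quoted text (starting qmaster with
the descriptor limit at or below the select() limit, e.g. via ulimit -n)
can also be done from inside a process. A rough sketch, again in Python
and not taken from SGE: the name clamp_nofile and the constant
ASSUMED_FD_SETSIZE are hypothetical, and 1024 is simply the Linux value
quoted in this thread, so check your own platform's headers.

```python
import resource

# Assumption: 1024 is the FD_SETSIZE value quoted for Linux in this thread.
# Other platforms differ, so this constant is illustrative only.
ASSUMED_FD_SETSIZE = 1024


def clamp_nofile(cap=ASSUMED_FD_SETSIZE):
    """Lower the soft RLIMIT_NOFILE to `cap` if it is higher.

    Returns the soft limit in effect afterwards. Lowering the soft limit
    never needs extra privileges; the hard limit is left untouched.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft > cap:
        resource.setrlimit(resource.RLIMIT_NOFILE, (cap, hard))
        soft = cap
    return soft


if __name__ == "__main__":
    print("soft descriptor limit now:", clamp_nofile())
```

This only keeps descriptor numbers within what select() can handle; it
does not raise the number of hosts that can connect.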


> sometimes it hangs for a while and then works, and sometimes it
> terminates with the messages "got connect timeout: connect timeout
> error" and "endpoint beo297.ihess.com/qmaster/1 at port 538: can't
> find connection". I'm not sure I can get by with this
> unpredictability in responsiveness, especially if qsub works the
> same way. I have a feeling my user community will not find this
> acceptable.
> 
> Next question: will I encounter the same type of problem with sge
> v53p6 if I cross the 1020-node threshold? I hesitate to "try it and
> see if it breaks" on my currently running production system, but I
> can if necessary.
> 
> thanks...
> mac  
> 
>  
> 
> -----Original Message-----
> From: Andy Schwierskott [mailto:andy.schwierskott at sun.com]
> Sent: Friday, March 11, 2005 5:07 AM
> To: users at gridengine.sunsource.net
> Cc: dev at gridengine.sunsource.net
> Subject: [GE dev] Re: [GE users] sge v6.0u3 new installation issue
> with more than 1021 hosts.
> 
> Mac,
> 
> (CC'ing to "dev" because of my technical question at the end).
> 
> > I also have a trace of qmaster from startup to occurrence of this
> > with dl set to 1 if you would need it.  I haven't gotten too far
> > into it myself yet....will open the issue soon as I get to the
> > office.
> 
> Here's a preliminary analysis:
> 
> There is a bug in the commlib and a limitation on several OSes.
> 
> On Linux (kernels 2.4, 2.6; x86 AND AMD64!) the number of fd's that
> can be given as an argument to select() is limited to 1024:
> 
> /usr/include/bits/types.h: bits/types.h:#define __FD_SETSIZE       1024
> 
> The same limit applies on Mac OS/X, 32bit Solaris (x86, Sparc, SGI,
> HP). It's 32767 on IBM, 4096 on Tru64, and 65536 on Solaris 64bit
> (Sparc and AMD64).
> 
> The SGE code has a bug: it does not properly check this limit at
> startup (there is just a debug log message at runtime if the number
> of actual connections exceeds FD_SETSIZE, but this has no
> consequences).
> 
> The workaround is to limit the max. number of fd's at qmaster
> startup to not more than 1024. This should work with 6.0u3 or
> earlier (but not 6.0u2).
> 
> Question to the community: I think poll() is the correct alternative
> to circumvent this problem. Am I right? Or are there other
> limitations with poll() (e.g. speed)?
> 
> Thanks,
> Andy
> 
> 
> > Regards.....mac
> > Mac McCalla
> > --------------------------
> > Sent from my BlackBerry Wireless Handheld
> >
> >
> > -----Original Message-----
> > From: Andy Schwierskott <andy.schwierskott at sun.com>
> > To: users at gridengine.sunsource.net <users at gridengine.sunsource.net>
> > Sent: Fri Mar 11 02:21:06 2005
> > Subject: RE: [GE users] sge v6.0u3 new installation issue with more
> > than 1021 hosts.
> >
> > Mac,
> >
> > Yes, please open an issue and attach your qping dump.
> >
> > Andy
> >
> >>> Ron Chen wrote:
> >>> OK, so if you stop the execd on one of the 1021 hosts,
> >>> would the 1022nd host work?
> >>
> >> I was using the number of lines in "netstat -a |grep qmast|wc" as
> >> the host count, but it turns out there is 1 more line than hosts,
> >> so there are actually 1020 hosts connected OK...then when the
> >> 1021st tries to connect it appears that qmaster stops accepting
> >> new connections.  The answer to your question in principle is
> >> yes.  If I go to one of the connected hosts and kill the execd
> >> (can't shut it down nicely because it appears that takes a new
> >> connection? and hangs), then in a few minutes, qhost..etc commands
> >> will work and I can start an execd from another host...then
> >> commands will stop working....
> >>
> >>> Any error messages in the qmaster log file? And does
> >>> strace tell you anything?
> >>
> >> No error messages in the qmaster log file other than the ones
> >> previously mentioned, which I believe are the same as documented
> >> in issue 1431 and of no consequence.
> >>
> >> strace of an execd starting on a new host (number 1021) did not
> >> show me anything except lots of
> >>               gettimeofday(....)=0
> >>               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
> >>               gettimeofday (....)
> >>               gettimeofday (....)
> >>               rt_sigprocmask(SIG_BLOCK,......)
> >>               gettimeofday (....)
> >>               gettimeofday (....)
> >>               gettimeofday (....)
> >>               gettimeofday (....)
> >>               rt_sigprocmask(SIG_SETMASK,........)
> >>               getpid() = 24197
> >>               write  some trace stuff since i had dl set to 1
> >>               write some more trace info
> >>               gettimeofday (....)
> >>                 several more times
> >>               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
> >>               repeat sequence
> >>
> >>
> >> Perhaps of more interest are the results of running qping on the
> >> new host and qmaster at the same time.  Host b00501 is the new
> >> host.  The command is "qping -info beo297 538 qmaster 1"; it
> >> terminates after some time with message
> 
=== message truncated ===



		

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net






