[GE users] RE: [GE dev] Re: [GE users] sge v6.0u3 new installation issue with more than 1021 hosts.

Ron Chen ron_chen_123 at yahoo.com
Fri Mar 11 17:28:51 GMT 2005


--- Andy Schwierskott wrote:
> Hmm, interesting article. The suggested fix silently implies that the
> kernel is ready to process much bigger structs than just FD_SETSIZE of
> 1024.

FreeBSD has done it right, in sys/select.h:

/*
 * Select uses bit masks of file descriptors in longs.
 * These macros manipulate such bit fields (the
 * filesystem macros use chars). FD_SETSIZE may be
 * defined by the user, but the default here should
 * be enough for most uses.
 */

#ifndef FD_SETSIZE
#define FD_SETSIZE      1024U
#endif

I then looked at the Linux header files: there are
macros that use the value of FD_SETSIZE to compute
the size of the array passed into select().
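
To make the point about the array size concrete, here is a minimal
sketch that reports how many descriptors an fd_set can hold on the
platform it is compiled on. The function names are made up for
illustration and are not taken from any system header or from SGE. The
only assumption is that fd_set is a fixed-size bit array dimensioned at
compile time, which is how the headers I looked at define it.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <sys/select.h>

/* Number of descriptors an fd_set can represent: one bit per descriptor. */
unsigned long fd_set_capacity(void)
{
    return (unsigned long)sizeof(fd_set) * CHAR_BIT;
}

/* Returns 1 if fd may safely be passed to FD_SET()/FD_ISSET(), else 0. */
int fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && (unsigned long)fd < (unsigned long)FD_SETSIZE;
}

/* Print the compile-time values so they can be compared across platforms. */
void print_fd_set_limits(void)
{
    /* The bit array must be at least as large as the advertised limit. */
    assert(fd_set_capacity() >= (unsigned long)FD_SETSIZE);
    printf("FD_SETSIZE      = %lu\n", (unsigned long)FD_SETSIZE);
    printf("sizeof(fd_set)  = %lu bytes\n", (unsigned long)sizeof(fd_set));
    printf("fd_set capacity = %lu descriptors\n", fd_set_capacity());
}
```

Calling print_fd_set_limits() from a small main() on each OS shows
whether the structure grows when FD_SETSIZE is defined before the
includes. Whether the kernel then accepts the larger set is a separate
question.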

 -Ron

 
> At least on Solaris you should get EINVAL:
> 
>       EINVAL
>             The nfds argument is less than 0 or greater than
>             FD_SETSIZE.
> 
> I'm not (yet) ready to believe this can work.
> 
> Andy
> 
> >
> > Seems like you don't need to recompile other stuff for Apache, so it
> > should be the same for SGE. But please only do this with your test
> > cluster, or otherwise your angry users will not let you leave for the
> > weekend :)
> >
> > Or you can move the master machine to another OS, you can still
> > leave the exec hosts on Linux.
> >
> > -Ron
> >
> >
> >> sometimes it hangs for a while and works, sometimes it terminates
> >> with the messages "got connect timeout: connect timeout error" and
> >> "endpoint beo297.ihess.com/qmaster/1 at port 538: can't find
> >> connection". I'm not sure I can get by with this unpredictability in
> >> responsiveness, especially if qsub works the same way. I have a
> >> feeling my user community will not find this acceptable.
> >>
> >> Next question. Will I encounter the same type of problem with sge
> >> v53p6 if I cross the 1020 node threshold? I hesitate to "try it and
> >> see if it breaks" on my currently running production system, but I
> >> can if necessary.
> >>
> >> thanks...
> >> mac
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Andy Schwierskott [mailto:andy.schwierskott at sun.com]
> >> Sent: Friday, March 11, 2005 5:07 AM
> >> To: users at gridengine.sunsource.net
> >> Cc: dev at gridengine.sunsource.net
> >> Subject: [GE dev] Re: [GE users] sge v6.0u3 new installation issue
> >> with more than 1021 hosts.
> >>
> >> Mac,
> >>
> >> (CC'ing to "dev" because of my technical question at the end).
> >>
> >>> I also have a trace of qmaster from startup to occurrence of this
> >>> with dl set to 1 if you would need it. I haven't gotten too far into
> >>> it myself yet....will open the issue soon as I get to the office.
> >>
> >> Here's a preliminary analysis:
> >>
> >> There is a bug in the commlib and a limitation on several OSes.
> >>
> >> In Linux (kernels 2.4, 2.6; x86 AND AMD64!) the maximum number of
> >> fd's which can be given as an argument to select() is limited to
> >> 1024:
> >>
> >> /usr/include/bits/types.h:#define __FD_SETSIZE       1024
> >>
> >> The same limit applies on Mac OS/X, 32bit Solaris (x86, Sparc), SGI
> >> and HP. It's 32767 on IBM, 4096 on Tru64, and 65536 on Solaris 64bit
> >> (Sparc and AMD64).
> >>
> >> The SGE code has a bug: it does not properly check this limit at
> >> startup (there is only a debug log message at runtime if the number
> >> of actual connections exceeds FD_SETSIZE, and this has no
> >> consequences).
> >>
> >> The workaround is to limit the max. number of fd's at qmaster startup
> >> to not more than 1024 - this should work with 6.0u3 or earlier (but
> >> not 6.0u2).
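
One way to read the workaround above, sketched in C: cap the soft
descriptor limit at FD_SETSIZE before any sockets are opened, so that no
descriptor number can exceed what an fd_set holds. This is only an
illustration under my own assumptions. clamp_nofile_to_fd_setsize() is a
hypothetical helper, not something in the SGE source, and the same
effect can be had by lowering the limit in the shell that starts
qmaster.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/select.h>

/*
 * Hypothetical helper (not SGE code): lower the soft RLIMIT_NOFILE to
 * FD_SETSIZE if it is currently higher or unlimited.
 * Returns the resulting soft limit, or -1 on error.
 */
long clamp_nofile_to_fd_setsize(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > (rlim_t)FD_SETSIZE) {
        /* Lowering the soft limit needs no privilege; the hard limit is kept. */
        rl.rlim_cur = (rlim_t)FD_SETSIZE;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
    }
    return (long)rl.rlim_cur;
}
```

A daemon would call this once at startup and refuse to continue on -1.
Once the cap is in place, the process can no longer hold more than
FD_SETSIZE descriptors open at once, so it only hides the limit rather
than lifting it.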
> >>
> >> Question to the community: I think poll() is the correct alternative
> >> to circumvent this problem. Am I right? Or are there other
> >> limitations with poll() (e.g. speed)?
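
For reference, a minimal sketch of what the poll() variant asked about
above could look like. wait_readable() is a hypothetical helper written
for this mail, not commlib code. The point it illustrates is that the
caller sizes the pollfd array, so FD_SETSIZE never enters the picture.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the SGE commlib): wait up to
 * timeout_ms for any of the nfds descriptors in fds[] to become readable.
 * Returns poll()'s result: the number of descriptors with events,
 * 0 on timeout, -1 on error.
 * ready[i] is set to 1 when fds[i] has data, EOF or an error pending.
 */
int wait_readable(const int *fds, int *ready, size_t nfds, int timeout_ms)
{
    struct pollfd *pfds;
    size_t i;
    int n;

    /* The array is sized by the caller's count, not by FD_SETSIZE. */
    pfds = calloc(nfds ? nfds : 1, sizeof(*pfds));
    if (pfds == NULL)
        return -1;

    for (i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
    }

    n = poll(pfds, (nfds_t)nfds, timeout_ms);

    for (i = 0; i < nfds; i++)
        ready[i] = n > 0 &&
                   (pfds[i].revents & (POLLIN | POLLHUP | POLLERR)) != 0;

    free(pfds);
    return n;
}
```

On the speed question: like select(), poll() still hands the whole
descriptor list to the kernel on every call, so the cost per call grows
with the number of connections. The process also stays bound by its
RLIMIT_NOFILE setting.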
> >>
> >> Thanks,
> >> Andy
> >>
> >>
> >>> Regards.....mac
> >>> Mac McCalla
> >>> --------------------------
> >>> Sent from my BlackBerry Wireless Handheld
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Andy Schwierskott <andy.schwierskott at sun.com>
> >>> To: users at gridengine.sunsource.net <users at gridengine.sunsource.net>
> >>> Sent: Fri Mar 11 02:21:06 2005
> >>> Subject: RE: [GE users] sge v6.0u3 new installation issue with more
> >>> than 1021 hosts.
> >>>
> >>> Mac,
> >>>
> >>> yes, please open an issue and attach your qping dump.
> >>>
> >>> Andy
> >>>
> >>>>> Ron Chen wrote:
> >>>>> OK, so if you stop the execd on one of the 1021 hosts, would the
> >>>>> 1022nd host work?
> >>>>
> >>>> i was using the lines in "netstat -a |grep
> 
=== message truncated ===



		




