[GE users] RE: [GE dev] Re: [GE users] sge v6.0u3 new installation issue with more than 1021 hosts.

Andy Schwierskott andy.schwierskott at sun.com
Fri Mar 11 17:13:55 GMT 2005


Ron,

> --- "McCalla, Mac" <macmccalla at hess.com> wrote:
>> I have been trying to use the qhost and qping commands to figure out
>> which nodes are successfully running execd and which aren't, so I can
>> go restart them (sgeexecd).  Response from the qmaster running with
>> ulimit -n1000 is sporadic at best.  Sometimes the command works
>> immediately
>
> Mac,
>
> One thing you can try with the SGE6 cluster is to
> compile SGE yourself and set a higher limit.
>
> I googled a bit and found that many software packages
> (Apache, OpenLDAP) break on Linux when they hit this
> problem.
>
> One can set FD_SETSIZE at compile time to get a
> higher limit, but there is a bug in the header file
> that causes the user's setting to be ignored. And the glibc
> maintainers say that they won't fix this and ask
> people to use poll() instead :(
>
> You can follow this to modify the system include file
> and then recompile SGE:
> http://five2one.org/stdio/retrospect/fd_setsize_error_on_red_hat_enterprise.html

Hmm, interesting article. The suggested fix silently assumes that the kernel
can handle fd_set structures larger than the default FD_SETSIZE of 1024.

At least on Solaris, select() should fail with EINVAL in that case:

      EINVAL
            The nfds argument is less than 0 or greater than
            FD_SETSIZE.

I'm not (yet) ready to believe this can work.
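
To make the select()/poll() difference concrete, here is a minimal sketch in C. It is an illustration only, not SGE commlib code, and the function names fits_in_fd_set() and wait_readable() are invented for this example. The first helper is the kind of guard that is needed before calling FD_SET() on a descriptor; the second waits on a single descriptor with poll(), which takes an array of struct pollfd and so does not depend on FD_SETSIZE.

```c
#include <poll.h>
#include <sys/select.h>

/* Hypothetical helper: returns 1 if fd may be passed to FD_SET() without
 * writing past the end of a default-sized fd_set, 0 otherwise. */
int fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical helper: wait up to timeout_ms milliseconds for fd to become
 * readable, using poll() instead of select().
 * Returns 1 if readable, 0 on timeout, -1 on error or hangup. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;              /* poll() itself failed, see errno */
    if (rc == 0)
        return 0;               /* timed out, nothing to read */
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

A caller that has to keep using select() could refuse (or close) any descriptor for which fits_in_fd_set() returns 0, while a poll()-based loop would call wait_readable(), or poll() directly on a larger array, for any descriptor the process limit allows.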

Andy

>
> Seems like you don't need to recompile other stuff for
> Apache, so it should be the same for SGE. But please
> only do this with your test cluster, or otherwise your
> angry users will not let you leave for the weekend :)
>
> Or you can move the master machine to another OS, you
> can still leave the exec hosts on Linux.
>
> -Ron
>
>
>> sometimes it hangs for a while and works, sometimes it terminates
>> with the messages "got connect timeout: connect timeout error" and
>> "endpoint beo297.ihess.com/qmaster/1 at port 538: can't find
>> connection".  I'm not sure I can get by with this unpredictability
>> in responsiveness, especially if qsub works the same way.  I have a
>> feeling my user community will not find this acceptable.
>>
>> Next question.  Will I encounter the same type of problem with sge
>> v53p6 if I cross the 1020 node threshold?  I hesitate to "try it and
>> see if it breaks" on my currently running production system, but I
>> can if necessary.
>>
>> thanks...
>> mac
>>
>>
>>
>> -----Original Message-----
>> From: Andy Schwierskott [mailto:andy.schwierskott at sun.com]
>> Sent: Friday, March 11, 2005 5:07 AM
>> To: users at gridengine.sunsource.net
>> Cc: dev at gridengine.sunsource.net
>> Subject: [GE dev] Re: [GE users] sge v6.0u3 new installation issue with more than 1021 hosts.
>>
>> Mac,
>>
>> (CC'ing to "dev" because of my technical question at the end).
>>
>>> I also have a trace of qmaster from startup to occurrence of this with dl
>>> set to 1 if you would need it.  I haven't gotten too far into it myself
>>> yet....will open the issue soon as I get to the office.
>>
>> Here's a preliminary analysis:
>>
>> There is a bug in the commlib and a limitation on several OSes.
>>
>> In Linux (kernels 2.4, 2.6; x86 AND AMD64!) the size of the fd sets
>> which can be given as arguments to select() is limited to 1024 fd's:
>>
>> /usr/include/bits/types.h: bits/types.h:#define __FD_SETSIZE       1024
>>
>> The same limit applies on Mac OS X, 32-bit Solaris (x86, Sparc, SGI, HP).
>> It's 32767 on IBM, 4096 on Tru64, and 65536 on 64-bit Solaris (Sparc and
>> AMD64).
>>
>> The SGE code has a bug: it does not properly check this limit at
>> startup (there is only a debug log message at runtime if the number of
>> actual connections exceeds FD_SETSIZE, and it has no consequences).
>>
>> The workaround is to limit the max. number of fd's at qmaster startup to
>> not more than 1024 - this should work with 6.0u3 or earlier (but not
>> 6.0u2).
>>
>> Question to the community: I think poll() is the correct alternative to
>> circumvent this problem. Am I right? Or are there other limitations with
>> poll() (e.g. speed)?
>>
>> Thanks,
>> Andy
>>
>>
>>> Regards.....mac
>>> Mac McCalla
>>> --------------------------
>>> Sent from my BlackBerry Wireless Handheld
>>>
>>>
>>> -----Original Message-----
>>> From: Andy Schwierskott <andy.schwierskott at sun.com>
>>> To: users at gridengine.sunsource.net <users at gridengine.sunsource.net>
>>> Sent: Fri Mar 11 02:21:06 2005
>>> Subject: RE: [GE users] sge v6.0u3 new installation issue with more than 1021 hosts.
>>>
>>> Mac,
>>>
>>> yes, please open an issue and attach your qping dump.
>>>
>>> Andy
>>>
>>>>> Ron Chen wrote:
>>>>> OK, so if you stop the execd on one of the 1021 hosts,
>>>>> would the 1022nd host work?
>>>>
>>>> I was using the number of lines from "netstat -a |grep qmast|wc" as
>>>> the host count, but it turns out there is 1 more line than hosts, so
>>>> there are actually 1020 hosts connected ok...then when the 1021st
>>>> tries to connect, it appears that qmaster stops accepting new
>>>> connections.  The answer to your question in principle is yes.  If I
>>>> go to one of the connected hosts and kill the execd (can't shut it
>>>> down nicely because it appears that takes a new connection? and
>>>> hangs), then in a few minutes qhost etc. commands will work and I can
>>>> start an execd from another host...then commands will stop
>>>> working....
>>>>
>>>>> Any error messages in the qmaster log file? And does
>>>>> strace tell you anything?
>>>>
>>>> No error messages in the qmaster log file other than the ones previously
>>>> mentioned, which I believe are the same as documented in issue 1431 and
>>>> of no consequence.
>>>>
>>>> strace of an execd starting on a new host (number 1021) did not show me
>>>> anything except lots of
>>>>               gettimeofday(....)=0
>>>>               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
>>>>               gettimeofday (....)
>>>>               gettimeofday (....)
>>>>               rt_sigprocmask(SIG_BLOCK,......)
>>>>               gettimeofday (....)
>>>>               gettimeofday (....)
>>>>               gettimeofday (....)
>>>>               gettimeofday (....)
>>>>               rt_sigprocmask(SIG_SETMASK,........)
>>>>               getpid() = 24197
>>>>               write  some trace stuff since i had dl set to 1
>>>>               write some more trace info
>>>>               gettimeofday (....)
>>>>                 several more times
>>>>               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
>>>>               repeat sequence
>>>>
>>>>
>>>> Perhaps of more interest is results of running qping on new host and
>>>> qmaster at same time.
>>>> host b00501 is new host.  command is "qping -info beo297 538 qmaster 1"
>>>> terminates after some time with message
>>
> === message truncated ===
