[GE users] RE: [GE dev] Re: [GE users] sge v6.0u3 new installation issue with more than 1021 hosts.

McCalla, Mac macmccalla at hess.com
Fri Mar 11 16:07:42 GMT 2005


Andy,
	I have been trying to use the qhost and qping commands to figure
out which nodes are successfully running execd and which aren't, so I
can go restart them (sgeexecd).  The response from the qmaster running
with ulimit -n 1000 is sporadic at best.  Sometimes the command works
immediately, sometimes it hangs for a while and then works, and
sometimes it terminates with the messages "got connect timeout: connect
timeout error" and "endpoint beo297.ihess.com/qmaster/1 at port 538:
can't find connection".  I'm not sure I can get by with this
unpredictability in responsiveness, especially if qsub works the same
way.  I have a feeling my user community will not find this acceptable.

Next question.  Will I encounter the same type of problem with sge
v53p6 if I cross the 1020 node threshold?  I hesitate to "try it and
see if it breaks" on my currently running production system, but I can
if necessary.

thanks...
mac  

 

-----Original Message-----
From: Andy Schwierskott [mailto:andy.schwierskott at sun.com] 
Sent: Friday, March 11, 2005 5:07 AM
To: users at gridengine.sunsource.net
Cc: dev at gridengine.sunsource.net
Subject: [GE dev] Re: [GE users] sge v6.0u3 new installation issue with
more than 1021 hosts.

Mac,

(CC'ing to "dev" because of my technical question at the end).

> I also have a trace of qmaster from startup to occurrence of this with
> dl set to 1 if you would need it.  I haven't gotten too far into it
> myself yet....will open the issue soon as I get to the office.

Here's a preliminary analysis:

There is a bug in the commlib and a limitation on several OSes.

In Linux (kernels 2.4, 2.6; x86 AND AMD64!) the set of fd's which can be
given as an argument to select() is limited to FD_SETSIZE, which is 1024
(so only fd's numbered 0 to 1023 fit in an fd_set):

/usr/include/bits/types.h:#define __FD_SETSIZE       1024

The same limit applies on Mac OS X, 32-bit Solaris (x86, Sparc), SGI
and HP. It's 32767 on IBM, 4096 on Tru64, and 65536 on 64-bit Solaris
(Sparc and AMD64).
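
To make the limit concrete, here is a minimal sketch of a guard around
FD_SET(). The helper name safe_fd_set is hypothetical and this is not
the commlib code; it only illustrates that a descriptor numbered
FD_SETSIZE or higher cannot be represented in an fd_set and has to be
rejected before FD_SET() is called.

```c
#include <sys/select.h>

/* Hypothetical helper (not part of SGE): add fd to the set only if
 * select() can represent it.
 * Returns 0 on success, -1 if fd is outside [0, FD_SETSIZE). */
int safe_fd_set(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE) {
        /* Calling FD_SET() here would be undefined behaviour. */
        return -1;
    }
    FD_SET(fd, set);
    return 0;
}
```

With FD_SETSIZE at 1024, safe_fd_set(1023, &set) succeeds and
safe_fd_set(1024, &set) is refused.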

The SGE code has a bug in that it does not properly check this limit at
startup (there's just a debug log message at runtime if the number of
actual connections exceeds FD_SETSIZE, but this has no consequences).

The workaround is to limit the max. number of fd's at qmaster startup to
no more than 1024 - this should work with 6.0u3 or earlier (but not
6.0u2).
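
As an illustration of what such a startup check could look like, the
sketch below lowers the soft RLIMIT_NOFILE limit to FD_SETSIZE when it
is higher, which is similar in effect to lowering the limit with ulimit
before starting the daemon. The function name
clamp_nofile_to_fd_setsize is hypothetical; this is an assumption about
one possible fix, not what qmaster actually does.

```c
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical startup check (not SGE code): make sure the process can
 * never be handed a descriptor that does not fit in an fd_set.
 * Returns 1 if the soft limit was lowered, 0 if it was already within
 * FD_SETSIZE, -1 on error. */
int clamp_nofile_to_fd_setsize(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }
    if (rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur <= (rlim_t)FD_SETSIZE) {
        return 0;   /* nothing to do */
    }
    /* Lowering the soft limit needs no privileges; the hard limit is
     * left untouched. */
    rl.rlim_cur = (rlim_t)FD_SETSIZE;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }
    return 1;
}
```

Called once before any sockets are opened, this caps the number of open
descriptors at FD_SETSIZE, at the cost of also capping the number of
simultaneous connections.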

Question to the community: I think poll() is the correct alternative to
circumvent this problem. Am I right? Or are there other limitations with
poll() (e.g. speed)?
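
For reference, a minimal sketch of waiting on a list of descriptors with
poll() instead of select(). The helper wait_readable and its parameters
are hypothetical and not taken from the commlib. The point is that
poll() takes an array of struct pollfd sized by the caller, so the
descriptor numbers are not bounded by FD_SETSIZE.

```c
#include <poll.h>
#include <stdlib.h>

/* Hypothetical helper (not SGE code): wait up to timeout_ms for any of
 * the nfds descriptors in fds[] to become readable.
 * ready[i] is set to 1 if fds[i] is readable or its peer hung up,
 * otherwise 0.  Returns poll()'s result: the number of descriptors with
 * events, 0 on timeout, -1 on error. */
int wait_readable(const int *fds, int nfds, int timeout_ms, int *ready)
{
    struct pollfd *pfds;
    int i, n;

    if (nfds <= 0) {
        return 0;
    }
    pfds = calloc((size_t)nfds, sizeof *pfds);
    if (pfds == NULL) {
        return -1;
    }
    for (i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
    }
    n = poll(pfds, (nfds_t)nfds, timeout_ms);
    for (i = 0; i < nfds; i++) {
        ready[i] = (n > 0 && (pfds[i].revents & (POLLIN | POLLHUP)) != 0);
    }
    free(pfds);
    return n;
}
```

On the speed question: like select(), poll() scans the whole array on
every call, so the cost per call still grows linearly with the number
of descriptors.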

Thanks,
Andy


> Regards.....mac
> Mac McCalla
> --------------------------
> Sent from my BlackBerry Wireless Handheld
>
>
> -----Original Message-----
> From: Andy Schwierskott <andy.schwierskott at sun.com>
> To: users at gridengine.sunsource.net <users at gridengine.sunsource.net>
> Sent: Fri Mar 11 02:21:06 2005
> Subject: RE: [GE users] sge v6.0u3 new installation issue with more
> than 1021 hosts.
>
> Mac,
>
> yes, please open an issue and attach your qping dump.
>
> Andy
>
>>> Ron Chen wrote:
>>> OK, so if you stop the execd on one of the 1021 hosts,
>>> would the 1022th host work?
>>
>> i was using the lines in "netstat -a |grep qmast|wc" as the number but
>> turns out there is 1 more line than hosts, so there are actually 1020
>> hosts connected ok...then when the 1021'st tries to connect it appears
>> that qmaster stops accepting new connections.  the answer to your
>> question in principle is yes.  if  i go to one of the connected hosts
>> and kill the execd (can't shut it down nicely because it appears that
>> takes a new connection? and hangs), then in a few minutes, qhost..etc
>> commands will work and i can start an execd from another host...then
>> commands will stop working....
>>
>>> Any error messages in the qmaster log file? And does
>>> strace tell you anything?
>>
>> No error messages in the qmaster log file other than ones previously
>> mentioned which i believe are the same as documented in issue 1431 and
>> of no consequence.
>>
>> strace of an execd starting on a new host (number 1021) did not show
>> me anything except lots of
>>               gettimeofday(....)=0
>>               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
>>               gettimeofday (....)
>>               gettimeofday (....)
>>               rt_sigprocmask(SIG_BLOCK,......)
>>               gettimeofday (....)
>>               gettimeofday (....)
>>               gettimeofday (....)
>>               gettimeofday (....)
>>               rt_sigprocmask(SIG_SETMASK,........)
>>               getpid() = 24197
>>               write  some trace stuff since i had dl set to 1
>>               write some more trace info
>>               gettimeofday (....)
>>                 several more times
>>               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
>>               repeat sequence
>>
>>
>> Perhaps of more interest is results of running qping on new host and
>> qmaster at same time.
>> host b00501 is new host.  command is "qping -info beo297 538 qmaster 1"
>> terminates after some time with message
>> endpoint beo297.ihess.com/qmaster/1 at port 538: can't find connection
>>
>> qmaster is running on host beo297.  command is qping -dump beo297 538
>> qmaster 1
>> qping -dump command was running before qping on b00501 is executed.
>> time of day difference of +5m30s from b00501 to beo297.  i have
>> attached gzip'd file of qping -dump command (about 56k) in size.
>>
>> should i open this as an issue??
>>
>> regards...mac mccalla
>>
>>
>>> -Ron
>>
>>
>> -----Original Message-----
>> From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
>> Sent: Thursday, March 10, 2005 11:37 AM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] sge v6.0u3 new installation issue with more
>> than 1021 hosts.
>>
>> OK, so if you stop the execd on one of the 1021 hosts,
>> would the 1022th host work?
>>
>> Any error messages in the qmaster log file? And does
>> strace tell you anything?
>>
>> -Ron
>>
>> --- "McCalla, Mac" <macmccalla at hess.com> wrote:
>>> qmaster stopped responding on port 538 to any further requests from
>>> additional execd's or commands (qstat, qhost, etc).  the ulimit for
>>> fd's is set at 4096 at qmaster startup (the info message at qmaster
>>> startup says qmaster will use 4076 file descriptors for
>>> communication).  Has anyone else seen this problem or have a 6.0u3
>>> installation with more hosts?
>>>
>>> thanks in advance,
>>> Mac McCalla

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: dev-help at gridengine.sunsource.net


