[GE users] sge v6.0u3 new installation issue with more than 1021 hosts.

McCalla, Mac macmccalla at hess.com
Thu Mar 10 19:32:35 GMT 2005


> Ron Chen wrote:
> OK, so if you stop the execd on one of the 1021 hosts,
> would the 1022th host work?

I was using the line count from "netstat -a |grep qmast|wc" as the number of
hosts, but it turns out there is 1 more line than hosts, so there are actually
1020 hosts connected OK. When the 1021st tries to connect, it appears that
qmaster stops accepting new connections. The answer to your question is, in
principle, yes. If I go to one of the connected hosts and kill the execd
(I can't shut it down nicely because that appears to need a new connection,
and it hangs), then in a few minutes qhost and the other commands work again
and I can start an execd on another host. Then the commands stop working
again.
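
A rough sketch of the counting step above, in Python. It assumes netstat
prints one line per socket with the TCP state in the last column, and that
the extra line is the listening socket (a guess, not confirmed here). The
function name and the sample lines, host names and ports are made up for
illustration and are not taken from the actual cluster.

```python
def count_connected(netstat_output, pattern="qmast"):
    """Count ESTABLISHED sockets whose netstat line contains `pattern`.

    Assumes the TCP state is the last whitespace-separated field,
    as in typical Linux `netstat -a` output.
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if pattern in line and fields and fields[-1] == "ESTABLISHED":
            count += 1
    return count


# Made-up sample: one listening socket plus two connected execds.
SAMPLE = """\
tcp   0   0 *:sge_qmaster        *:*             LISTEN
tcp   0   0 beo297:sge_qmaster   host001:32771   ESTABLISHED
tcp   0   0 beo297:sge_qmaster   host002:32805   ESTABLISHED
"""

print(count_connected(SAMPLE))  # 2, where a plain grep | wc reports 3 lines
```

Filtering on the state column avoids the off-by-one that the plain line
count gives, though any other client connected to qmaster at that moment
would still be counted.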

>Any error messages in the qmaster log file? And does
>strace tell you anything?

No error messages in the qmaster log file other than the ones previously
mentioned, which I believe are the same as documented in issue 1431 and of no
consequence.

strace of an execd starting on a new host (number 1021) did not show me
anything except lots of
               gettimeofday(....)=0
               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
               gettimeofday (....)
               gettimeofday (....)
               rt_sigprocmask(SIG_BLOCK,......)
               gettimeofday (....)
               gettimeofday (....)
               gettimeofday (....)
               gettimeofday (....)
               rt_sigprocmask(SIG_SETMASK,........)
               getpid() = 24197
               write  some trace stuff since i had dl set to 1
               write some more trace info
               gettimeofday (....)
                 several more times
               select(5,[3 4],[], NULL, {1,0}) = 0 (Timeout)
               repeat sequence


Perhaps of more interest are the results of running qping on the new host and
on the qmaster host at the same time.
Host b00501 is the new host. The command "qping -info beo297 538 qmaster 1"
terminates after some time with the message
endpoint beo297.ihess.com/qmaster/1 at port 538: can't find connection

qmaster is running on host beo297. The command there is
"qping -dump beo297 538 qmaster 1". The qping -dump command was already
running before qping was executed on b00501. There is a time-of-day
difference of +5m30s from b00501 to beo297. I have attached a gzip'd file of
the qping -dump output (about 56k in size).
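
To line up events between the two hosts, the clock difference has to be
applied to one side. A minimal sketch, assuming "+5m30s from b00501 to
beo297" means the beo297 clock reads 5m30s ahead of b00501 (the sign
convention is an assumption); the function name and the example timestamp
are hypothetical and not taken from the attached dump.

```python
from datetime import datetime, timedelta

# Offset reported in this message; the sign convention is an assumption.
CLOCK_OFFSET = timedelta(minutes=5, seconds=30)


def to_qmaster_time(new_host_time):
    """Convert a timestamp read on b00501 to the beo297 (qmaster) clock."""
    return new_host_time + CLOCK_OFFSET


# Hypothetical example time on b00501.
print(to_qmaster_time(datetime(2005, 3, 10, 13, 0, 0)))  # 2005-03-10 13:05:30
```

If the offset turns out to run the other way, negating CLOCK_OFFSET is the
only change needed.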

Should I open this as an issue?

regards...mac mccalla   
      

> -Ron


-----Original Message-----
From: Ron Chen [mailto:ron_chen_123 at yahoo.com] 
Sent: Thursday, March 10, 2005 11:37 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] sge v6.0u3 new installation issue with more than
1021 hosts.

OK, so if you stop the execd on one of the 1021 hosts,
would the 1022th host work?

Any error messages in the qmaster log file? And does
strace tell you anything?

 -Ron

--- "McCalla, Mac" <macmccalla at hess.com> wrote:
> qmaster stopped responding on port
> 538 to any further
> requests from additional execd's or commands
> (qstat,qhost
> ,etc).   the ulimit for fd's is set at 4096 at
> qmaster startup (the info
> message at qmaster startup says qmaster will use
> 4076 file
> descriptors for communication).  Has anyone else see
> this problem or
> have a 6.0u3 installation with more hosts?  
> 
> thanks in advance,
> Mac McCalla 
> 
> 


		

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




    [ Part 2, "qping.dump.gz"  Application/X-GZIP (Name: "qping.dump.gz") ]
    [ 58 KB. ]
    [ Unable to print this part. ]




