[GE users] SSH connection refused - intermittent
aeverett at forteds.com
Mon Mar 22 17:36:07 GMT 2010
Yes, indeed it is running as real user root:
sgeadmin root /tools/sge/bin/lx24-amd64/sge_execd
We don't run any interactive jobs, and I've never seen the number of processes on a machine exceed ~2000, though I don't have any data on how many of those are ssh. The exec hosts are all dual quad-core blades with 16 GB RAM. Load stays steady in the mid-30s and we stay out of swap. There should be plenty of CPU for ssh and sshd, so I believe we're hitting some sort of configured connection limit or rate control.
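One way to collect the missing data on how often connections are refused would be to probe a host directly, outside of any job. The following is only a minimal sketch in Python: the names `measure_refusals` and `refusal_rate` are made up for illustration and are not part of SGE or OpenSSH, and the host and port are whatever you choose to probe. It exercises the TCP handshake only, so it can separate a refusal at the TCP level from a failure later in the SSH exchange, but it does not reproduce the load of real jobs.

```python
import socket
import time


def measure_refusals(host, port, attempts=200, timeout=2.0, pause=0.0):
    """Open and close `attempts` TCP connections and count each outcome."""
    counts = {"ok": 0, "refused": 0, "timeout": 0, "other": 0}
    for _ in range(attempts):
        try:
            # Only the TCP handshake is exercised; no SSH traffic is sent.
            with socket.create_connection((host, port), timeout=timeout):
                counts["ok"] += 1
        except ConnectionRefusedError:
            # The peer answered with a reset: "Connection refused".
            counts["refused"] += 1
        except socket.timeout:
            counts["timeout"] += 1
        except OSError:
            # Anything else (unreachable host, name lookup failure, ...).
            counts["other"] += 1
        if pause:
            time.sleep(pause)
    return counts


def refusal_rate(counts):
    """Fraction of all attempts that ended in 'Connection refused'."""
    total = sum(counts.values())
    return counts["refused"] / total if total else 0.0
```

For example, `refusal_rate(measure_refusals("grid057", 22, attempts=1000))` could be run from a submit host while the cluster is busy and again while it is idle, to see whether the rate tracks load.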
On Mon, Mar 22, 2010 at 10:32 AM, reuti <reuti at staff.uni-marburg.de> wrote:
On 22.03.2010 at 15:26, giftedplacebo wrote:
> All machines are running sge_execd as user sgeadmin (an account we created for running all our sge_ processes).
When it's working for all users in general, I assume it is running as real user root.
$ ps -e f -o user,ruser,command
USER RUSER COMMAND
sgeadmin root /usr/sge/bin/lx24-x86/sge_execd
> The problem occurs on all machines.
> Since my original email, I bumped ConnectionAttempts from 20 to 50, and the errors have gone away, but I believe it is only masking the problem by allowing ssh more attempts to try to connect. I'd like to solve the underlying problem.
How often do your jobs make an SSH connection? And/or is this only happening for interactive jobs?
> Best regards,
> On Sat, Mar 20, 2010 at 6:30 PM, reuti <reuti at staff.uni-marburg.de> wrote:
> On 15.03.2010 at 16:01, giftedplacebo wrote:
> > We have been running sge for several years now, currently running
> > 6.0u10. We recently started seeing lots of ssh failures (5-9%) like
> > the following:
> > ssh: connect to host grid057.<mydomain>.com port 45364: Connection refused
> Is the execd running on some machines not as root? Or is this
> happening on all machines in the cluster and not only certain ones?
> -- Reuti
> > grid057 appears to accept the connection; this is the corresponding
> > /var/log/messages entry:
> > Mar 9 06:36:15 grid057 sshd: Accepted publickey for <username> from 172.16.14.157 port 45364 ssh2
> > (<mydomain> and <username> have been removed for privacy.)
> > On all grid nodes I have selinux and iptables disabled.
> > sshd is running with the following /etc/ssh/sshd_config
> > X11Forwarding yes
> > PrintMotd no
> > MaxStartups 10000:1:10000
> > Subsystem sftp /usr/libexec/openssh/sftp-server
> > /etc/ssh/ssh_config on all nodes is:
> > Host *
> > RhostsRSAAuthentication yes
> > StrictHostKeyChecking no
> > ConnectionAttempts 20
> > I have also set the following to 3000:
> > /proc/sys/net/core/netdev_max_backlog
> > /proc/sys/net/core/somaxconn
> > The problem is across all machines, and only affects ~5-9% of ssh
> > connections. I don't see any error messages on the machines, just
> > the ssh failure notice in our job log files. Does anyone have ideas
> > or tips on tuning ssh/sshd? Thanks!
> > Best regards,
> > Aaron
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
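For reference, the `MaxStartups 10000:1:10000` line quoted in the thread uses sshd's `start:rate:full` form. As described in the sshd_config man page, sshd begins dropping new connections with probability rate/100 once `start` unauthenticated connections are pending, and that probability rises linearly to 100% at `full`. The sketch below restates that documented rule in Python; the function name `drop_probability` is hypothetical, and the code mirrors the man page's description rather than the OpenSSH source.

```python
def drop_probability(pending, start, rate, full):
    """Chance (0.0 to 1.0) that sshd drops a new connection.

    pending: unauthenticated connections currently open
    start, rate, full: the three fields of MaxStartups start:rate:full
    """
    if pending < start:
        return 0.0  # below the threshold nothing is dropped
    if pending >= full:
        return 1.0  # at or above the ceiling everything is dropped
    # Between start and full, interpolate linearly from rate% to 100%.
    percent = rate + (100 - rate) * (pending - start) / (full - start)
    return percent / 100.0
```

With the values from the thread, `drop_probability(n, 10000, 1, 10000)` is zero for any `n` below 10000, so under this rule MaxStartups would only come into play with far more pending connections than the ~2000 processes reported per machine.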