[GE users] SSH connection refused - intermittent

reuti reuti at staff.uni-marburg.de
Mon Mar 22 14:32:47 GMT 2010


Hi,

On 22.03.2010 at 15:26, giftedplacebo wrote:

> All machines are running sge_execd as user sgeadmin (an account we created for running all our sge_ processes).

Since it is working for all users in general, I assume it is running with root as the real user:

$ ps -e f -o user,ruser,command
USER     RUSER    COMMAND
...
sgeadmin root     /usr/sge/bin/lx24-x86/sge_execd
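
To repeat this check on every node without reading the listing by eye, a sketch like the following might help. It assumes `ps` output with the three columns shown above, but without the `f` forest option, which prefixes tree characters to the command column. `check_execd_ruser` is a hypothetical helper name, not part of SGE.

```shell
# Hypothetical helper: reads "USER RUSER COMMAND" lines on stdin
# (as printed by `ps -e -o user,ruser,command`) and reports every
# sge_execd process whose real user is not root.
# Exit status: 0 if all execds have real user root, 1 otherwise.
check_execd_ruser() {
    awk '$3 ~ /sge_execd$/ && $2 != "root" {
             print "real user is not root: " $0
             bad = 1
         }
         END { exit bad + 0 }'
}

# Usage on a node:
#   ps -e -o user,ruser,command | check_execd_ruser
```

Matching on the third field only keeps the `awk` process itself out of the result, even though its command line contains the string `sge_execd`.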


> The problem occurs on all machines.
> 
> Since my original email, I bumped ConnectionAttempts from 20 to 50, and the errors have gone away, but I believe it is only masking the problem by allowing ssh more attempts to try to connect. I'd like to solve the underlying problem.

How often do your jobs make an SSH connection? And/or is this only happening for interactive jobs?
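
To put numbers on this, a rough sketch like the one below could count the refused connections per target host in the job log files. It assumes each error sits on a single line in the format quoted in the original report ("ssh: connect to host HOST port PORT: Connection refused"). `count_refused` and the log path in the usage comment are hypothetical.

```shell
# Hypothetical helper: counts "Connection refused" ssh errors per
# target host in the given job log files, most affected host first.
count_refused() {
    grep -h 'ssh: connect to host .* Connection refused' "$@" |
        awk '{ n[$5]++ } END { for (h in n) print n[h], h }' |
        sort -rn
}

# Usage (the path is an assumption, adjust to where your job logs live):
#   count_refused /path/to/joblogs/*.o*
```

If the counts cluster on a few hosts or on certain job types, that narrows the search more than raising ConnectionAttempts does.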

-- Reuti


> Best regards,
> Aaron 
> 
> 
> On Sat, Mar 20, 2010 at 6:30 PM, reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
> 
> On 15.03.2010 at 16:01, giftedplacebo wrote:
> 
> > We have been running sge for several years now, currently running
> > 6.0u10. We recently started seeing lots of ssh failures (5-9%) like
> > the following:
> >
> > ssh: connect to host grid057.<mydomain>.com port 45364: Connection refused
> 
> Is the execd on some machines not running as root? Or is this
> happening on all machines in the cluster and not only on certain ones?
> 
> -- Reuti
> 
> 
> > grid057 appears to accept the connection; this is the corresponding
> > /var/log/messages entry:
> >
> > Mar  9 06:36:15 grid057 sshd[14936]: Accepted publickey for <username> from 172.16.14.157 port 45364 ssh2
> >
> > (<mydomain> and <username> have been removed for privacy.)
> >
> > On all grid nodes I have selinux and iptables disabled.
> >
> > sshd is running with the following /etc/ssh/sshd_config
> >
> > X11Forwarding yes
> > PrintMotd no
> > MaxStartups 10000:1:10000
> > Subsystem       sftp    /usr/libexec/openssh/sftp-server
> >
> > /etc/ssh/ssh_config on all nodes is:
> >
> > Host *
> >    RhostsRSAAuthentication yes
> >    StrictHostKeyChecking no
> >    ConnectionAttempts 20
> >
> > I have also set the following to 3000:
> >
> > /proc/sys/net/core/netdev_max_backlog
> > /proc/sys/net/core/somaxconn
> >
> > The problem is across all machines, and only affects ~5-9% of ssh
> > connections. I don't see any error messages on the machines, just
> > the ssh failure notice in our job log files. Does anyone have ideas
> > or tips on tuning ssh/sshd? Thanks!
> >
> > Best regards,
> > Aaron
> >
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=250055
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=250506
