[GE users] effect of automounting home folders in SGE environment under largish qmake loads?

Reuti reuti at staff.uni-marburg.de
Wed Oct 15 19:33:20 BST 2008


Hi Chris,

On 15.10.2008 at 20:11, Chris Dagdigian wrote:

> Trying to debug a partial application failure where the most  
> obvious STDERR looks exactly like what one would expect if  
> passwordless SSH hostkeys were missing or messed up:
>
>>> Permission denied, please try again.
>>> Permission denied, please try again.
>>> Permission denied (publickey,gssapi-with-mic,password).
>>> error: error reading returncode of remote command
>
> Of course manually SSH'ing into these nodes works perfectly and all  
> the permission/UID/GID stuff looks great. No problem with the SSH  
> key files from what I can tell.
>
> The application is the Solexa pipeline which is using "qmake" under  
> the hood to shotgun out lots of short and long running tasks.
>
> I just realized that this cluster is automounting individual user  
> home folders at login time.
>
> One explanation for random "permission denied" issues that appear  
> SSH key related would be if the cluster was under heavy load and  
> automount was hammered - missing SSH hostkeys on a node would  
> certainly cause the errors above if the automount was failing or  
> timing out on some or all of the nodes.
>
> I'm not an automount user myself so I wanted to run this by the  
> list -- it feels "right" to me that a heavy workload making use of  
> heavy qmake (aka 'qrsh') calls is going to put some stress on  
> automount as the folders get mounted (and presumably unmounted) as  
> tasks are scattered across nodes. And any automount delays or  
> failures with a home folder would mean that the SSH keys would not  
> be accessible and that would cause the login/authentication issues  
> I've been seeing.
>
> Is that a valid guess or am I grasping at straws here? I'm going to  
> recommend that automount be replaced with a static mount of /home  
> before we try to reproduce the error.

Yes, that is plausible. If the automount fails or times out, the slave  
task on a node, i.e. the sshd started there, can't access the  
~/.ssh/authorized_keys file and refuses the connection.
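
One way to check this from inside a job would be to time the first read of the file on each node. Below is a minimal sketch in Python, not a tested tool: the names probe and report are made up for illustration and are not part of SGE or OpenSSH. It also only shows what the job's user sees. By the time the job runs, the home folder may already be mounted, so a fast "OK" does not prove that sshd saw the file a moment earlier.

```python
import os
import socket
import time


def probe(path):
    """Try to open and read one byte of `path`; return (ok, seconds, detail)."""
    start = time.monotonic()
    try:
        with open(path, "rb") as fh:
            fh.read(1)
        ok, detail = True, "readable"
    except OSError as exc:
        # Covers a missing file, permission problems and I/O errors from a bad mount.
        ok, detail = False, f"{type(exc).__name__}: {exc.strerror}"
    return ok, time.monotonic() - start, detail


def report(path=None):
    """Format a one-line result for this host; defaults to the user's authorized_keys."""
    if path is None:
        path = os.path.expanduser("~/.ssh/authorized_keys")
    ok, seconds, detail = probe(path)
    status = "OK" if ok else "FAIL"
    return f"{socket.gethostname()} {path} {status} {seconds:.3f}s {detail}"
```

Calling print(report()) from a small job submitted to many nodes at once would give one line per node. Slow or failing reads on some of them would point to the automounter.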

On one or a few of the nodes you can set "LogLevel VERBOSE" in  
/etc/ssh/sshd_config. On the client side, at least, -v -v -v lists  
the SSH keys that were found. Maybe the daemon does the same for login  
attempts and states that no authorized_keys file was found.

-- Reuti
