[GE users] Seemingly random node crashes

craffi dag at sonsorol.org
Fri Apr 23 19:49:40 BST 2010


I see this on Apple clusters when the shared fileserver is totally
overwhelmed. The SSH login hangs because the user's home directory is
no longer accessible: the shared mount is hung, frozen or gone.

Is there a chance that your problem is caused by your NFS server falling 
over?
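
One rough way to check this from a node that is misbehaving is to see
whether the home directory still answers a stat within a few seconds.
The sketch below is only an illustration: the function name
mount_responds and the 5 second default are my own choices, not
anything from SGE or OS X. The stat runs in a child process because a
call into a hung NFS mount can block indefinitely, and the parent
should not get stuck along with it.

```python
import subprocess
import sys


def mount_responds(path, timeout=5.0):
    """Return True if `path` can be stat'ed within `timeout` seconds.

    Hypothetical helper: the stat is done in a child process so that a
    hung mount blocks the child rather than this script.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 means the stat succeeded.
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible I/O may
        # linger until the mount recovers, so we do not wait for it.
        child.kill()
        return False


if __name__ == "__main__":
    # Pass the directory to probe, e.g. a user's home directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(target, "responds" if mount_responds(target) else "does NOT respond")
```

If this reports that the home directory does not respond on a "crashed"
node while a local path such as / still does, the shared mount is the
likely culprit rather than SSH or SGE itself.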

-Chris



biostat wrote:
> My department bought an Apple server cluster (one head node, 5 daughter nodes), and installed SGE on it. Now we are having problems with nodes randomly crashing. There isn't any immediately obvious pattern, except that they are always running something when the crash happens.
>
> The crash itself is the interesting bit: the nodes themselves still function as machines, they are just incapable of communication over the network *via ssh only*. They can still ping other computers, but they can't be SSHed into from another computer. The SSH always hangs right after displaying the post-login info (eg "Last login: Fri Apr  9 14:13:30 2010 from head.cluster.private"). Furthermore, there are artifacts in the system when we manually connect a screen to the downed nodes. We can log in to a downed node perfectly fine through Finder when it is physically connected to a screen and keyboard, but when we start up the Terminal app, the session hangs right after login (just like when trying to SSH in). I managed to get around this by going into the terminal preferences and telling it to use the command 'login -l' at startup and unchecking 'Run inside shell' (if it runs it inside the startup shell, the 'login -l' command hangs). I can also get around terminal not working by using the X11 terminal.
>
> Also, when a node is down in the manner I described, Activity Monitor on that node will crash if a job is selected and then the 'info' button is hit.
>
> Obviously this is a UNIX level problem -- still, it ONLY happens when running processes through SGE (albeit intermittently). The symptoms I listed above seem diffuse and mostly unrelated, but clearly there's gotta be something tying it all together. Any ideas?
>
> Thanks ahead of time!
> Adam
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=254649
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=254652



