[GE users] Seemingly random node crashes
adam at greenhodge.net
Fri Apr 23 19:35:13 BST 2010
My department bought an Apple server cluster (one head node, 5 daughter nodes), and installed SGE on it. Now we are having problems with nodes randomly crashing. There isn't any immediately obvious pattern, except that they are always running something when the crash happens.
The crash itself is the interesting bit: the nodes still function as machines, and only SSH is affected. They can still ping other computers, but they can't be SSHed into from another computer. The SSH session always hangs right after displaying the post-login info (e.g. "Last login: Fri Apr 9 14:13:30 2010 from head.cluster.private"). Furthermore, there are artifacts on the downed nodes themselves when we connect a screen and keyboard directly. We can log in to a downed node perfectly fine through Finder, but when we start up the Terminal app, the session hangs right after login (just like when trying to SSH in). I managed to get around this by going into the Terminal preferences, telling it to use the command 'login -l' at startup, and unchecking 'Run inside shell' (if it runs inside the startup shell, the 'login -l' command hangs). I can also get around Terminal not working by using the X11 terminal.
Also, when a node is down in the manner I described, Activity Monitor on that node will crash if a job is selected and the 'info' button is then hit.
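For anyone wanting to catch a node in this state as soon as it happens, the "pings fine but SSH hangs" symptom could be checked for with something like the rough Python sketch below. It is only an illustration under a few assumptions: the helper names (`run_ok`, `classify`, `check_node`) are made up for this example, key-based SSH login is assumed to be set up so that `BatchMode=yes` never waits for a password, and it is an untested guess that a non-interactive `ssh host true` hangs the same way the interactive login does.

```python
import subprocess


def run_ok(cmd, timeout):
    """Return True if cmd exits with status 0 within `timeout` seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hang (timeout) or a missing binary both count as failure.
        return False
    return result.returncode == 0


def classify(ping_ok, ssh_ok):
    """Map the two probe results onto the states described in this post."""
    if ping_ok and ssh_ok:
        return "healthy"
    if ping_ok:
        # The symptom from this thread: node answers ping, SSH does not complete.
        return "pingable, ssh hung or failed"
    return "unreachable"


def check_node(host, timeout=10):
    """Probe one node with a single ping, then a trivial remote command."""
    ping_ok = run_ok(["ping", "-c", "1", host], timeout)
    # Only try SSH if the node answered ping; BatchMode avoids password prompts.
    ssh_ok = ping_ok and run_ok(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"],
        timeout,
    )
    return classify(ping_ok, ssh_ok)
```

Calling `check_node("node1.cluster.private")` (a hypothetical hostname) from the head node every few minutes and logging the result with a timestamp might help line up the moment a node goes bad with whichever SGE job was running on it at the time.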
Obviously this is a UNIX-level problem -- still, it ONLY happens when running processes through SGE (albeit intermittently). The symptoms I listed above seem diffuse and mostly unrelated, but clearly there's gotta be something tying them all together. Any ideas?
Thanks ahead of time!
To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].