[GE users] LOTS of GE 6.1u2: Job 1157200 failed

Bevan C. Bennett bevan at fulcrummicro.com
Wed Jul 23 00:04:43 BST 2008



Ok, we're still getting a few of these, but I've narrowed things down a
bit. The first time a job fails it looks like this (excerpt):

07/21/2008 05:15:22 [0:23147]: write_to_qrsh - address = iodine:39168
07/21/2008 05:15:22 [0:23147]: write_to_qrsh - host = iodine, port = 39168
07/21/2008 05:15:22 [0:23147]: waiting for connection.
07/21/2008 05:15:22 [0:23147]: accepted connection on fd 2
07/21/2008 05:15:22 [0:23147]: daemon to start: |/usr/sbin/sshd-grid -i|
07/21/2008 05:15:30 [5143:23146]: wait3 returned 23147 (status: 65280;
WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 255)
07/21/2008 05:15:30 [5143:23146]: job exited with exit status 255
07/21/2008 05:15:30 [5143:23146]: reaped "job" with pid 23147
07/21/2008 05:15:30 [5143:23146]: job exited not due to signal
07/21/2008 05:15:30 [5143:23146]: job exited with status 255
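
For what it's worth, the status word 65280 decodes exactly as the trace says: it is 0xFF00, i.e. a normal exit with code 255, which is what ssh-family programs typically return when startup or the connection fails. A quick sketch using Python's POSIX wait-status helpers (the same tests the shepherd logs as WIFSIGNALED/WIFEXITED/WEXITSTATUS):

```python
import os

# Raw status word reported by wait3 in the shepherd trace above.
status = 65280  # 0xFF00

# These mirror the POSIX wait-status macros the shepherd logs.
print(os.WIFSIGNALED(status))   # not killed by a signal
print(os.WIFEXITED(status))     # exited normally...
print(os.WEXITSTATUS(status))   # ...with exit code 255
```

So the daemon really did start and then exit with 255 on its own; the kernel did not signal it.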

After that it sets the queue to the error state and tries to run in a
different queue, at which point the log looks like this:

07/21/2008 05:15:49 [0:32305]: write_to_qrsh - address = iodine:39168
07/21/2008 05:15:49 [0:32305]: write_to_qrsh - host = iodine, port = 39168
07/21/2008 05:15:49 [0:32305]: error connecting stream socket:
Connection refused
07/21/2008 05:15:49 [0:32305]: communication with qrsh failed
07/21/2008 05:15:49 [0:32305]: forked "job" with pid 0
07/21/2008 05:15:49 [0:32305]: child: job - pid: 0
07/21/2008 05:15:49 [0:32305]: wait3 returned -1
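
By the second attempt nothing is listening on iodine:39168 any more, so the connect fails outright with ECONNREFUSED. A minimal probe (a hypothetical helper, not part of GE) that makes the same TCP connect write_to_qrsh does can confirm from a compute node whether the qrsh port is actually open:

```python
import socket

def probe_qrsh_port(host, port, timeout=5.0):
    """Attempt a plain TCP connect, like write_to_qrsh; report the failure if any."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "connected"
    except OSError as e:          # ECONNREFUSED, timeouts, DNS failures, ...
        return "refused/failed: %s" % e
    finally:
        s.close()

# Host/port taken from the trace above; substitute your submit host and port.
print(probe_qrsh_port("iodine", 39168))
```

If this reports refused while the job is still nominally running, the qrsh side has already torn down (or never re-opened) its listener, which would match the cascade we see.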

It then sets that queue to error, and proceeds to do the same thing to
every available queue in turn, bringing the whole cluster down.

The problem looks to be that sshd on the initial node fails to start
somehow and the qrsh socket then gets locked up. This is a really horrible
failure mode, since it takes the entire cluster down whenever it occurs.

Any good ideas on where we can look to pin this down further, find
a workaround, or (even better) solve the issue completely?

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



