[GE users] LOTS of GE 6.1u2: Job 1157200 failed

Reuti reuti at staff.uni-marburg.de
Wed Jul 23 00:30:03 BST 2008


On 23.07.2008, at 01:04, Bevan C. Bennett wrote:

> Ok, we're still getting a few of these, but I've narrowed things down
> a bit. The first time a job fails it looks like this (excerpt):
>
> 07/21/2008 05:15:22 [0:23147]: write_to_qrsh - address = iodine:39168
> 07/21/2008 05:15:22 [0:23147]: write_to_qrsh - host = iodine, port = 39168
> 07/21/2008 05:15:22 [0:23147]: waiting for connection.
> 07/21/2008 05:15:22 [0:23147]: accepted connection on fd 2
> 07/21/2008 05:15:22 [0:23147]: daemon to start: |/usr/sbin/sshd-grid -i|
> 07/21/2008 05:15:30 [5143:23146]: wait3 returned 23147 (status: 65280; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 255)
> 07/21/2008 05:15:30 [5143:23146]: job exited with exit status 255
> 07/21/2008 05:15:30 [5143:23146]: reaped "job" with pid 23147
> 07/21/2008 05:15:30 [5143:23146]: job exited not due to signal
> 07/21/2008 05:15:30 [5143:23146]: job exited with status 255
>
> After that it sets the queue to an error state and tries to run in a
> different queue, at which point it looks like this:
>
> 07/21/2008 05:15:49 [0:32305]: write_to_qrsh - address = iodine:39168
> 07/21/2008 05:15:49 [0:32305]: write_to_qrsh - host = iodine, port = 39168
> 07/21/2008 05:15:49 [0:32305]: error connecting stream socket: Connection refused
> 07/21/2008 05:15:49 [0:32305]: communication with qrsh failed
> 07/21/2008 05:15:49 [0:32305]: forked "job" with pid 0
> 07/21/2008 05:15:49 [0:32305]: child: job - pid: 0
> 07/21/2008 05:15:49 [0:32305]: wait3 returned -1
>
> It then sets that queue to error, and proceeds to do the same thing to
> every available queue in turn, bringing the whole cluster down.
>
> The problem looks to be that the sshd on the initial node somehow fails
> to start, and the socket then gets locked up. This is a really horrible
> failure mode, since it takes the entire cluster down whenever it occurs.
>
> Any good ideas about where we can look to help pin this down further,
> find a workaround, or (even better) solve the issue completely?

Are any of the file systems full?
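A quick way to check on the execution host; the mount points below are only examples, look wherever SGE's spool, /tmp and the home directories live:

  df -h / /tmp /var $SGE_ROOT
  df -i / /tmp    # inodes can be exhausted even when df -h looks fine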

Is sshd-grid "special"?
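You can see how it is wired into the configuration with something like the following (the output shown is only an example of a typical tight SSH integration setup):

  qconf -sconf | egrep 'rsh_|rlogin_|qlogin_'
  # e.g.:
  #   rsh_daemon     /usr/sbin/sshd-grid -i
  #   rsh_command    /usr/bin/ssh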

Is it a private cluster, so that you could even use SGE's rsh?
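A minimal sketch of such a test -- please check sge_conf(5) for your version, the values below are just the usual defaults as far as I remember:

  qconf -mconf
  # set the interactive entries back so SGE uses its bundled rsh/rshd, e.g.:
  #   rsh_command    none
  #   rsh_daemon     none
  # then resubmit one of the failing jobs and see whether it still
  # drives the queues into an error state.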

-- Reuti

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



