[GE users] Shepherd errors from time to time on SGE_6u10

Rayson Ho rayrayson at gmail.com
Tue May 15 23:02:12 BST 2007


Can you provide more info?

Details such as the OS version, the hardware, and whether $SGE_ROOT is
shared (and if so, by what) would help.
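
One possible way to gather those details on an execution host is sketched below. It is only an illustration, not an SGE tool: the helper names `find_mount` and `collect_report` are made up for this example, and it assumes a Linux host where `/proc/mounts` is readable. It reports the OS and machine type, the value of `$SGE_ROOT`, and the filesystem type that `$SGE_ROOT` lives on (for example `nfs` if it is shared that way).

```python
import os
import platform


def find_mount(path, mounts_text):
    """Return (device, mount_point, fs_type) for the mount containing path.

    mounts_text is expected in /proc/mounts format (Linux-only assumption).
    Mount points containing spaces (escaped as \\040 there) are not handled.
    """
    best = None
    wanted = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        device, mount_point, fs_type = fields[0], fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest matching mount point; on ties keep the later entry.
        if wanted.startswith(prefix):
            if best is None or len(mount_point) >= len(best[1]):
                best = (device, mount_point, fs_type)
    return best


def collect_report(mounts_path="/proc/mounts"):
    """Build a short text report of OS, machine type and SGE_ROOT location."""
    sge_root = os.environ.get("SGE_ROOT")
    u = platform.uname()
    lines = [
        f"OS: {u.system} {u.release} ({u.version})",
        f"Machine: {u.machine}",
        f"SGE_ROOT: {sge_root or '(not set)'}",
    ]
    if sge_root and os.path.exists(mounts_path):
        with open(mounts_path) as fh:
            mount = find_mount(os.path.realpath(sge_root), fh.read())
        if mount:
            device, mount_point, fs_type = mount
            lines.append(
                f"SGE_ROOT filesystem: {fs_type} from {device} "
                f"mounted on {mount_point}"
            )
    return "\n".join(lines)


if __name__ == "__main__":
    print(collect_report())
```

Running it on a couple of the affected nodes, with the SGE settings file sourced so that `$SGE_ROOT` is set, and pasting the output into a reply would cover most of the questions above.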

Rayson



On 5/15/07, Schenker, Martin <MSchenker at illumina.com> wrote:
> Hi all!
>
> We're getting errors like the ones below in our queue. They appear occasionally and bring down the whole job. They are not bound to one machine, but can appear on any of our 16 nodes. If the job is restarted, all goes well. So it looks like an intermittent problem, but I haven't found anything wrong with the setup. The queue is running 24/7, so why do some jobs produce errors like these?
>
> (from the job logs, three examples)
>
> cannot get connection to "shepherd" at host "mondas7"
> qmake[1]: *** [s_7_0146_int.txt] Error 1
> qmake[1]: *** Waiting for unfinished jobs....
> ssh_exchange_identification: Connection closed by remote host
> qmake[1]: *** [s_7_0149_int.txt] Error 129
> qmake[1]: *** [s_7_0134_int.txt] Error 137
> qmake: *** [nonrecursive] Error 2
>
> cannot get connection to "shepherd" at host "mondas9"
> qmake[2]: *** [s_3_0057_align.txt] Error 1
> qmake[2]: *** Waiting for unfinished jobs....
> qmake[2]: *** [s_3_0071_align.txt] Error 137
> qmake[1]: *** [GERALD_09-05-2007] Error 2
> qmake: *** [Bustard1.8.28_09-05-2007] Error 2
>
> cannot get connection to "shepherd" at host "mondas11"
> qmake[1]: *** [Matrix/s_2_0026_02_mat.txt] Error 1
> qmake[1]: *** Waiting for unfinished jobs....
> cannot get connection to "shepherd" at host "mondas11"
> qmake[1]: *** [Matrix/s_2_0027_02_mat.txt] Error 1
> qmake[1]: *** [s_8_0033_int.txt] Error 137
> qmake: *** [nonrecursive] Error 2
>
> I had hoped that moving from 6u8 to 6u10 would make this disappear, but so far we've only seen a reduction in errors. We're still getting a shepherd error every 2-3 days. Is there anything we can check, or are there any hints on how to fix this?
>
> Best, Martin
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net



