[GE users] Shepherd errors from time to time on SGE_6u10

Rayson Ho rayrayson at gmail.com
Wed May 16 15:29:22 BST 2007


Yes, or Martin can use local execd spooling and won't lose anything.
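
Before changing the spooling setup, it may help to confirm that the failures really are spread across hosts rather than clustered on a few. Here is a minimal Python sketch, assuming the job logs contain the message exactly as it appears in Martin's examples quoted below (cannot get connection to "shepherd" at host "..."). The function name and the log file arguments are hypothetical and not part of SGE:

```python
import re
import sys
from collections import Counter

# Matches lines such as: cannot get connection to "shepherd" at host "mondas7"
SHEPHERD_ERROR = re.compile(
    r'cannot get connection to "shepherd" at host "([^"]+)"'
)


def count_shepherd_errors(lines):
    """Return a Counter mapping host name -> number of shepherd connection errors."""
    hosts = Counter()
    for line in lines:
        match = SHEPHERD_ERROR.search(line)
        if match:
            hosts[match.group(1)] += 1
    return hosts


if __name__ == "__main__":
    # Hypothetical usage: python count_shepherd_errors.py job1.log job2.log
    totals = Counter()
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            totals.update(count_shepherd_errors(handle))
    for host, count in totals.most_common():
        print(f"{host}\t{count}")
```

If the counts come out roughly even across the nodes, that would point toward something shared (such as the filesystem holding $SGE_ROOT) rather than a single bad machine.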

Rayson



On 5/16/07, Yogesh Bhanu <yogesh at gsf.de> wrote:
> Hi,
>    We had similar issues with IBM GPFS on Opterons.
> My jobs died every 10 hours. It took us some time before we could narrow
> the problem down to the filesystem.
>
> We moved SGE to NFS and all has been well since then.
>
> My two cents,
> yogesh
>
> PS: We don't use GPFS any more.
> Another thought: if your nodes are on DHCP, try changing the lease time to see if
> it helps.
>
>
> Schenker, Martin wrote:
> > Sorry,
> >
> > I'll try to dig this out.
> >
> > We're running CentOS 4.1 on 15 HP DL145 G2 (4G RAM; dual proc, dual core) with a DL385 headnode (8G RAM, dual proc, dual core). All machines have a locally mounted HP_SFS (Lustre) filesystem where the $SGE_ROOT is located.
> > Going by the timestamps, I can't see any correlation between the shepherd errors and the filesystem failing to respond. I'd like to know what could cause a shepherd error so we can try to track this down.
> >
> > Best, Martin
> >
> > -----Original Message-----
> > From: Rayson Ho [mailto:rayrayson at gmail.com]
> > Sent: 15 May 2007 23:02
> > To: users at gridengine.sunsource.net
> > Subject: Re: [GE users] Shepherd errors from time to time on SGE_6u10
> >
> >
> > Can you provide more info?
> >
> > Things like the OS version, hardware info, whether $SGE_ROOT is shared, and
> > what it is shared by would help.
> >
> > Rayson
> >
> >
> >
> > On 5/15/07, Schenker, Martin <MSchenker at illumina.com> wrote:
> >> Hi all!
> >>
> >> We're getting errors like this in our queue. They appear occasionally and bring down the whole job. They are not bound to a machine, but can appear on any of our 16 nodes. If the job is restarted, all goes well. So it looks like an intermittent problem, but I haven't found anything wrong with the setup. The queue is running 24/7; why do some jobs produce errors like this?
> >>
> >> (from the job logs, three examples)
> >>
> >> cannot get connection to "shepherd" at host "mondas7"
> >> qmake[1]: *** [s_7_0146_int.txt] Error 1
> >> qmake[1]: *** Waiting for unfinished jobs....
> >> ssh_exchange_identification: Connection closed by remote host
> >> qmake[1]: *** [s_7_0149_int.txt] Error 129
> >> qmake[1]: *** [s_7_0134_int.txt] Error 137
> >> qmake: *** [nonrecursive] Error 2
> >>
> >> cannot get connection to "shepherd" at host "mondas9"
> >> qmake[2]: *** [s_3_0057_align.txt] Error 1
> >> qmake[2]: *** Waiting for unfinished jobs....
> >> qmake[2]: *** [s_3_0071_align.txt] Error 137
> >> qmake[1]: *** [GERALD_09-05-2007] Error 2
> >> qmake: *** [Bustard1.8.28_09-05-2007] Error 2
> >>
> >> cannot get connection to "shepherd" at host "mondas11"
> >> qmake[1]: *** [Matrix/s_2_0026_02_mat.txt] Error 1
> >> qmake[1]: *** Waiting for unfinished jobs....
> >> cannot get connection to "shepherd" at host "mondas11"
> >> qmake[1]: *** [Matrix/s_2_0027_02_mat.txt] Error 1
> >> qmake[1]: *** [s_8_0033_int.txt] Error 137
> >> qmake: *** [nonrecursive] Error 2
> >>
> >> I hoped this would disappear after moving from 6u8 to 6u10, but so far we've only seen a reduction in errors. We're still getting a shepherd error every 2-3 days. Is there anything we can check, or are there any hints on how to fix this?
> >>
> >> Best, Martin
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>
> >>
> >
> >
>
>
>




