[GE users] Shepherd errors from time to time on SGE_6u10

Yogesh Bhanu yogesh at gsf.de
Wed May 16 13:27:55 BST 2007


Hi,
    We had similar issues with IBM GPFS on Opterons.
My jobs died every 10 hours. It took us some time before we could narrow
the problem down to the filesystem.

We moved SGE to NFS and all has been well since then.

My two cents,
yogesh

PS: We don't use GPFS any more.
Another thought: if your nodes are on DHCP, try changing the lease time to
see if it helps.
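
A side note on the quoted job logs below: the "Error 129" and "Error 137"
values reported by qmake look like exit statuses of the form 128 + signal
number. That is only the usual shell convention, and I am assuming it applies
to how these rules were run. The helper name below is made up for
illustration; it is a minimal sketch for decoding such values, not anything
SGE itself provides.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a make-style "Error N" exit status.

    Assumption: the common shell convention that a status above 128
    means the command was terminated by signal number (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"{status}: killed by {name}"
    return f"{status}: ordinary non-zero exit"


# Exit statuses taken from the quoted qmake logs.
for status in (1, 129, 137):
    print(describe_exit_status(status))
```

On Linux, 137 decodes to SIGKILL and 129 to SIGHUP under that convention,
which would suggest the tasks were killed from outside rather than failing
on their own. It does not say what sent the signal.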


Schenker, Martin wrote:
> Sorry,
> 
> I'll try to dig this out.
> 
> We're running CentOS 4.1 on 15 HP DL145 G2 (4G RAM; dual proc, dual core) with a DL385 headnode (8G RAM, dual proc, dual core). All machines have a locally mounted HP_SFS (Lustre) filesystem where the $SGE_ROOT is located.
> Judging by the timestamps, I can't see any correlation between the shepherd errors and the filesystem failing to respond properly. I'd like to know what could cause a shepherd error so we can try to track this down.
> 
> Best, Martin
> 
> -----Original Message-----
> From: Rayson Ho [mailto:rayrayson at gmail.com]
> Sent: 15 May 2007 23:02
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Shepherd errors from time to time on SGE_6u10
> 
> 
> Can you provide more info??
> 
> Things like OS version, hardware info, and is $SGE_ROOT shared, and
> shared by what... would help.
> 
> Rayson
> 
> 
> 
> On 5/15/07, Schenker, Martin <MSchenker at illumina.com> wrote:
>> Hi all!
>>
>> We're getting errors like these in our queue. They appear occasionally and bring down the whole job. They are not bound to one machine, but can appear on any of our 16 nodes. If the job is restarted, all goes well. So it looks like an intermittent problem, but I haven't found anything wrong with the setup. The queue is running 24/7. Why do some jobs produce errors like these?
>>
>> (from the job logs, three examples)
>>
>> cannot get connection to "shepherd" at host "mondas7"
>> qmake[1]: *** [s_7_0146_int.txt] Error 1
>> qmake[1]: *** Waiting for unfinished jobs....
>> ssh_exchange_identification: Connection closed by remote host
>> qmake[1]: *** [s_7_0149_int.txt] Error 129
>> qmake[1]: *** [s_7_0134_int.txt] Error 137
>> qmake: *** [nonrecursive] Error 2
>>
>> cannot get connection to "shepherd" at host "mondas9"
>> qmake[2]: *** [s_3_0057_align.txt] Error 1
>> qmake[2]: *** Waiting for unfinished jobs....
>> qmake[2]: *** [s_3_0071_align.txt] Error 137
>> qmake[1]: *** [GERALD_09-05-2007] Error 2
>> qmake: *** [Bustard1.8.28_09-05-2007] Error 2
>>
>> cannot get connection to "shepherd" at host "mondas11"
>> qmake[1]: *** [Matrix/s_2_0026_02_mat.txt] Error 1
>> qmake[1]: *** Waiting for unfinished jobs....
>> cannot get connection to "shepherd" at host "mondas11"
>> qmake[1]: *** [Matrix/s_2_0027_02_mat.txt] Error 1
>> qmake[1]: *** [s_8_0033_int.txt] Error 137
>> qmake: *** [nonrecursive] Error 2
>>
>> I hoped this would disappear after moving from 6u8 to 6u10, but so far we've only seen a reduction in errors. We're still getting a shepherd error every 2-3 days. Is there anything we can check, or are there any hints on how to fix this?
>>
>> Best, Martin
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
> 
> 

