[GE users] Scheduler dies like a hell

Ron Chen ron_chen_123 at yahoo.com
Fri May 20 01:37:41 BST 2005


Are you using classic spooling or Berkeley DB
spooling?

With classic spooling, the spool files may get corrupted
when the machine crashes. And when qmaster reads in the
corrupted files, its own data structures may end up
corrupted as well.

IIRC, Berkeley DB handles recovery itself, but I have
never played with it myself :)
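
One way to tell which spooling method an installation uses, as far
as I know, is the spooling_method line in
$SGE_ROOT/<cell>/common/bootstrap. Below is a minimal Python sketch
that reads it. The file location and key name are stated from memory
of SGE 6.x and should be checked against your install; the /opt/SGE
fallback is taken from the sge_schedd path quoted below, and the
"default" cell name is an assumption.

```python
import os
from pathlib import Path
from typing import Optional


def spooling_method(bootstrap_text: str) -> Optional[str]:
    """Return the value of the spooling_method line, or None if absent."""
    for line in bootstrap_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Each setting is assumed to be "key<whitespace>value".
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "spooling_method":
            return parts[1].strip()
    return None


if __name__ == "__main__":
    # Fallback values are assumptions, not something SGE guarantees.
    root = os.environ.get("SGE_ROOT", "/opt/SGE")
    cell = os.environ.get("SGE_CELL", "default")
    path = Path(root) / cell / "common" / "bootstrap"
    if path.exists():
        print(spooling_method(path.read_text()))
    else:
        print(f"no bootstrap file at {path}")
```

A result of "classic" would mean the flat-file spool is in use, which
is the case where crash corruption is worth checking for.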

 -Ron


--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> Hi, Mac,
> Thank you very much for your advice!
> I'll try it. I think one of the running or finished jobs
> left a bad record somewhere
> (for example, in the jobs directory).
> Best regards,
> v
> 
> > -----Original Message-----
> > From: McCalla, Mac [mailto:macmccalla at hess.com] 
> > Sent: Thursday, May 19, 2005 15:12
> > To: users at gridengine.sunsource.net
> > Subject: RE: [GE users] Scheduler dies like a hell
> > 
> > 
> > Hi,
> > 
> > Some things to look at: are there any messages in
> > $SGE_ROOT/......../qmaster/schedd/messages ? To get more info
> > about what the scheduler is doing while it is running, see the
> > scheduler params profile and monitor; you can set them equal to 1
> > to turn on some scheduler diagnostics (see man sched_conf).
> >
> > To extend the timeout value for the scheduler, you can set
> > qmaster_params SCHEDULER_TIMEOUT to some value greater than
> > 600 (seconds).
> >
> > You can also use the system command strace to trace scheduler
> > activity while it is running, to perhaps get a better idea of
> > what it is spending its time doing.
> > 
> > Hope this helps,
> > 
> > mac mccalla  
> > 
> > -----Original Message-----
> > From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
> > Sent: Thursday, May 19, 2005 12:00 PM
> > To: users at gridengine.sunsource.net
> > Subject: [GE users] Scheduler dies like a hell
> > 
> > Hi, everybody,
> > 
> > I am asking for your help and ideas on what could be done to
> > restore normal operation of the scheduler. First, what happened:
> > a few times during the last week our main server died, and I
> > needed to reboot it and even replace it. Jobs which used automount
> > continued to run, but since yesterday or the day before, the
> > scheduler daemon keeps dying. I tried to restart sgemaster but it
> > did not help. Now, when the daemon dies, I start it manually by
> > simply typing:
> >
> > /opt/SGE/bin/lx24-x86/sge_schedd
> >
> > but after some time it dies again. Please advise: what could it be?
> >
> > Below please find some info from the messages file:
> > 
> > 
> > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
> > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
> > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
> > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
> > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> > 05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
> > 05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
> > 05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> > 05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
> > 05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> > 05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
> > 05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> > 05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
> > 05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
> > 05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> > 
> > 
> > At 01:02:37 I restarted sgemaster.
> > 
> > Thank you very much for any information and help.
> > 
> > regards, viktor
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
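
The qmaster messages quoted above all follow the same
date time|component|host|level|text layout, so the scheduler-related
entries can be pulled out mechanically to see how the reregistrations
and acknowledge timeouts line up. The following is a rough Python
sketch, not an SGE tool: the field names, the Entry type and the
"schedd"/"scheduler" keyword filter are my own assumptions, and the
sample lines are copied from the log above.

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple, Optional


class Entry(NamedTuple):
    when: datetime
    component: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Split one 'date time|component|host|level|text' line, or return None."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    try:
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return Entry(when, *parts[1:])


def scheduler_events(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield the entries whose text mentions the scheduler event client."""
    for line in lines:
        entry = parse_line(line)
        if entry and ("schedd" in entry.message or "scheduler" in entry.message):
            yield entry


# Sample lines taken from the messages file quoted in this thread.
SAMPLE = [
    '05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3',
    '05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
    '05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
]

if __name__ == "__main__":
    for entry in scheduler_events(SAMPLE):
        print(entry.when, entry.level, entry.message)
```

To run it against a real file, pass an open file object to
scheduler_events() instead of SAMPLE.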



		
