[GE users] Scheduler dies like a hell

Viktor Oudovenko udo at physics.rutgers.edu
Fri May 20 02:31:43 BST 2005


Hi, Ron,

I am using classic spooling.
Which file should I check for corruption? Can I edit it manually?
Thank you very much in advance.
v
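
One rough way to narrow down which spool files to inspect first, sketched in Python. It assumes only that classic spooling keeps its state as ordinary files under a spool directory that you pass in. The function name find_suspect_files and the two checks (zero-length or unreadable) are illustrative and are not something Grid Engine provides. An empty file after a crash is only a hint, and a file can be damaged without tripping either check.

```python
import os
import sys


def find_suspect_files(root):
    """Return (path, reason) pairs for files that are empty or unreadable.

    Heuristic only: it knows nothing about Grid Engine file formats.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == 0:
                    suspects.append((path, "empty"))
                    continue
                # Reading one byte is enough to surface I/O or permission errors.
                with open(path, "rb") as fh:
                    fh.read(1)
            except OSError as exc:
                suspects.append((path, "unreadable: %s" % exc.strerror))
    return suspects


# The directory comes from the command line; no spool path is assumed here.
if __name__ == "__main__" and len(sys.argv) == 2:
    for path, reason in find_suspect_files(sys.argv[1]):
        print("%s\t%s" % (reason, path))
```

Run it against a copy of the spool directory if possible, and back up any file before editing it by hand.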

> -----Original Message-----
> From: Ron Chen [mailto:ron_chen_123 at yahoo.com] 
> Sent: Thursday, May 19, 2005 20:38
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] Scheduler dies like a hell
> 
> 
> Are you using classic spooling or Berkeley DB
> spooling?
> 
> With classic spooling, when the machine crashes, the
> files may get corrupted. And when qmaster reads in the
> corrupted files, it may also corrupt the qmaster's data structures.
> 
> IIRC, Berkeley DB handles recovery itself, but I have
> never played with it myself :)
> 
>  -Ron
> 
> 
> --- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> > Hi, Mac,
> > Thank you very much for your advice!
> > I'll try it. I think one of the running or finished jobs
> > wrote a bad record somewhere
> > (such as the jobs directory).
> > Best regards,
> > v
> > 
> > > -----Original Message-----
> > > From: McCalla, Mac [mailto:macmccalla at hess.com]
> > > Sent: Thursday, May 19, 2005 15:12
> > > To: users at gridengine.sunsource.net
> > > Subject: RE: [GE users] Scheduler dies like a hell
> > > 
> > > 
> > > Hi,
> > > 
> > > Some things to look at: are there any messages in
> > > $SGE_ROOT/......../qmaster/schedd/messages? To get more information
> > > about what the scheduler is doing while it is running, see the
> > > scheduler params profile and monitor; you can set them equal to 1
> > > to turn on some scheduler diagnostics. See man sched_conf.
> > >
> > > To extend the timeout value for the scheduler, you can set
> > > qmaster_params SCHEDULER_TIMEOUT to some value greater than
> > > 600 (seconds).
> > > You can also use the system command strace to trace scheduler
> > > activity while it is running, to perhaps get a better idea of
> > > what it is spending its time doing.
> > > 
> > > Hope this helps,
> > > 
> > > mac mccalla
> > > 
> > > -----Original Message-----
> > > From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
> > > Sent: Thursday, May 19, 2005 12:00 PM
> > > To: users at gridengine.sunsource.net
> > > Subject: [GE users] Scheduler dies like a hell
> > > 
> > > Hi, everybody,
> > > 
> > > I am asking for your help and ideas on what could be done to restore
> > > normal operation of the scheduler. First, what happened: a few times
> > > during the last week our main server died, and I needed to reboot it
> > > and even replace it. Jobs which used automount continued to run. But
> > > since yesterday or the day before, the scheduler daemon keeps dying.
> > > I tried to restart sge_master but it did not help. Now, when the
> > > daemon dies, I start it manually by simply typing:
> > > 
> > > /opt/SGE/bin/lx24-x86/sge_schedd
> > > 
> > > but after some time it dies again. Please advise: what could it be?
> > > 
> > > Below please find some info from the messages file:
> > > 
> > > 
> > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
> > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
> > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
> > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
> > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
> > > 05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
> > > 05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> > > 05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
> > > 05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> > > 05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
> > > 05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> > > 05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
> > > 05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
> > > 05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> > > 
> > > 
> > > At 01:02:37 I restarted sgemaster.
> > > 
> > > thank you very much for any information and help.
> > > 
> > > regards, viktor
> > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > > 
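
The qmaster messages quoted above follow a pipe-separated layout (timestamp, daemon, host, level, text), so the scheduler timeouts can be picked out mechanically. Below is a minimal Python sketch of that idea. The field names, the helper names parse_line and summarize, and the reading of the fourth field as a severity letter are assumptions drawn from the excerpt, not from Grid Engine documentation. The sample lines are copied from the log in this thread.

```python
from collections import Counter
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str  # e.g. "05/19/2005 05:17:19"
    daemon: str     # e.g. "qmaster"
    host: str       # e.g. "rupc-cs04b"
    level: str      # single letter; I, W and E appear in the excerpt
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one 'timestamp|daemon|host|level|message' line; None if malformed."""
    parts = line.rstrip("\n").split("|", 4)  # message itself may contain '|'
    if len(parts) != 5:
        return None
    return LogEntry(*(p.strip() for p in parts))


def summarize(lines):
    """Count entries per level and collect scheduler acknowledge-timeout times."""
    levels = Counter()
    timeouts = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        levels[entry.level] += 1
        if entry.message.startswith("acknowledge timeout") and "schedd" in entry.message:
            timeouts.append(entry.timestamp)
    return levels, timeouts


SAMPLE = [
    '05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3',
    '05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
    '05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
]

if __name__ == "__main__":
    levels, timeouts = summarize(SAMPLE)
    print(dict(levels))
    print(timeouts)
```

To use it on a real log, pass an open file object to summarize() in place of SAMPLE. Lines that do not split into five fields are skipped, so a wrapped or truncated entry will not stop the scan.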

