[GE users] Scheduler dies like a hell

Ron Chen ron_chen_123 at yahoo.com
Fri May 20 03:01:56 BST 2005


It is not easy to find out which file gets corrupted
:(

One thing you can try is to move spooled job files (in
default/spool/qmaster/jobs) to a backup directory.
Also, you can use qconf to dump the configuration for
the queues/users/hosts, and see if the values "make
sense".
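
If you want a rough first pass before moving anything, here is a
minimal Python sketch that walks the spool directory and lists files
that deserve a closer look. It is only a heuristic, not an SGE tool:
the function name, the "empty file" rule and the optional modification
time cutoff are my own assumptions, and it does not look at file
contents at all. A file can be corrupt without being empty.

```python
import os


def suspicious_spool_files(spool_dir, since_epoch=None):
    """Return (path, reason) pairs for files worth checking by hand.

    Heuristic only: flags zero-length files and, if since_epoch is
    given, files modified at or after that time (e.g. the crash time).
    """
    findings = []
    for root, dirs, files in os.walk(spool_dir):
        dirs.sort()  # walk subdirectories in a stable order
        for name in sorted(files):
            path = os.path.join(root, name)
            st = os.stat(path)
            if st.st_size == 0:
                findings.append((path, "empty file"))
            elif since_epoch is not None and st.st_mtime >= since_epoch:
                findings.append((path, "modified since cutoff"))
    return findings


if __name__ == "__main__":
    # Hypothetical location; adjust to your own cell's spool directory.
    for path, reason in suspicious_spool_files("default/spool/qmaster"):
        print(f"{reason}: {path}")
```

Run it against a copy of the spool directory if you can, and stop
qmaster before moving any of the files it lists.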

Of course the best way to fix this is to restore from
backup!

 -Ron


--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> Hi, Ron,
> 
> I am using classic spooling.
> Which file should I look for corruption? Can I edit
> it manually?
> Thank you very much in advance.
> v
> 
> > -----Original Message-----
> > From: Ron Chen [mailto:ron_chen_123 at yahoo.com] 
> > Sent: Thursday, May 19, 2005 20:38
> > To: users at gridengine.sunsource.net
> > Subject: RE: [GE users] Scheduler dies like a hell
> > 
> > 
> > Are you using classic spooling or Berkeley DB
> > spooling?
> > 
> > With classic spooling, when the machine crashes, the
> > files may get corrupted. And when qmaster reads in the
> > corrupted files, it may also corrupt the qmaster's
> > data structures.
> > 
> > IIRC, Berkeley DB handles recovery itself, but I have
> > never played with it myself :)
> > 
> >  -Ron
> > 
> > 
> > --- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> > > Hi, Mac,
> > > Thank you very much for your advice!
> > > I'll try it. I think one of the running or finished jobs
> > > wrote a bad record somewhere
> > > (e.g. in the jobs directory).
> > > Best regards,
> > > v
> > > 
> > > > -----Original Message-----
> > > > From: McCalla, Mac [mailto:macmccalla at hess.com]
> > > > Sent: Thursday, May 19, 2005 15:12
> > > > To: users at gridengine.sunsource.net
> > > > Subject: RE: [GE users] Scheduler dies like a hell
> > > > 
> > > > 
> > > > Hi,
> > > > 
> > > > Some things to look at: are there any messages in
> > > > $SGE_ROOT/......../qmaster/schedd/messages ? To get more
> > > > info about what the scheduler is doing while it is running,
> > > > see the info about the scheduler params profile and monitor;
> > > > you can set them equal to 1 to turn on some scheduler
> > > > diagnostics (see man sched_conf).
> > > > 
> > > > To extend the timeout value for the scheduler, you can set
> > > > qmaster_params SCHEDULER_TIMEOUT to some value greater than
> > > > 600 (seconds).
> > > > You can also use the system command strace to get a trace of
> > > > scheduler activity while it is running, to perhaps get a
> > > > better idea of what it is spending its time doing.
> > > > 
> > > > Hope this helps,
> > > > 
> > > > mac mccalla
> > > > 
> > > > -----Original Message-----
> > > > From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
> > > > Sent: Thursday, May 19, 2005 12:00 PM
> > > > To: users at gridengine.sunsource.net
> > > > Subject: [GE users] Scheduler dies like a hell
> > > > 
> > > > Hi, everybody,
> > > > 
> > > > I am asking for your help and ideas on what could be done
> > > > to restore normal operation of the scheduler. First, what
> > > > happened: a few times during the last week our main server
> > > > died, and I needed to reboot it and even replace it. The jobs
> > > > which used automount continued to run. But since yesterday or
> > > > the day before yesterday the scheduler daemon keeps dying. I
> > > > tried to restart sge_master but it did not help. Now when the
> > > > daemon dies I start it manually, simply by typing:
> > > > 
> > > > /opt/SGE/bin/lx24-x86/sge_schedd
> > > > 
> > > > but after some time it dies again. Please advise what it
> > > > could be.
> > > > 
> > > > Below please find some info from the messages file:
> > > > 
> > > > 
> > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
> > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
> > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
> > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
> > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
> > > > 05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
> > > > 05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> > > > 05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
> > > > 05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> > > > 05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
> > > > 05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> > > > 05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
> > > > 05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
> > > > 05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> 
=== message truncated ===
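
To see how often the scheduler times out and re-registers in the
qmaster messages quoted above, a small Python sketch like the one
below can help. It assumes each entry sits on one line in the
"date time|daemon|host|level|text" layout seen in the quoted log;
the function names and the two substrings it matches on are my own
choices for illustration, not part of Grid Engine.

```python
from datetime import datetime

# A few entries copied from the quoted qmaster messages file.
SAMPLE_LINES = [
    '05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host '
    'sub04n203 assumedly after job because: job 21171.1 died through '
    'signal TERM (15)',
    '05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after '
    '600 seconds for event client (schedd:1) on host "rupc-cs04b"',
    '05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after '
    '600 seconds for event client (schedd:1) on host "rupc-cs04b"',
]


def parse_message_line(line):
    """Split one log line into its five fields, or return None."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, daemon, host, level, text = parts
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {"time": when, "daemon": daemon, "host": host,
            "level": level, "text": text}


def scheduler_events(lines):
    """Return (time, kind) pairs for scheduler timeouts and re-registrations."""
    events = []
    for line in lines:
        entry = parse_message_line(line)
        if entry is None:
            continue  # skip lines that do not match the assumed layout
        text = entry["text"]
        if "acknowledge timeout" in text:
            events.append((entry["time"], "timeout"))
        elif 'event client "scheduler"' in text and "reregistered" in text:
            events.append((entry["time"], "reregistered"))
    return events


if __name__ == "__main__":
    for when, kind in scheduler_events(SAMPLE_LINES):
        print(when.isoformat(sep=" "), kind)
```

On a real system you would pass it the lines of the messages file
instead of SAMPLE_LINES, for example scheduler_events(open(path)).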







