[GE users] Scheduler dies like a hell

Viktor Oudovenko udo at physics.rutgers.edu
Fri May 20 03:17:58 BST 2005


> It is not easy to find out which file gets corrupted
> :(

I just switched off reporting, but it did not help.
Should I look into the accounting file?

 
> One thing you can try is to move spooled job files (in
> default/spool/qmaster/jobs) to a backup directory.

But won't the jobs be lost? Or can I move them back after restarting
sgemaster, so that the jobs reappear?
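
For reference, a minimal sketch of the file handling only: it moves every
entry of a jobs directory into a backup directory and can move them back
the same way. The backup location and the helper name move_entries are
hypothetical, and the sketch assumes qmaster is stopped while the files are
moved. Whether qmaster picks the jobs up again after they are moved back is
not confirmed anywhere in this thread.

```python
import shutil
import tempfile
from pathlib import Path


def move_entries(src, dst):
    """Move every entry in src into dst and return the moved names, sorted."""
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(src.iterdir()):
        shutil.move(str(entry), str(dst / entry.name))
        moved.append(entry.name)
    return moved


if __name__ == "__main__":
    # Demonstration on a throwaway directory tree, not on a live cell.
    with tempfile.TemporaryDirectory() as tmp:
        jobs = Path(tmp) / "default" / "spool" / "qmaster" / "jobs"
        backup = Path(tmp) / "jobs.backup"  # hypothetical backup location
        jobs.mkdir(parents=True)
        (jobs / "example_job").write_text("dummy spool data\n")

        print(move_entries(jobs, backup))  # set the spooled files aside
        print(move_entries(backup, jobs))  # put them back afterwards
```

On a real installation the two paths would point at the cell's
spool/qmaster/jobs directory and at a backup directory of your choice.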


> Also, you can use qconf to dump the configuration for
> the queues/users/hosts, and see if the values "make
> sense".

Could you please give me the command? I usually use qmon to manage SGE.
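
For reference, a hedged sketch of what such a dump could look like. It only
calls read-only qconf "show" options (-sconf, -ssconf, -sql, -sq, -sel, -se,
-suserl) and prints their output for manual inspection. The selection of
options and the helper names are my own choice, not something given in this
thread, and the script assumes qconf is in PATH with the SGE environment
already set up.

```python
import shutil
import subprocess

# Read-only qconf "show" options; each prints one part of the configuration.
SHOW_COMMANDS = [
    ["qconf", "-sconf"],   # global cluster configuration
    ["qconf", "-ssconf"],  # scheduler configuration
    ["qconf", "-sql"],     # list of cluster queues
    ["qconf", "-sel"],     # list of execution hosts
    ["qconf", "-suserl"],  # list of users
]


def per_object_commands(flag, names):
    """Build one 'qconf <flag> <name>' command per object name."""
    return [["qconf", flag, name] for name in names]


def run(cmd):
    """Run a command and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    if shutil.which("qconf") is None:
        print("qconf not found in PATH")
    else:
        for cmd in SHOW_COMMANDS:
            print("#", " ".join(cmd))
            print(run(cmd))
        # Dump every queue and every execution host in full.
        for flag, list_flag in (("-sq", "-sql"), ("-se", "-sel")):
            names = run(["qconf", list_flag]).split()
            for cmd in per_object_commands(flag, names):
                print("#", " ".join(cmd))
                print(run(cmd))
```

Redirecting the output to a file gives a text snapshot that can be compared
against a later dump.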
> 
> Of course the best way to fix this is to restore from
> backup!

A few days ago I made a copy of everything, so I can check whether the same
problem already existed then.
And one more question: can one do a backup with classic spooling? I came
across a discussion somewhere saying that the backup command did not work.
Am I wrong?


Thank you very much for any advice.
Best regards,
v
 
>  -Ron
> 
> 
> --- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> > Hi, Ron,
> > 
> > I am using classic spooling.
> > Which file should I check for corruption? Can I edit
> > it manually?
> > Thank you very much in advance.
> > v
> > 
> > > -----Original Message-----
> > > From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
> > > Sent: Thursday, May 19, 2005 20:38
> > > To: users at gridengine.sunsource.net
> > > Subject: RE: [GE users] Scheduler dies like a hell
> > > 
> > > 
> > > Are you using classic spooling or Berkeley DB
> > > spooling?
> > > 
> > > With classic spooling, when the machine crashes, the
> > > files may get corrupted. And when qmaster reads in the
> > > corrupted files, it may also corrupt the qmaster's
> > > data structures.
> > > 
> > > IIRC, Berkeley DB handles recovery itself, but I have
> > > never played with it myself :)
> > > 
> > >  -Ron
> > > 
> > > 
> > > --- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> > > > Hi, Mac,
> > > > Thank you very much for your advice!
> > > > I'll try it. I think one of the running or finished jobs
> > > > wrote a bad record somewhere
> > > > (for example, in the jobs directory).
> > > > Best regards,
> > > > v
> > > > 
> > > > > -----Original Message-----
> > > > > From: McCalla, Mac [mailto:macmccalla at hess.com]
> > > > > Sent: Thursday, May 19, 2005 15:12
> > > > > To: users at gridengine.sunsource.net
> > > > > Subject: RE: [GE users] Scheduler dies like a hell
> > > > > 
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > Some things to look at: are there any messages in
> > > > > $SGE_ROOT/......../qmaster/schedd/messages?
> > > > > To get more info about what the scheduler is doing while it is
> > > > > running, see the info about the scheduler params profile and
> > > > > monitor; you can set them equal to 1 to turn on some scheduler
> > > > > diagnostics (see man sched_conf).
> > > > > 
> > > > > To extend the timeout value for the scheduler, you can set
> > > > > qmaster_params SCHEDULER_TIMEOUT to some value greater than
> > > > > 600 (seconds).
> > > > > You can also use the system command strace to get a trace of
> > > > > scheduler activity while it is running, to perhaps get a
> > > > > better idea of what it is spending its time doing.
> > > > > 
> > > > > Hope this helps,
> > > > > 
> > > > > mac mccalla
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
> > > > > Sent: Thursday, May 19, 2005 12:00 PM
> > > > > To: users at gridengine.sunsource.net
> > > > > Subject: [GE users] Scheduler dies like a hell
> > > > > 
> > > > > Hi, everybody,
> > > > > 
> > > > > I am asking for your help and ideas on what could be done to
> > > > > restore normal operation of the scheduler. First, what happened:
> > > > > a few times during the last week our main server died, and I
> > > > > needed to reboot it and even replace it. Jobs which used
> > > > > automount kept running. But since yesterday or the day before
> > > > > yesterday the scheduler daemon has been dying. I tried to restart
> > > > > sge_master, but it did not help. Now, when the daemon dies, I
> > > > > start it manually by simply typing:
> > > > > 
> > > > > /opt/SGE/bin/lx24-x86/sge_schedd
> > > > > 
> > > > > but after some time it dies again. Please advise: what could it be?
> > > > > 
> > > > > Below please find some info from the messages file:
> > > > > 
> > > > > 
> > > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
> > > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
> > > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
> > > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
> > > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> > > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> > > > > 05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
> > > > > 05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
> > > > > 05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> > > > > 05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
> > > > > 05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> > > > > 05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
> > > > > 05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> > > > > 05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
> > > > > 05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
> > > > > 05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> > 
> === message truncated ===
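
The log excerpt quoted above uses one record per line in the form
timestamp|daemon|host|level|text. As a rough sketch (assuming each record
sits on a single line, as it would in the messages file itself, and using
hypothetical helper names), the entries can be counted per level and the
scheduler "acknowledge timeout" events picked out like this:

```python
from collections import Counter
from typing import NamedTuple


class Entry(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str  # E, W or I in the excerpt quoted above
    text: str


def parse_line(line):
    """Split one 'timestamp|daemon|host|level|text' line; None if malformed."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return Entry(*parts)


def summarize(lines):
    """Count entries per level and collect 'acknowledge timeout' timestamps."""
    levels = Counter()
    timeouts = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip anything that is not a complete record
        levels[entry.level] += 1
        if entry.text.startswith("acknowledge timeout"):
            timeouts.append(entry.timestamp)
    return levels, timeouts


# A few records copied from the excerpt in this thread.
SAMPLE = [
    '05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3',
    '05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
    '05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host '
    'sub04n29 assumedly after job because: job 20222.1 died through signal '
    'KILL (9)',
    '05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
]

if __name__ == "__main__":
    levels, timeouts = summarize(SAMPLE)
    print(dict(levels))
    print(timeouts)
```

Feeding it the lines of the real messages file instead of SAMPLE shows how
often, and at what times, the scheduler failed to acknowledge within the
timeout.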
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 

