[GE users] Scheduler dies like a hell

Viktor Oudovenko udo at physics.rutgers.edu
Thu May 19 20:28:50 BST 2005


Hi, Mac,
Thank you very much for your advice!
I'll try it. I think one of the running or finished jobs left a bad record
somewhere (for example, in the jobs directory).
Best regards,
v

> -----Original Message-----
> From: McCalla, Mac [mailto:macmccalla at hess.com] 
> Sent: Thursday, May 19, 2005 15:12
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] Scheduler dies like a hell
> 
> 
> Hi,
> 
> Some things to look at: are there any messages in 
> $SGE_ROOT/......../qmaster/schedd/messages? To get more 
> info about what the scheduler is doing while it is running, see 
> the scheduler params profile and monitor; you can set 
> them equal to 1 to turn on some scheduler diagnostics 
> (see man sched_conf).
> To extend the timeout value for the scheduler, you can set 
> qmaster_params SCHEDULER_TIMEOUT to some value greater than 
> 600 (seconds). 
> You can also use the system command strace to trace 
> scheduler activity while it is running, to perhaps get a 
> better idea of what it is spending its time doing.
> 
> Hope this helps,
> 
> mac mccalla  
> 
> -----Original Message-----
> From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu] 
> Sent: Thursday, May 19, 2005 12:00 PM
> To: users at gridengine.sunsource.net
> Subject: [GE users] Scheduler dies like a hell
> 
> Hi, everybody,
> 
> I am asking for your help and ideas on what could be done to restore 
> normal operation of the scheduler. First, what happened. A few 
> times during the last week our main server died, and I needed to 
> reboot it and even replace it. Jobs which used automount 
> kept running. But since yesterday or the day before yesterday the 
> scheduler daemon keeps dying. I tried to restart sge_master but it 
> did not help. Now when the daemon dies I start it manually by simply typing:
> 
> /opt/SGE/bin/lx24-x86/sge_schedd
> 
> but after some time it dies again. Please advise what it could be.
> 
> Below please find some info from the messages file:
> 
> 
> 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
> 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
> 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
> 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
> 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> 05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> 05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
> 05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
> 05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> 05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
> 05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> 05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
> 05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> 05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
> 05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
> 05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> 
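The scheduler-related entries in a log like the one quoted here can be pulled out with a short script. The following is only a minimal sketch in Python, assuming each record is a single line of the form date time|daemon|host|level|text, as in the excerpt. The names Record, parse_messages, scheduler_events and time_until_timeout are hypothetical helpers made up for this illustration, not part of Grid Engine. SAMPLE holds a few lines copied from the excerpt; in practice you would pass an open messages file instead.

```python
from datetime import datetime
from typing import Iterable, Iterator, List, NamedTuple


class Record(NamedTuple):
    """One line of the messages file, split on '|' (hypothetical helper type)."""
    timestamp: str
    daemon: str
    host: str
    level: str
    text: str


# A few records copied from the log excerpt in this thread.
SAMPLE = [
    '05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3',
    '05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
    '05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
]


def parse_messages(lines: Iterable[str]) -> Iterator[Record]:
    """Yield a Record for every line with five '|'-separated fields; skip the rest."""
    for line in lines:
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) == 5:
            yield Record(*parts)


def scheduler_events(records: Iterable[Record]) -> List[Record]:
    """Keep only records whose text mentions the scheduler event client."""
    markers = ('event client "scheduler"', "event client (schedd:")
    return [r for r in records if any(m in r.text for m in markers)]


def time_until_timeout(records: Iterable[Record]) -> list:
    """For each re-registration, return the time until the next acknowledge timeout."""
    fmt = "%m/%d/%Y %H:%M:%S"
    gaps = []
    started = None
    for r in records:
        if "reregistered" in r.text:
            started = datetime.strptime(r.timestamp, fmt)
        elif "acknowledge timeout" in r.text and started is not None:
            gaps.append(datetime.strptime(r.timestamp, fmt) - started)
            started = None
    return gaps


if __name__ == "__main__":
    events = scheduler_events(parse_messages(SAMPLE))
    for r in events:
        print(r.timestamp, r.level, r.text)
    for gap in time_until_timeout(events):
        print("re-registration to timeout:", gap)
```

Since qmaster reports the timeout only after 600 seconds without an acknowledgement, the scheduler presumably stopped responding some minutes before each timeout entry, so the gaps are an upper bound on how long it stayed healthy.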
> 
> At 01:02:37 I restarted sgemaster.
> 
> Thank you very much for any information and help.
> 
> regards, viktor
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 

