[GE users] Scheduler dies like a hell

McCalla, Mac macmccalla at hess.com
Thu May 19 20:11:47 BST 2005


Hi,

Some things to look at: are there any messages in
$SGE_ROOT/......../qmaster/schedd/messages ?

To get more information about what the scheduler is doing while it is
running, look at the scheduler params profile and monitor. Setting them
equal to 1 turns on some scheduler diagnostics; see man sched_conf.

To extend the timeout value for the scheduler, you can set the
qmaster_params entry SCHEDULER_TIMEOUT to some value greater than 600
(seconds).

You can also use the system command strace to trace the scheduler's
activity while it is running, which may give a better idea of what it is
spending its time on.
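
If the messages file is long, a small script can pull out just the
scheduler-related entries. The following is a minimal Python sketch, not
part of Grid Engine. It assumes the pipe-separated layout seen in the
qmaster messages quoted in this thread (timestamp|daemon|host|level|text).
The names parse_line and scheduler_entries and the keyword list are
illustrative assumptions, and the sample lines are copied from the quoted
log with the wrapped lines joined.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "timestamp daemon host level text")

# Assumption: these keywords are picked from the messages in this thread.
KEYWORDS = ("schedd", "scheduler")


def parse_line(line):
    """Split one 'timestamp|daemon|host|level|text' line.

    Returns None for lines that do not have five fields,
    such as blank lines or wrapped continuation lines.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return Entry(*parts)


def scheduler_entries(lines, levels=("E", "W")):
    """Yield entries at the given levels whose text mentions the scheduler."""
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry.level in levels and any(k in entry.text for k in KEYWORDS):
            yield entry


sample = [
    '05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3',
    '05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host '
    'sub04n203 assumedly after job because: job 21171.1 died through '
    'signal TERM (15)',
    '05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
]

for e in scheduler_entries(sample):
    print(e.timestamp, e.level, e.text)
```

To run it against a real file, pass an open file object instead of the
sample list, for example scheduler_entries(open(path)) with path pointing
at the messages file on your system.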

Hope this helps,

mac mccalla  

-----Original Message-----
From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu] 
Sent: Thursday, May 19, 2005 12:00 PM
To: users at gridengine.sunsource.net
Subject: [GE users] Scheduler dies like a hell

Hi, everybody,

I am asking for your help and ideas on what could be done to restore
normal operation of the scheduler. First, what happened: a few
times during the last week our main server died, and I needed to reboot
it and even replace it. Jobs which used automount kept running
anyway. But since yesterday or the day before, the scheduler daemon
keeps dying. I tried to restart sge_master but it did not help. Now, when
the daemon dies, I start it manually by simply typing:

/opt/SGE/bin/lx24-x86/sge_schedd

but after some time it dies again. Please advise what it could be.

Below please find some info from the messages file:


05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"


At 01:02:37 I restarted sgemaster.
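
The timestamps in the log also show how long the scheduler stayed
registered before qmaster reported it as unresponsive. The following is a
minimal Python sketch of that arithmetic, using only the four relevant
entries from the log (wrapped lines joined). The function name
uptime_before_timeout and the matched strings are illustrative
assumptions, not anything provided by Grid Engine.

```python
from datetime import datetime

# Timestamp layout as it appears in the messages file (month/day/year).
STAMP = "%m/%d/%Y %H:%M:%S"


def uptime_before_timeout(lines):
    """Pair each 'reregistered' entry with the next 'acknowledge timeout'.

    Returns a list of (registered_at, timeout_at, elapsed) tuples.
    """
    spans = []
    registered_at = None
    for line in lines:
        stamp, _, rest = line.partition("|")
        try:
            when = datetime.strptime(stamp, STAMP)
        except ValueError:
            continue  # not the start of a log entry
        if "reregistered" in rest:
            registered_at = when
        elif "acknowledge timeout" in rest and registered_at is not None:
            spans.append((registered_at, when, when - registered_at))
            registered_at = None
    return spans


log = [
    '05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
    '05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
]

for start, end, elapsed in uptime_before_timeout(log):
    print(start.time(), "->", end.time(), "=", elapsed)
# 01:11:06 -> 05:17:19 = 4:06:13
# 09:30:37 -> 12:04:51 = 2:34:14
```

Since each timeout entry is written after 600 seconds without an
acknowledgement, this suggests the scheduler stopped responding about ten
minutes before each of those two entries.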

Thank you very much for any information and help.

regards, viktor


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net






