[GE users] Scheduler dies like a hell

Viktor Oudovenko udo at physics.rutgers.edu
Thu May 19 18:00:06 BST 2005


Hi, everybody,

I am asking for your help and ideas on how to restore normal
operation of the scheduler. First, what happened: a few times
during the last week our main server died, and I had to reboot it
and even replace it. Jobs that used automount nevertheless kept
running. But since yesterday or the day before, the scheduler daemon
keeps dying. I tried restarting sge_master, but it did not help. Now,
whenever the daemon dies, I start it manually by simply typing:

/opt/SGE/bin/lx24-x86/sge_schedd

but after some time it dies again. Please advise what the cause could be.

Below please find some information from the messages file:


05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"


At 01:02:37 I restarted sgemaster.
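
To make the scheduler-related entries easier to spot, the messages file
could be filtered with a small script. The following is only a minimal
Python sketch: it assumes the pipe-separated layout seen in the excerpt
(date time|component|host|level|text) and that each entry sits on a single
line, so entries wrapped by a mail client would need rejoining first. The
function names are hypothetical and not part of Grid Engine.

```python
from datetime import datetime


def parse_line(line):
    """Split one messages-file line into (timestamp, component, host, level, text).

    Returns None for lines that do not match the assumed layout,
    for example continuation lines of a wrapped entry.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    try:
        stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return (stamp, parts[1], parts[2], parts[3], parts[4])


def scheduler_events(lines):
    """Return the parsed entries whose text mentions the scheduler event client."""
    events = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        text = entry[4]
        if "schedd" in text or "scheduler" in text:
            events.append(entry)
    return events


if __name__ == "__main__":
    # Two entries copied from the excerpt above, used as sample input.
    sample = [
        '05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3',
        '05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after '
        '600 seconds for event client (schedd:1) on host "rupc-cs04b"',
    ]
    for stamp, _component, _host, level, text in scheduler_events(sample):
        print(stamp.isoformat(), level, text)
```

Run against the real messages file (for example by passing `open(path)`
to `scheduler_events`), this would list only the reregistration and
acknowledge-timeout entries with their timestamps.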

Thank you very much for any information and help.

Regards, Viktor





