Opened 8 years ago

Last modified 8 years ago

#1283 new defect

job continues when node reboots

Reported by: dlove
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.2u5
Severity: minor
Keywords:
Cc:

Description

This sort of thing has been reported before, but I can't spot it in
the tracker.

This is the log from a (slave) node which rebooted itself after a
memory error. Note that the job was reported as failed, but it
wasn't deleted, and qhost still showed the node running it.

11/14/2010 21:55:52|  main|lvgig020|E|can't find connection
11/14/2010 21:55:52|  main|lvgig020|E|can't get configuration from qmaster -- backgrounding
11/14/2010 21:55:55|  main|lvgig020|I|registered at qmaster host "lv3.nw-grid.ac.uk"
11/14/2010 21:55:55|  main|lvgig020|I|starting up GE 6.2u5 (lx26-amd64)
11/14/2010 21:55:55|  main|lvgig020|I|successfully started PDC and PTF
11/14/2010 21:55:55|  main|lvgig020|I|checking for old jobs
11/14/2010 21:55:55|  main|lvgig020|I|found directory of job "active_jobs/25993.1/1.lvgig020"
11/14/2010 21:55:55|  main|lvgig020|I|shepherd for job active_jobs/25993.1/1.lvgig020 has pid "17551" and is not alive
11/14/2010 21:55:55|  main|lvgig020|E|abnormal termination of shepherd for job 25993.1 task 1.lvgig020: "exit_status" file is empty
11/14/2010 21:55:55|  main|lvgig020|E|can't open usage file "active_jobs/25993.1/1.lvgig020/usage" for job 25993.1: No such file or directory
11/14/2010 21:55:55|  main|lvgig020|E|shepherd exited with exit status 19: before writing exit_status
11/14/2010 21:55:55|  main|lvgig020|I|sending admin mail mail to user "***"|mailer "/bin/mail"|"GE 6.2u5: Job 25993 failed"

The master log shows

11/14/2010 21:55:52|listen|lv3|E|commlib error: endpoint is not unique error (endpoint "lvgig020.nw-grid.ac.uk/execd/1" is already connected)
11/14/2010 21:55:52|listen|lv3|E|commlib error: got select error (Connection reset by peer)
11/14/2010 21:55:55|worker|lv3|I|execd on lvgig020.nw-grid.ac.uk registered
11/14/2010 21:55:55|worker|lv3|I|task 1.lvgig020 at lvgig020.nw-grid.ac.uk of job 25993.1 failed 19
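
For anyone wanting to check their own qmaster messages file for the same pattern, here is a minimal sketch. It assumes the pipe-separated layout seen in the excerpts above (timestamp, thread, host, level, message) and the "task ... of job ... failed N" wording of the last line. The names parse_line and failed_tasks are made up for illustration and are not part of SGE.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one messages-file line into its five fields (assumed layout)."""
    # maxsplit=4 keeps any '|' inside the message text (e.g. the mailer line).
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    ts, thread, host, level, msg = parts
    return LogEntry(ts.strip(), thread.strip(), host.strip(), level.strip(), msg)


# Wording taken from the qmaster excerpt; other SGE versions may differ.
TASK_FAILED = re.compile(r"task (\S+) at (\S+) of job (\d+\.\d+) failed (\d+)")


def failed_tasks(lines: Iterable[str]) -> List[Tuple[str, str, str, int]]:
    """Return (job, task, host, status) for each 'task ... failed N' line."""
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        match = TASK_FAILED.search(entry.message)
        if match:
            task, host, job, status = match.groups()
            found.append((job, task, host, int(status)))
    return found
```

Jobs listed by failed_tasks that still appear as running afterwards would be candidates for this defect. Deciding whether a job is still running is left out, since that depends on how the site queries its scheduler.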

Change History (2)

comment:1 Changed 8 years ago by dlove

Here's another example, where the crashed node was the master for one MPI job (61800) and a slave for another (62291). The job for which the host was master is killed OK, but the other one is not: qhost still shows it running on the crashed node.

When the crashed node restarts, the qmaster shows this for the job for which the node was the MPI master:

04/26/2011 05:39:45|worker|lv3|I|removing trigger to terminate job 61800.1
04/27/2011 05:59:44|worker|lv3|I|removing trigger to terminate job 61800.1
04/28/2011 00:28:27| timer|lv3|I|added trigger to terminate job 61800.1 when runtime limit is reached (224627 + 300)
04/28/2011 06:06:16|worker|lv3|I|removing trigger to terminate job 61800.1
04/28/2011 06:07:34| timer|lv3|I|added trigger to terminate job 61800.1 when runtime limit is reached (204280 + 300)
04/28/2011 09:52:52|worker|lv3|I|removing trigger to terminate job 61800.1
04/28/2011 09:52:56|worker|lv3|I|task 1.lvgig120 at lvgig120.nw-grid.ac.uk of job 61800.1 died through signal KILL
04/28/2011 09:52:56|worker|lv3|E|tightly integrated parallel task 61800.1 task 1.lvgig120 failed - killing job
04/28/2011 09:52:56|worker|lv3|I|task 1.lvgig121 at lvgig121.nw-grid.ac.uk of job 61800.1 died through signal KILL
04/28/2011 09:53:13|worker|lv3|I|task 1.lvgig118 at lvgig118.nw-grid.ac.uk of job 61800.1 died through signal KILL
04/28/2011 09:53:23|worker|lv3|I|task 1.lvgig125 at lvgig125.nw-grid.ac.uk of job 61800.1 died through signal KILL
04/28/2011 09:53:24|worker|lv3|I|task 1.lvgig124 at lvgig124.nw-grid.ac.uk of job 61800.1 died through signal KILL
04/28/2011 09:53:32|worker|lv3|I|removing trigger to terminate job 61800.1

and the rebooted node:

04/28/2011 09:52:52|  main|lvgig122|I|found directory of job "active_jobs/61800.1"
04/28/2011 09:52:52|  main|lvgig122|I|shepherd for job active_jobs/61800.1 has pid "2142" and is not alive
04/28/2011 09:52:52|  main|lvgig122|E|abnormal termination of shepherd for job 61800.1: "exit_status" file is empty
04/28/2011 09:52:52|  main|lvgig122|E|can't open usage file "active_jobs/61800.1/usage" for job 61800.1: No such file or directory
04/28/2011 09:52:52|  main|lvgig122|I|sending admin mail mail to user "d.love@liv.ac.uk"|mailer "/bin/mail"|"GE 6.2u5: Job 61800 failed"

For the job for which this node was a slave, the master shows:

04/28/2011 09:52:52|worker|lv3|I|removing trigger to terminate job 62291.1
04/28/2011 09:52:52|worker|lv3|I|task 1.lvgig122 at lvgig122.nw-grid.ac.uk of job 62291.1 failed 19
04/28/2011 10:37:05|worker|lv3|I|root has registered the job 62291 for deletion
04/28/2011 10:37:06|worker|lv3|I|task 1.lvgig125 at lvgig125.nw-grid.ac.uk of job 62291.1 died through signal KILL
04/28/2011 10:37:06|worker|lv3|I|task 1.lvgig124 at lvgig124.nw-grid.ac.uk of job 62291.1 died through signal KILL
04/28/2011 10:37:08|worker|lv3|I|task 1.lvgig121 at lvgig121.nw-grid.ac.uk of job 62291.1 died through signal KILL
04/28/2011 10:38:18|worker|lv3|I|removing trigger to terminate job 62291.1
04/28/2011 10:38:18|worker|lv3|W|job 62291.1 failed on host lvgig119.nw-grid.ac.uk assumedly after job because: job 62291.1 died through signal KILL (9)

and the rebooted node:

04/28/2011 09:52:52|  main|lvgig122|I|found directory of job "active_jobs/62291.1/1.lvgig122"
04/28/2011 09:52:52|  main|lvgig122|I|shepherd for job active_jobs/62291.1/1.lvgig122 has pid "27520" and is not alive
04/28/2011 09:52:52|  main|lvgig122|E|abnormal termination of shepherd for job 62291.1 task 1.lvgig122: "exit_status" file is empty
04/28/2011 09:52:52|  main|lvgig122|E|can't open usage file "active_jobs/62291.1/1.lvgig122/usage" for job 62291.1: No such file or directory
04/28/2011 09:52:52|  main|lvgig122|I|sending admin mail mail to user "d.love@liv.ac.uk"|mailer "/bin/mail"|"GE 6.2u5: Job 62291 failed"
04/28/2011 10:09:29|  main|lvgig122|E|delayed registering job "62291" task 1.lvgig122 at ptf during startup
04/28/2011 10:09:29|  main|lvgig122|E|can't stat "active_jobs/62291.1/1.lvgig122": No such file or directory
04/28/2011 10:09:29|  main|lvgig122|E|can't stat "active_jobs/62291.1/1.lvgig122": No such file or directory

... ad infinitum until the job is qdel'd.
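
A rough way to spot this loop in an execd messages file is to count the repeated "can't stat" errors per job directory. The sketch below assumes the message wording shown in the excerpt above. The function name repeated_stat_errors and the default threshold are arbitrary choices for illustration, not anything SGE provides.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Matches the quoted path in lines such as:
#   ...|E|can't stat "active_jobs/62291.1/1.lvgig122": No such file or directory
CANT_STAT = re.compile('can\'t stat "(active_jobs/[^"]+)"')


def repeated_stat_errors(lines: Iterable[str], threshold: int = 2) -> Dict[str, int]:
    """Map each active_jobs path to its error count, keeping counts >= threshold."""
    counts: Counter = Counter()
    for line in lines:
        match = CANT_STAT.search(line)
        if match:
            counts[match.group(1)] += 1
    return {path: n for path, n in counts.items() if n >= threshold}
```

A path that keeps accumulating errors here after the node has re-registered suggests a job that will need a manual qdel, as in the case above.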

comment:2 Changed 8 years ago by dlove

An entry in the common problems howto has similar symptoms, but this case seems unrelated.
