Opened 15 years ago
Last modified 8 years ago
#354 new defect
IZ2045: Checkpointing: condition "s" not working as expected
Reported by: | reuti | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.0u7 |
Severity: | | Keywords: | Linux kernel |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2045]
Issue #: 2045
Platform: Other
Reporter: reuti (reuti)
Component: gridengine
OS: Linux
Subcomponent: kernel
Version: 6.0u7
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: andreas (andreas)
QA Contact: andreas
URL:
* Summary: Checkpointing: condition "s" not working as expected
Status whiteboard:
Attachments:
Issue 2045 blocks:
Votes for issue 2045:

Opened: Tue Apr 25 05:20:00 -0700 2006
------------------------

The checkpointing condition "s" is not handled as expected. The documentation states (man checkpoint):

A job is checkpointed, aborted and if possible migrated if the corresponding sge_execd(8) is shut down on the job's machine.

Instead, the job is rescheduled (and has to wait from this point on) when the execd is starting up again.

------- Additional comments from petrik Thu Oct 8 09:26:16 -0700 2009 -------

Can you describe in more detail what your expected behavior is? I want to understand whether this is rather a bug in the man pages or whether the current behavior should be improved.

Thanks, Lubos.

------- Additional comments from reuti Mon Oct 12 08:55:01 -0700 2009 -------

Well, it's some time ago that I looked closely into the checkpointing feature. First of all, the question is what is meant by a shutdown of the execd. This can be done in several ways:

1 - qconf -ke
2 - qconf -kej
3 - /etc/init.d/sgeexecd stop (on the job's machine)
4 - kill $pid (of the execd on the job's machine)

At the time of entering this issue I used 3. The job continues to run on the node (not killed or aborted), while the sgeexecd and the associated shepherd for this job are shut down. When I restart the sgeexecd, the job gets rescheduled and, as I wrote, has to wait in queued state from this point on.

Attention: this is not always the case. There seems to be a race condition in addition, and sometimes it works as it should (at least: the job gets rescheduled, but is still not aborted, as you can check in `ps -e f`).
Nevertheless, as the job isn't aborted, you may have it twice on the machine afterwards, like job 1927 here after the sgeexecd is restarted:

    26822 ? Ss 0:00 /bin/sh /var/spool/sge/pc15381/job_scripts/1926
    26823 ? S  0:00  \_ sleep 600
    27074 ? Ss 0:00 /bin/sh /var/spool/sge/pc15381/job_scripts/1927
    27075 ? S  0:00  \_ sleep 600
    27215 ? Sl 0:00 /usr/sge/bin/lx24-x86/sge_execd
    27257 ? S  0:00  \_ sge_shepherd-1927 -bg
    27258 ? Ss 0:00      \_ /bin/sh /var/spool/sge/pc15381/job_scripts/1927
    27259 ? S  0:00          \_ sleep 600

For "s" and "x", the documentation issue is that the checkpoints won't get created before the migration (so the beginning of each explanation is wrong). Only in the "m" case will they be triggered to be created.

There are several checkpointing issues: 2037-2045
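
The duplicate-job symptom in the process listing above can be spotted mechanically. The following is only a minimal sketch, not part of Grid Engine: the function name `duplicate_jobs` is hypothetical, and it assumes that job scripts live under a `job_scripts/<job id>` path as in the listing, which may differ on other installations. It counts how many processes run each job script and reports the job IDs that appear more than once.

```python
import re
from collections import Counter

# Assumption: job scripts are started from a ".../job_scripts/<job id>" path,
# as in the listing from this ticket. Other spool layouts need another pattern.
JOB_SCRIPT = re.compile(r"/job_scripts/(\d+)\b")


def duplicate_jobs(ps_lines):
    """Return the sorted job IDs whose job script runs in more than one process."""
    counts = Counter()
    for line in ps_lines:
        match = JOB_SCRIPT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return sorted(job for job, n in counts.items() if n > 1)


# Process listing copied from the ticket (output of `ps -e f`).
SAMPLE = [
    r"26822 ? Ss 0:00 /bin/sh /var/spool/sge/pc15381/job_scripts/1926",
    r"26823 ? S 0:00 \_ sleep 600",
    r"27074 ? Ss 0:00 /bin/sh /var/spool/sge/pc15381/job_scripts/1927",
    r"27075 ? S 0:00 \_ sleep 600",
    r"27215 ? Sl 0:00 /usr/sge/bin/lx24-x86/sge_execd",
    r"27257 ? S 0:00 \_ sge_shepherd-1927 -bg",
    r"27258 ? Ss 0:00 \_ /bin/sh /var/spool/sge/pc15381/job_scripts/1927",
    r"27259 ? S 0:00 \_ sleep 600",
]

if __name__ == "__main__":
    print(duplicate_jobs(SAMPLE))
```

On a live node the same function could be fed the lines of `ps -e f` output instead of `SAMPLE`. A job ID reported here only indicates two processes running the same job script; it does not by itself prove that one of them is a leftover from before the execd restart.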
In 4349/sge: