Opened 13 years ago

Last modified 7 years ago

#354 new defect

IZ2045: Checkpointing: condition "s" not working as expected

Reported by: reuti
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.0u7
Severity:
Keywords: Linux kernel
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2045]

Issue #: 2045
Platform: Other
Reporter: reuti (reuti)
Component: gridengine
OS: Linux
Subcomponent: kernel
Version: 6.0u7
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: andreas (andreas)
QA Contact: andreas
URL:
Summary: Checkpointing: condition "s" not working as expected
Status whiteboard:
Attachments:

Issue 2045 blocks:
Votes for issue 2045:


   Opened: Tue Apr 25 05:20:00 -0700 2006 
------------------------


The checkpointing condition "s" is not handled as expected. The documentation (man checkpoint) states:

A job is checkpointed, aborted and if possible migrated if the corresponding sge_execd(8) is shut down on the job's machine.

Instead, the job is rescheduled (and has to wait from that point on) when the execd starts up again.

   ------- Additional comments from petrik Thu Oct 8 09:26:16 -0700 2009 -------
Can you describe in more detail what your expected behavior is?

I want to understand whether this is a bug in the man pages or whether the current behavior should be improved.

Thanks,
   Lubos.

   ------- Additional comments from reuti Mon Oct 12 08:55:01 -0700 2009 -------
Well, it has been some time since I looked closely into the checkpointing feature. First of all, the question is what is meant by a shutdown of the execd. This can be done in several ways:

1 - qconf -ke
2 - qconf -kej
3 - /etc/init.d/sgeexecd stop (on the job's machine)
4 - kill $pid (of the execd on the job's machine)

At the time of entering this issue I used method 3. The job continues to run on the node (it is neither killed nor aborted), while the sgeexecd and the shepherd associated with this job are shut down. When I restart the sgeexecd, the job gets rescheduled and, as I wrote, has to wait in the queued state from that point on. Note that this is not always the case: there also seems to be a race condition, and sometimes it works as it should (at least the job gets rescheduled, although it is still not aborted, as you can check with `ps -e f`). Nevertheless, because the job isn't aborted, you may end up with it running twice on the machine, like job 1927 here after the sgeexecd was restarted:

26822 ?        Ss     0:00 /bin/sh /var/spool/sge/pc15381/job_scripts/1926
26823 ?        S      0:00  \_ sleep 600
27074 ?        Ss     0:00 /bin/sh /var/spool/sge/pc15381/job_scripts/1927
27075 ?        S      0:00  \_ sleep 600
27215 ?        Sl     0:00 /usr/sge/bin/lx24-x86/sge_execd
27257 ?        S      0:00  \_ sge_shepherd-1927 -bg
27258 ?        Ss     0:00      \_ /bin/sh /var/spool/sge/pc15381/job_scripts/1927
27259 ?        S      0:00          \_ sleep 600
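
A listing like the one above can be checked mechanically for jobs that run twice. The following is only a minimal sketch: it assumes the `ps -e f` layout and the `.../job_scripts/<job id>` spool path seen in this report, and the names JOB_SCRIPT, duplicated_jobs and SAMPLE are hypothetical helpers, not part of Grid Engine.

```python
import re
from collections import Counter

# Assumption: the command column ends with a job script path such as
# /var/spool/sge/pc15381/job_scripts/1927 (layout taken from this report).
JOB_SCRIPT = re.compile(r"/job_scripts/(\d+)\s*$")


def duplicated_jobs(ps_output):
    """Return the job IDs whose job script shows up more than once."""
    counts = Counter()
    for line in ps_output.splitlines():
        match = JOB_SCRIPT.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(job for job, seen in counts.items() if seen > 1)


# The process listing from this comment, copied verbatim (raw string,
# because the tree markers contain backslashes).
SAMPLE = r"""
26822 ?        Ss     0:00 /bin/sh /var/spool/sge/pc15381/job_scripts/1926
26823 ?        S      0:00  \_ sleep 600
27074 ?        Ss     0:00 /bin/sh /var/spool/sge/pc15381/job_scripts/1927
27075 ?        S      0:00  \_ sleep 600
27215 ?        Sl     0:00 /usr/sge/bin/lx24-x86/sge_execd
27257 ?        S      0:00  \_ sge_shepherd-1927 -bg
27258 ?        Ss     0:00      \_ /bin/sh /var/spool/sge/pc15381/job_scripts/1927
27259 ?        S      0:00          \_ sleep 600
"""

if __name__ == "__main__":
    print(duplicated_jobs(SAMPLE))
```

Run against the listing above, this reports job 1927 only: its script appears once as an orphan (PID 27074) and once under the new shepherd (PID 27258), while job 1926 appears once.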

For "s" and "x", the documentation issue is that the checkpoints are not created before the migration (so the beginning of each explanation is wrong). Only in the "m" case is their creation triggered.

There are several checkpointing issues: 2037-2045

Change History (1)

comment:1 Changed 7 years ago by Dave Love <d.love@…>

In 4349/sge:

Man fixes (fixes #1436, refs #354)
Moves sge_ckpt.1 to sge_ckpt.5
