Opened 6 years ago

Closed 6 years ago

#1449 closed enhancement (fixed)

Behavior of checkpoint_method on job termination

Reported by: wish
Owned by: Dave Love <d.love@…>
Priority: normal
Milestone:
Component: sge
Version: 6.2u3
Severity: minor
Keywords:
Cc:

Description

If a job is terminated (with ENABLE_ADDGRP_KILL=true) while it is being checkpointed, the ckpt_command is not killed by Grid Engine. This can cause problems with some checkpointing tools (e.g. the ompi-checkpoint command from Open MPI when used with BLCR), which do not terminate if the processes they are trying to checkpoint are killed. This isn't too hard to work around, but it should be documented.

Possibly one could delay termination of a job until after ckpt_command has finished running.
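
The ticket does not say which workaround the reporter had in mind. One possibility, sketched here purely as an assumption, is to point ckpt_command at a small wrapper that runs the real checkpoint tool under a time limit and kills it if it hangs. The function name run_checkpoint and the timeout value are hypothetical and are not part of Grid Engine.

```python
import subprocess
import sys


def run_checkpoint(cmd, timeout_s):
    """Run cmd and kill it if it has not exited after timeout_s seconds.

    Returns the command's exit status, or None if it had to be killed.
    Only the direct child is killed; any grandchildren it spawned are
    not tracked by this sketch.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the child so no zombie is left behind
        return None
```

A wrapper script would call something like run_checkpoint(sys.argv[1:], 300.0) and exit non-zero when the result is None, so that a stuck checkpoint command cannot outlive the job indefinitely.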

Change History (2)

comment:1 Changed 6 years ago by dlove

Is the additional group not actually added to the command currently?

That would include the hook commands in the accounting. It's not clear
to me if that's appropriate but it seems reasonable. WDYT?

comment:2 Changed 6 years ago by Dave Love <d.love@…>

  • Owner set to Dave Love <d.love@…>
  • Resolution set to fixed
  • Status changed from new to closed

In 4490/sge:

Fix #1449: add supplementary group to async-started commands
Enables them to be killed, but adds them to accounting -- arguably the
right thing anyway (e.g. expensive checkpointing).
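
To illustrate why the supplementary group matters: with ENABLE_ADDGRP_KILL, the processes to terminate are identified by the job's additional group ID, so a command started without that group is invisible to the kill. The sketch below is not Grid Engine's implementation. It assumes a Linux /proc filesystem, and the helper names and the GID 20001 are hypothetical. It only shows how processes carrying a given supplementary GID can be picked out from the "Groups:" line of /proc/<pid>/status.

```python
import os


def supplementary_groups(status_text):
    """Return the set of GIDs on the 'Groups:' line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            return {int(g) for g in line.split()[1:]}
    return set()


def pids_in_group(gid, statuses):
    """Given a mapping pid -> status text, return the sorted pids that carry gid."""
    return sorted(
        pid for pid, text in statuses.items()
        if gid in supplementary_groups(text)
    )


def read_proc_statuses(proc_root="/proc"):
    """Collect status text for every numeric entry under proc_root (Linux only)."""
    statuses = {}
    for name in os.listdir(proc_root):
        if name.isdigit():
            try:
                with open(os.path.join(proc_root, name, "status")) as f:
                    statuses[int(name)] = f.read()
            except OSError:
                continue  # process exited or is not readable
    return statuses


# Hypothetical snapshot: pid 101 carries the job's extra group, pid 102 does not.
sample = {
    101: "Name:\tjob\nGroups:\t100 20001 \n",
    102: "Name:\tompi-checkpoint\nGroups:\t100 \n",
}
print(pids_in_group(20001, sample))
```

In this snapshot only pid 101 is selected. The checkpoint command (pid 102) lacks the group and would be missed, which is the situation described in this ticket.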
