Opened 13 years ago

Last modified 9 years ago

#416 new defect

IZ2223: Tightly integrated interactive parallel job killed when a task fails

Reported by: sgaure
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.0u8
Severity:
Keywords: Linux qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2223]

Issue #: 2223
Platform: All
Reporter: sgaure (sgaure)
Component: gridengine
Subcomponent: qmaster
OS: Linux
Version: 6.0u8
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: ernst (ernst)
QA Contact: ernst
URL:
Summary: Tightly integrated interactive parallel job killed when a task fails
Status whiteboard:
Attachments:
Issue 2223 blocks:
Votes for issue 2223:


   Opened: Tue Mar 27 01:56:00 -0700 2007 
------------------------


A tightly integrated interactive parallel job is killed if one of its tasks
fails.  The problem has been observed in 6.0u8, but it appears to affect the
current CVS version as well.

Example:
$ qconf -sp mpich2
...
control_slaves TRUE
job_is_first_task FALSE
...
# mpich2 has been set up to start its remote processes with "qrsh -inherit"
# It works flawlessly except when something goes wrong:

$ qlogin -pe mpich2 8
...
Your interactive job 33832 has been successfully scheduled.
Establishing /opt/gridengine/bin/rocks-qlogin.sh session to host compute-4-11.local
...

...
mpiexec -n 8 -machinefile ... /bin/sleep 120
...
^C
$ # just wait for some seconds
Connection to compute-4-11.local closed by remote host.
Connection to compute-4-11.local closed.
/opt/gridengine/bin/rocks-qlogin.sh exited with exit code 255

The qmaster messages file says:

03/27/2007 10:42:34|qmaster|commander|E|tightly integrated parallel task 33832.1 task 2.compute-3-17 failed - killing job
03/27/2007 10:43:06|qmaster|commander|W|job 33832.1 failed on host compute-4-11.local assumedly after job because: job 33832.1 died through signal KILL (9)
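
To check how often this happens, the messages file can be scanned for the
"failed - killing job" entries. The following is only a rough sketch in
Python, not part of Grid Engine. It assumes each entry is a single line with
five "|"-separated fields, as in the two lines quoted in this report, and that
"33832.1" means job id and array task id. The names TaskFailure and
parse_task_failure are made up for this example.

```python
import re
from typing import NamedTuple, Optional


class TaskFailure(NamedTuple):
    """One 'failed - killing job' entry (hypothetical helper type)."""
    timestamp: str
    host: str        # third field of the line; assumed to be the logging host
    job_id: int      # assumed: number before the dot in "33832.1"
    array_task: int  # assumed: number after the dot in "33832.1"
    pe_task: str     # e.g. "2.compute-3-17"


# Message text copied from the log line quoted in this report.
_PATTERN = re.compile(
    r"tightly integrated parallel task (\d+)\.(\d+) task (\S+) "
    r"failed - killing job"
)


def parse_task_failure(line: str) -> Optional[TaskFailure]:
    """Return a TaskFailure if the line is a task-failure entry, else None."""
    # Split into at most five fields so any "|" in the message is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, _daemon, host, _level, message = parts
    match = _PATTERN.search(message)
    if match is None:
        return None
    return TaskFailure(
        timestamp, host, int(match.group(1)), int(match.group(2)), match.group(3)
    )


# The two log lines from this report, unwrapped.
SAMPLE = [
    "03/27/2007 10:42:34|qmaster|commander|E|tightly integrated parallel task "
    "33832.1 task 2.compute-3-17 failed - killing job",
    "03/27/2007 10:43:06|qmaster|commander|W|job 33832.1 failed on host "
    "compute-4-11.local assumedly after job because: job 33832.1 died through "
    "signal KILL (9)",
]

failures = [f for f in map(parse_task_failure, SAMPLE) if f is not None]
for f in failures:
    print(f.timestamp, "job", f.job_id, "killed after task", f.pe_task, "failed")
```

Only the first sample line matches; the second is the follow-up warning about
the job dying through signal KILL and is skipped.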

So pressing Ctrl-C on an MPI program inside an interactive job kills the whole job.

This behaviour is probably fine most of the time for batch jobs, but for
interactive jobs it is usually a nuisance, because interactive jobs are
typically used for debugging.

Change History (0)
