Opened 15 years ago

Last modified 9 years ago

#221 new defect

IZ1421: Abort mails sent in case of DRMAA job failure are misleading

Reported by: templedf
Owned by:
Priority: low
Milestone:
Component: sge
Version: 6.0u2
Severity:
Keywords: drmaa
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1421]

Issue #: 1421
Reporter: templedf (templedf)
Component: gridengine
Subcomponent: drmaa
Platform: All
OS: All
Version: 6.0u2
CC: None defined
Status: NEW
Priority: P4
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: andreas (andreas)
QA Contact: templedf
URL:
Summary: Abort mails sent in case of DRMAA job failure are misleading
Status whiteboard:
Attachments:

Issue 1421 blocks:
Votes for issue 1421:


   Opened: Wed Jan 19 06:02:00 -0700 2005 
------------------------


If the input path does not contain a "/", the
shepherd will die as soon as it's launched when
delegated file staging is enabled.  The following
DRMAA program reproduces the error:

import org.ggf.drmaa.*;

public class STDIN {
   public static void main (String[] args) {
      SessionFactory factory = SessionFactory.getFactory ();
      Session session = factory.getSession ();

      try {
         session.init (null);
         JobTemplate jt = session.createJobTemplate ();
         jt.setRemoteCommand (args[0]);
         jt.setArgs (new String[] {"5"});
         // The file part of the input path contains no "/".
         jt.setInputPath ("balin:blahblahblah");
         // Request staging of the input file only.
         jt.setTransferFiles (new FileTransferMode (true, false, false));

         String id = session.runJob (jt);

         System.out.println ("Your job has been submitted with id " + id);

         session.deleteJobTemplate (jt);
         session.exit ();
      }
      catch (DrmaaException e) {
         System.out.println ("Error: " + e.getMessage ());
      }
   }
}
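
A submitter who wants to avoid the failure described above could check
the path before calling runJob. The sketch below is a hypothetical
helper, not part of the DRMAA API: the class and method names are
invented for illustration, and it assumes the DRMAA path form
"[hostname]:file_path". It only tests whether the file part contains
a "/".

```java
// Hypothetical pre-submission check; not part of the DRMAA API.
final class InputPathCheck {
    /**
     * Returns the file part of a path of the assumed form
     * "[hostname]:file_path". If there is no colon, the whole
     * string is treated as the file part.
     */
    static String filePart(String drmaaPath) {
        int colon = drmaaPath.indexOf(':');
        return colon < 0 ? drmaaPath : drmaaPath.substring(colon + 1);
    }

    /** True if the file part contains at least one "/". */
    static boolean hasDirectoryComponent(String drmaaPath) {
        return filePart(drmaaPath).indexOf('/') >= 0;
    }

    public static void main(String[] args) {
        // The path from the reproducer has no "/" in its file part.
        System.out.println(hasDirectoryComponent("balin:blahblahblah"));      // false
        System.out.println(hasDirectoryComponent("balin:/tmp/blahblahblah")); // true
    }
}
```

Such a check only flags the path shape reported here; it says nothing
about whether the file actually exists on the execution host.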

   ------- Additional comments from sgrell Wed Jan 19 07:11:16 -0700 2005 -------
An email is sent, and it states that the job is in error state.
However, the job is gone without leaving any trace, except for
errors in the message files and the email:

Job 864 caused action: Job 864 set to ERROR
 User        = dant
 Queue       = all.q@bolek
 Host        = bolek
 Start Time  = <unknown>
 End Time    = <unknown>
failed opening input/output file:01/19/2005 13:47:56 [40240:130650]:
error: can't open /tmp/864.1.all.q/blahblahblah as dummy input f
Shepherd trace:
01/19/2005 13:47:56 [40240:130647]: shepherd called with uid = 0, euid = 40240
01/19/2005 13:47:56 [40240:130647]: starting up 6.0u3
01/19/2005 13:47:56 [40240:130647]: setpgid(130647, 130647) returned 0
01/19/2005 13:47:56 [40240:130647]: no prolog script to start
01/19/2005 13:47:56 [40240:130647]: forked "job" with pid 130650
01/19/2005 13:47:56 [40240:130647]: child: job - pid: 130650
01/19/2005 13:47:56 [40240:130650]: pid=130650 pgrp=130650 sid=130650 old pgrp=130647 getlogin()=dant
01/19/2005 13:47:56 [40240:130650]: setosjobid: uid = 0, euid = 40240
01/19/2005 13:47:56 [40240:130650]: RLIMIT_CPU setting: (soft 9223372036854775807 hard 9223372036854775807) resulting: (soft 9223372036854775807 hard 9223372036854775807)
01/19/2005 13:47:56 [40240:130650]: RLIMIT_FSIZE setting: (soft 9223372036854775807 hard 9223372036854775807) resulting: (soft 9223372036854775807 hard 9223372036854775807)
01/19/2005 13:47:56 [40240:130650]: RLIMIT_DATA setting: (soft 9223372036854775807 hard 9223372036854775807) resulting: (soft 1073741824 hard 9223372036854775807)
01/19/2005 13:47:56 [40240:130650]: RLIMIT_STACK setting: (soft 9223372036854775807 hard 9223372036854775807) resulting: (soft 33554432 hard 9223372036854775807)
01/19/2005 13:47:56 [40240:130650]: RLIMIT_CORE setting: (soft 9223372036854775807 hard 9223372036854775807) resulting: (soft 9223372036854775807 hard 9223372036854775807)
01/19/2005 13:47:56 [40240:130650]: RLIMIT_VMEM setting: (soft 9223372036854775807 hard 9223372036854775807) resulting: (soft 4398046511104 hard 9223372036854775807)
01/19/2005 13:47:56 [40240:130650]: RLIMIT_RSS setting: (soft 9223372036854775807 hard 9223372036854775807) resulting: (soft 1029054464 hard 9223372036854775807)
01/19/2005 13:47:56 [40240:130650]: closing all filedescriptors
01/19/2005 13:47:56 [40240:130650]: further messages are in "error" and "trace"
01/19/2005 13:47:56 [40240:130647]: wait3 returned 130650 (status: 6656; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 26)
01/19/2005 13:47:56 [40240:130647]: job exited with exit status 26
01/19/2005 13:47:56 [40240:130647]: reaped "job" with pid 130650
01/19/2005 13:47:56 [40240:130647]: job exited not due to signal
01/19/2005 13:47:56 [40240:130647]: job exited with status 26
01/19/2005 13:47:56 [40240:130647]: now sending signal KILL to pid -130650
01/19/2005 13:47:56 [40240:130647]: no tasker to notify
01/19/2005 13:47:56 [40240:130647]: failed starting job
01/19/2005 13:47:56 [40240:130647]: no epilog script to start

Shepherd error:
01/19/2005 13:47:56 [40240:130650]: error: can't open /tmp/864.1.all.q/blahblahblah as dummy input file

Shepherd pe_hostfile:
bolek 1 all.q@bolek UNDEFINED
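
The wait3 status in the shepherd trace can be decoded by hand. The
sketch below assumes the traditional Unix layout of the wait status
(termination signal in the low 7 bits, exit code in the next 8 bits),
which is common but not guaranteed on every platform; the class and
method names are illustrative only. Under that assumption, 6656
corresponds to a normal exit with status 26, matching the WIFEXITED
and WEXITSTATUS values logged.

```java
// Illustrative decoder for a raw wait status; assumes the traditional
// Unix encoding described above.
final class WaitStatus {
    /** True if the low 7 bits (termination signal) are zero. */
    static boolean exited(int status) {
        return (status & 0x7f) == 0;
    }

    /** Exit code stored in bits 8..15. */
    static int exitStatus(int status) {
        return (status >> 8) & 0xff;
    }

    public static void main(String[] args) {
        int status = 6656; // value from the shepherd trace
        System.out.println("WIFEXITED=" + exited(status)
                + " WEXITSTATUS=" + exitStatus(status));
        // prints: WIFEXITED=true WEXITSTATUS=26
    }
}
```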




   ------- Additional comments from andreas Mon Jan 24 09:44:06 -0700 2005 -------
Delegated file staging expects the prolog to return error code 100
if file staging fails (see sge_conf(5)). Was a corresponding prolog
configured in this case?

   ------- Additional comments from andreas Mon Jan 24 10:07:47 -0700 2005 -------
Jobs submitted through DRMAA in principle never enter the error state,
because there is no error state in DRMAA. If drmaa_wait() indicates
job failure, nothing is wrong except the misleading text
"Job 864 set to ERROR". A text such as "Job 864 failed" would
certainly be better.

So the interesting question is whether drmaa_wait() would have
indicated job failure as foreseen ... ?

   ------- Additional comments from andreas Tue Jan 25 03:14:49 -0700 2005 -------
There is no misbehaviour besides the misleading email.

Changing summary, priority and subcomponent.
