[GE users] Suspending jobs submitted with -notify

Reuti reuti at staff.uni-marburg.de
Fri Jan 28 15:31:26 GMT 2005



As Dan pointed out: are you using trap "" USR1 in your script? If not, the 
enclosing bash will be killed, and with it the child process it started, because 
the default action for SIGUSR1 in bash is to terminate. And as also mentioned, 
you can't run a command from a trap handler in the enclosing bash: once the 
trap has been handled, bash does not go back to waiting for the started program 
to return. Instead it continues with the next step of your bash script.
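
A minimal sketch of the ignore approach, runnable outside Grid Engine. The 
"sleep 1" child is only a stand-in for the real program, and the "kill" line 
simulates the notification signal; both are assumptions for illustration:

```shell
#!/bin/bash
# Ignore SIGUSR1 in this shell. An ignored signal stays ignored in the
# processes the shell starts afterwards.
trap "" USR1

# Hypothetical stand-in for the real program.
sleep 1 &
child=$!

# Simulate the notification by signalling both the shell and the child.
kill -USR1 $$ "$child"

# Both processes are still alive, so the shell can wait for the child normally.
wait "$child"
status=$?
echo "child exit status: $status"
```

Without the trap line, the default action for SIGUSR1 would terminate the 
shell at the "kill" step.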

This explains the observed behavior. In any case, you can try starting your 
program in the script with "exec" prefixed to the program call, like "exec 
~/a.out". This way the enclosing bash is replaced by your main C program.
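
A small sketch of what "exec" does to the process, assuming nothing about 
Grid Engine itself. An inner throwaway shell prints its PID and then execs a 
stand-in program ("sh -c", hypothetical, in place of ~/a.out) that prints its 
own PID:

```shell
#!/bin/bash
# The inner bash prints its PID, then replaces itself via exec with another
# process, which prints its own PID.
out=$(bash -c 'echo "$$"; exec sh -c "echo \$\$"')

shell_pid=$(echo "$out" | sed -n 1p)
prog_pid=$(echo "$out" | sed -n 2p)

# The two PIDs match: after exec no separate shell process is left around
# the program, so signals sent to the job reach the program directly.
echo "shell pid: $shell_pid, program pid: $prog_pid"
```

In a real job script the last line would simply be "exec ~/a.out"; anything 
after it in the script is never reached.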

Cheers - Reuti


Quoting Olle Liljenzin <olle at carmen.se>:

> Dan Gruhn wrote:
> > Ollie,
> > 
> > I've been wrestling with qsub -notify for a week or more myself.  Please 
> > take a look at issue #1440 to see what I have observed as problems.  
> > What I have seen about USR1 is to do the following in my job script:
> > 
> > trap "" USR1
> > 
> > This tells the system to just ignore the USR1 signal.  I found that 
> > trying to handle it, even to output a status message, was a problem if I 
> > wasn't planning to exit after handling it.  In my case, my script1 had 
> > usually called another script2 and if USR1 came in and I tried to output 
> > a message, it would look to my script1 as if script2 had returned with 
> > an exit status greater than 128.
> > 
> > Are you using a script for your job?
> > 
> > Have you tried just ignoring USR1?
> 
> I first tried with a script. Then I wrote a short C program to make sure 
> bash wasn't involved somehow:
> 
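> trap.c:
> 	#include <signal.h>
> 	#include <stdio.h>
> 
> 	/* File scope, so zero-initialized: empty sa_mask, sa_flags == 0. */
> 	struct sigaction sa;
> 
> 	/* Report each caught signal. fprintf is not async-signal-safe;
> 	   tolerable only because this is a small test program. */
> 	void handler(int sig)
> 	{
> 	  fprintf(stderr, "signal %d\n", sig);
> 	}
> 
> 	/* Install the handler for one signal and report the outcome. */
> 	void set_handler(int sig)
> 	{
> 	  if(sigaction(sig, &sa, 0) == 0)
> 	    fprintf(stderr, "Installed handler for signal %d\n", sig);
> 	  else
> 	    fprintf(stderr, "Failed to install handler for signal %d\n", sig);
> 	}
> 
> 	int main(int argc, char *argv[])
> 	{
> 	  int i;
> 
> 	  sa.sa_handler = handler;
> 
> 	  /* Try every signal number; SIGKILL and SIGSTOP cannot be
> 	     caught, so sigaction fails for those. */
> 	  for(i = 1; i < 64; i++)
> 	    set_handler(i);
> 
> 	  /* Busy-wait so the process stays alive to receive signals. */
> 	  while(1);
> 
> 	  return 0;
> 	}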
> 
> When submitted with 'qrsh -notify' the process will print 'signal 10' 
> (on Linux) after a 'qmod -sj'. When submitted with 'qsub -notify' it 
> will die silently after printing 'signal 10'. Why the different behaviour?
> 
> > What have you set your notify time to on your job queues?
> 
> notify                00:00:60
> 
> > Note that USR2 is very helpful as it lets your job know that it is about 
> > to be killed and you can do some cleanup before that.  It has some 
> > problems, as I have noted in Issue 1440.
> 
> That does not explain why the process gets killed. It managed to set up a 
> handler for all signals except KILL and STOP. Sending a USR2 to the 
> process will just print a message on stderr.
> 
> > -Dan
> > 
> > 
> > On Fri, 2005-01-28 at 05:05, Olle Liljenzin wrote:
> > 
> >>/I have problems with jobs submitted with 'qsub -notify' getting killed 
> >>when suspending them with 'qmod -sj'.
> >>
> >>I have set up a trap for all signals that can be caught. When the job is 
> >>suspended it reports that SIGUSR1 was caught, but in the next moment the 
> >>process is just gone. The only reason I can see for it 
> >>disappearing would be that directly after SIGUSR1 it gets a second signal 
> >>that kills the process.
> >>
> >>Suspending a job that was submitted with 'qrsh -notify' works as 
> >>expected. The process gets a SIGUSR1 and after a while it falls into
> >>sleep.
> >>
> >>Is it something I should change in the configuration or is it just that 
> >>I don't understand how it is supposed to work?
> >>
> >>I'm running version 6.0u3 and I have tried it on Linux, Solaris, AIX and 
> >>HP-UX with the same result.
> >>
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>/
> >>
> 
> 
> 
> 






