[GE users] Suspending jobs submitted with -notify

Olle Liljenzin olle at carmen.se
Fri Jan 28 14:41:11 GMT 2005


Dan Gruhn wrote:
> Ollie,
> 
> I've been wrestling with qsub -notify for a week or more myself.  Please 
> take a look at issue #1440 to see what I have observed as problems.  
> What I have seen about USR1 is to do the following in my job script:
> 
> trap "" USR1
> 
> This tells the system to just ignore the USR1 signal.  I found that 
> trying to handle it, even to output a status message, was a problem if I 
> wasn't planning to exit after handling it.  In my case, my script1 had 
> usually called another script2 and if USR1 came in and I tried to output 
> a message, it would look to my script1 as if script2 had returned with 
> an exit status greater than 128.
> 
> Are you using a script for your job?
> 
> Have you tried just ignoring USR1?

I first tried with a script. Then I wrote a short C program to make sure 
bash wasn't involved somehow:

trap.c:
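	#include <signal.h>
	#include <stdio.h>

	/* Shared by every set_handler() call; static storage, so zero-initialised. */
	struct sigaction sa;

	/* Report each caught signal. fprintf() is not async-signal-safe, but it
	   is good enough for this test. */
	void handler(int sig)
	{
	  fprintf(stderr, "signal %d\n", sig);
	}

	/* Try to install the handler; this fails for signals that cannot be
	   caught, such as KILL and STOP. */
	void set_handler(int sig)
	{
	  if(sigaction(sig, &sa, 0) == 0)
	    fprintf(stderr, "Installed handler for signal %d\n", sig);
	  else
	    fprintf(stderr, "Failed to install handler for signal %d\n", sig);
	}

	int main(int argc, char *argv[])
	{
	  int i;

	  sa.sa_handler = handler;
	  sigemptyset(&sa.sa_mask);  /* block no extra signals in the handler */
	  sa.sa_flags = 0;

	  for(i = 1; i < 64; i++)
	    set_handler(i);

	  /* Busy-wait so the job stays alive until it is signalled. */
	  while(1);

	  return 0;
	}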

When submitted with 'qrsh -notify' the process will print 'signal 10' 
(on Linux) after a 'qmod -sj'. When submitted with 'qsub -notify' it 
will die silently after printing 'signal 10'. Why the different behaviour?
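
One way to narrow this down might be to record which process sent each 
signal. The following is only a sketch, not something I have run under 
Grid Engine: the names record_signal, install_recorder and 
report_last_signal are made up for illustration, and si_pid is only 
meaningful for signals sent with kill() or similar. A KILL can still not 
be observed this way.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Written by the handler, read by normal code. */
static volatile sig_atomic_t last_signal = 0;
static volatile sig_atomic_t last_sender = 0;

/* Only stores values, because fprintf() is not async-signal-safe. */
static void record_signal(int sig, siginfo_t *info, void *context)
{
  (void)context;
  last_signal = sig;
  last_sender = (sig_atomic_t)info->si_pid;
}

/* Hypothetical helper: returns 0 on success and -1 on failure
   (sigaction() refuses KILL and STOP). */
int install_recorder(int sig)
{
  struct sigaction sa;

  assert(sig > 0);  /* signal numbers start at 1 */
  memset(&sa, 0, sizeof sa);
  sa.sa_sigaction = record_signal;
  sa.sa_flags = SA_SIGINFO;  /* ask for the three-argument handler */
  sigemptyset(&sa.sa_mask);
  return sigaction(sig, &sa, 0);
}

/* Call from the main loop, outside the handler. */
void report_last_signal(void)
{
  if (last_signal != 0)
    fprintf(stderr, "signal %d from pid %ld\n",
            (int)last_signal, (long)last_sender);
}
```

Calling install_recorder() for each signal number and 
report_last_signal() from the main loop would show whether a second 
catchable signal arrives, and from which pid, before the process 
disappears.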

> What have you set your notify time to on your job queues?

notify                00:00:60

> Note that USR2 is very helpful as it lets your job know that it is about 
> to be killed and you can do some cleanup before that.  It has some 
> problems, as I have noted in Issue 1440.

That does not explain why the process gets killed. The program managed 
to set up a handler for all signals except KILL and STOP, so sending a 
USR2 to the process will just print a message on stderr.

> -Dan
> 
> 
> On Fri, 2005-01-28 at 05:05, Olle Liljenzin wrote:
> 
>>I have problems with jobs submitted with 'qsub -notify' getting killed 
>>when suspending them with 'qmod -sj'.
>>
>>I have set up a trap for all signals that can be caught. When the job is 
>>suspended it reports that SIGUSR1 was caught, but in the next moment the 
>>process is just gone. The only reason I can see for it disappearing 
>>would be that directly after SIGUSR1 it gets a second signal 
>>that kills the process.
>>
>>Suspending a job that was submitted with 'qrsh -notify' works as 
>>expected. The process gets a SIGUSR1 and after a while it goes to sleep.
>>
>>Is it something I should change in the configuration or is it just that 
>>I don't understand how it is supposed to work?
>>
>>I'm running version 6.0u3 and I have tried it on Linux, Solaris, AIX and 
>>HP-UX with the same result.
>>
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>


