[GE users] Scripts differing between user environment and GridEngine environment

Timothy Bateman timb at cisra.canon.com.au
Tue Apr 12 02:59:04 BST 2005


On Sat, 9 Apr 2005, Ron Chen wrote:

> what is the value of "shell_start_mode"?
> 
> qconf -mconf, then set it to "unix_behavior".
> 
>  -Ron
> 

Ron,

	This certainly helped. I can now run the script, but I see 
different interrupt trapping behaviour between a GE run and a non-GE run. 

Here are my scripts (the parent first):

#!/bin/sh

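# Kill the background child before the parent exits.
# (Plain "name()" syntax, since the "function" keyword is not POSIX sh.)
cleanup() {
  echo "Cleaning up by killing processId " $processId
  kill -TERM "$processId"
}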

trap 'echo "Detected USR1"' USR1
trap 'echo "Detected USR2"' USR2
trap 'echo "Detected ABRT"' ABRT
trap 'echo "Detected TERM";cleanup;exit 1' TERM
trap 'echo "Detected SUSPEND"' TSTP 
trap 'echo "Detected RESUME"' CONT

/u/timb/bin/echo_something &
processId=$!

while true
do
  echo "Parent loop running with PID" $$
  sleep 5
done


and the child script:

#!/bin/sh

trap 'echo "Child Detected USR1"' USR1
trap 'echo "Child Detected USR2"' USR2
trap 'echo "Child Detected ABRT"' ABRT
trap 'echo "Child Detected SUSPEND"' TSTP 
trap 'echo "Child Detected RESUME"' CONT

while true
do
  echo "Child PID is" $$
  sleep 5
done

When I run it in an xterm and issue the following commands:

werbin timb > kill -USR1 1946
werbin timb > kill -TSTP 1946
werbin timb > kill -CONT 1946
werbin timb > kill -USR2 1946
werbin timb > kill -TERM 1946

I get the following output:

werbin timb > ./bin/trap_bashtest2.sh    
Parent loop running with PID 1946
Child PID is 1947
Parent loop running with PID 1946
Child PID is 1947
Detected USR1
Parent loop running with PID 1946
Child PID is 1947
Parent loop running with PID 1946
Child PID is 1947
Detected SUSPEND
Parent loop running with PID 1946
Child PID is 1947
Parent loop running with PID 1946
Child PID is 1947
Parent loop running with PID 1946
Child PID is 1947
Parent loop running with PID 1946
Child PID is 1947
Detected RESUME
Parent loop running with PID 1946
Child PID is 1947
Child PID is 1947
Parent loop running with PID 1946
Child PID is 1947
Detected USR2
Parent loop running with PID 1946
Child PID is 1947
Parent loop running with PID 1946
Detected TERM
Child PID is 1947
Cleaning up by killing processId  1947

Which is as I expect.

Yet when I run it as a GE job, with the following actions:

SUSPEND - RESUME - DELETE

I get:

Parent loop running with PID 22506
Child PID is 22507
Detected USR1
Child Detected USR1
Parent loop running with PID 22506
Child PID is 22507
Parent loop running with PID 22506
Child PID is 22507
Detected RESUME
Child Detected RESUME
Parent loop running with PID 22506
Child PID is 22507
Parent loop running with PID 22506
Child PID is 22507
Detected USR2
Child Detected USR2
Parent loop running with PID 22506
Child PID is 22507
Parent loop running with PID 22506
Child PID is 22507

So the child process is receiving signals that it never sees in the non-GE 
case, and the TERM isn't trapped.

My motivation behind this is to kill a process running within the script 
before the parent dies, do some cleanup of temporary directories, and 
then exit the parent.

I am submitting a job with qsub -notify.
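
As a rough illustration of that goal (a sketch under stated assumptions, not
a tested GE recipe): the qsub man page describes -notify as sending SIGUSR1
ahead of a suspend and SIGUSR2 ahead of a delete, which matches the GE output
above. The names workdir, child_pid and stop_requested are hypothetical, the
"sleep 30" stands in for the real worker, and the helper subshell only
simulates GE sending the delete warning so the example terminates on its own.

```shell
#!/bin/sh
# Sketch only: assumes the job is submitted with `qsub -notify`, so that
# SIGUSR1 arrives before a suspend and SIGUSR2 before a delete.

workdir=$(mktemp -d)    # hypothetical scratch directory
child_pid=              # set once the worker has been started
stop_requested=0

cleanup() {
    # Stop the worker first, then remove the temporary files.
    if [ -n "$child_pid" ]; then
        kill -TERM "$child_pid" 2>/dev/null
        wait "$child_pid" 2>/dev/null
    fi
    rm -rf "$workdir"
}

trap ':' USR1                      # suspend warning: do nothing, but avoid the default termination
trap 'stop_requested=1' USR2 TERM  # delete warning: ask the main loop to stop

sleep 30 &              # stand-in for the real worker process
child_pid=$!

# Demonstration only: a helper that plays the role of GE and sends the
# delete warning to this script after one second.
( sleep 1; kill -USR2 $$ ) &

while [ "$stop_requested" -eq 0 ]; do
    sleep 1             # the real job would do its work here
done

cleanup
echo "cleanup finished"
```

In a real job the helper subshell would be dropped and the script would
probably end with "exit 1" after cleanup. Since the child also appears to
receive USR1 and USR2 under GE, it would presumably need its own handlers (or
ignore those signals) to survive until the parent cleans up.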

Any suggestions as to how best to get this working?

Many thanks,

Tim Bateman



