[GE users] qdel does not kill the job when using pag command?

Duc Bao Ta ta at physik.uni-bonn.de
Fri Jun 1 10:28:39 BST 2007



Reuti wrote:
> On 31.05.2007 at 14:54, Duc Bao Ta wrote:
>
>> Hi Reuti,
>>
>> I always try to delete the jobs with qdel and if after that nothing
>> happens I try qdel -f.
>> In the logs with log_level info I only get these messages on the exec
>> node:
>>
>> 05/31/2007 14:51:26|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal:
>> KILL
>> 05/31/2007 14:51:33|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal:
>> KILL
>>
>> now with qdel -f:
>> 05/31/2007 14:52:13|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal:
>> KILL
>> 05/31/2007 14:52:53|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal:
>> KILL
>> 05/31/2007 14:53:14|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal:
>> KILL
>>
>> I cannot tell if the SIGKILL is applied to the correct pid.
>
> We can try two things: what if you suspend the job - which processes
> are then in status "T"?
>
Actually, I don't see the job script being suspended. With htop I can
see that the script occasionally changes its status from R to D, but
this lasts only for a few seconds.
> Second: can you submit the jobs with -notify and catch the SIGUSR2
> with a trap command in the jobscript?
>
The script looks like this:
#---

trap 'fatal_error "Job has been terminated by the batch system" "TERM"' SIGTERM
trap 'fatal_error "Job has been terminated by the batch system" "INT"' SIGINT
trap 'fatal_error "Job has been terminated by the batch system" "QUIT"' SIGQUIT
trap 'fatal_error "Job has been terminated by the batch system" "ABRT"' SIGABRT
trap 'fatal_error "Job has been terminated by the batch system" "USR1"' SIGUSR1
trap 'fatal_error "Job has been terminated by the batch system" "USR2"' SIGUSR2

echo "OLA"

fatal_error() {
        echo "hi $1 $2"
}

sleep 120  &
wait $!

#---

Within the 120 seconds I delete the job, but the only output is "OLA",
so the script did not see any signal.
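
For comparison, a minimal sketch of how the same trap pattern can be
checked outside SGE by sending SIGUSR2 by hand. The temporary file
names, and the kill/exit lines added to fatal_error, are assumptions
made for this test and are not part of the job script above. Under SGE
the corresponding step would be to submit with qsub -notify, which
sends SIGUSR2 to the job some seconds before the SIGKILL.

```shell
#!/bin/bash
# Sketch: check outside of SGE that the trap pattern reacts to SIGUSR2.
# File names are hypothetical; nothing here talks to SGE itself.

job=$(mktemp "${TMPDIR:-/tmp}/trapjob.XXXXXX")
out=$(mktemp "${TMPDIR:-/tmp}/trapout.XXXXXX")

# Reduced copy of the job script (USR2 only).
cat > "$job" <<'EOF'
#!/bin/bash
fatal_error() {
        echo "hi $1 $2"
        kill "$child" 2>/dev/null   # addition: do not leave the sleep behind
        exit 1                      # addition: stop after the handler
}
trap 'fatal_error "Job has been terminated by the batch system" "USR2"' SIGUSR2
echo "OLA"
sleep 120 &
child=$!
wait "$child"
EOF

bash "$job" > "$out" 2>&1 &
pid=$!

# The traps are installed before "OLA" is printed, so wait for that line.
for i in 1 2 3 4 5 6 7 8 9 10; do
    grep -q OLA "$out" && break
    sleep 1
done

kill -USR2 "$pid"           # what -notify would deliver before SIGKILL
status=0
wait "$pid" || status=$?

result=$(cat "$out")
echo "$result"
echo "exit status: $status"
rm -f "$job" "$out"
```

If the handler line shows up here but not under SGE, the signal is
probably not reaching the job script's process at all, which would
point at the process group rather than at the script.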

I still suspect that my pag script confuses SGE. If you look again at
the PGIDs, they change after the PAG command and again after the
coshepherd command. The job scripts do not have the same PGID as the
shepherds or the PAG command.
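
To make the PGID point concrete, here is a small sketch of signalling a
whole process group through its negative PGID, which is what Reuti
describes with -19981. It assumes Linux with setsid(1) and a
procps-style ps(1); the stand-in "job" and the pid file are
hypothetical and only mimic the bash/sleep pair from the process list.

```shell
#!/bin/bash
# Sketch: signal a whole process group via its negative PGID.
# Assumes Linux with setsid(1) and a procps-style ps(1).

pidfile=$(mktemp "${TMPDIR:-/tmp}/pgdemo.XXXXXX")

# Stand-in for a job script: a shell that becomes session and group
# leader and has one child, the sleep.
setsid bash -c 'echo $$ > "$1"; sleep 300 & wait' _ "$pidfile" \
    > /dev/null 2>&1 &

# Wait until the stand-in has written its PID (which is also its PGID).
for i in 1 2 3 4 5 6 7 8 9 10; do
    [ -s "$pidfile" ] && break
    sleep 1
done
pgid=$(cat "$pidfile")

# Count the non-zombie members of a process group.
members() {
    ps -e -o pgid= -o stat= |
        awk -v g="$1" '$1 == g && $2 !~ /^Z/ { n++ } END { print n + 0 }'
}

sleep 1                     # let the stand-in start its sleep
before=$(members "$pgid")
kill -KILL -- "-$pgid"      # negative number: every process in the group
sleep 1
after=$(members "$pgid")

echo "group $pgid: $before processes before, $after after"
rm -f "$pidfile"
```

A process outside that group is not touched by the signal, however it
is related in the process tree.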

Duc

> -- Reuti
>
>
>> Cheers
>> Duc
>>
>> Reuti wrote:
>>> On 30.05.2007 at 10:27, Duc Bao Ta wrote:
>>>
>>>> Hi,
>>>>
>>>> I have read the postings about how qdel kills a job, but what process
>>>> does it kill, i.e. which process group does it kill?
>>>> My problem is that qdel does not delete the job; the job remains in
>>>> "dr" state. When I look at the process tree I can see the following (I
>>>> hope it is readable):
>>>>
>>>> USER  PPID   PID  PGID   SID TPGID STAT   UID COMMAND
>>>> root     1  3267  3267  2480    -1 S        0 /opt/sge/bin/lx24-x86/sge_execd
>>>> root  3267  3273  3267  2480    -1 S        0  \_ /bin/sh /opt/sge/util/loadsensor.sh
>>>> root  3267 19737  3267  2480    -1 S        0  \_ /bin/sh /opt/sge/util/pag -c exec /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>>> root 19737 19741 19741  2480    -1 S        0  |   \_ /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>>> root 19741 19754 19741  2480    -1 S        0  |       \_ /opt/sge/bin/lx24-x86/sge_coshepherd /opt/sge/util/set_token_cmd duc 86400
>>>> duc  19741 19981 19981 19981    -1 SNs   1025  |       \_ /bin/bash /opt/sge/sunfire/spool/silab03/job_scripts/281
>>>> duc  19981 19983 19981 19981    -1 SN    1025  |           \_ sleep 2222222
>>>> root  3267 21297  3267  2480    -1 S        0  \_ /bin/sh /opt/sge/util/pag -c exec /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>>> root 21297 21301 21301  2480    -1 S        0      \_ /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>>> root 21301 21314 21301  2480    -1 S        0          \_ /opt/sge/bin/lx24-x86/sge_coshepherd /opt/sge/util/set_token_cmd duc 86400
>>>> duc  21301 21698 21698 21698    -1 SNs   1025          \_ /bin/bash /opt/sge/sunfire/spool/silab03/job_scripts/294
>>>> duc  21698 21699 21698 21698    -1 SN    1025              \_ sleep 2222222
>>>>
>>>> There are two jobs, still running after a forced deletion as a manager
>>>
>>> The first job should be killed by signalling -19981 (the process
>>> group), hence the bash and the sleep. Can you check in the messages
>>> file of SGE in the spool directory for this node whether it was
>>> issued (maybe the loglevel has to be set to "loglevel log_info" in
>>> the SGE configuration)?
>>>
>>> Did you also try first a qdel without -f?
>>>
>>> -- Reuti
>>>
>>>
>>>> user. I am using the set_token_cmd and pag_cmd options to get my
>>>> Kerberos tickets and AFS tokens, so I rely on this job execution
>>>> scheme.
>>>> Basically
>>>>
>>>>
>>>> When I manually kill (SIGTERM and SIGKILL) the "job_scripts"
>>>> processes as root, the job terminates as desired (i.e. the epilog
>>>> script is executed). When I try to kill the set_token_cmd, nothing
>>>> happens. When I kill the "sge_shepherd -bg" process, the job
>>>> terminates directly without calling the epilog script.
>>>> Will the terminate method of the queue help here? Or should I
>>>> modify the set_token_cmd and pag_cmd scripts?
>>>>
>>>>
>>>> Cheers
>>>> Duc
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list