[GE users] qdel does not kill the job when using pag command?

Duc Bao Ta ta at physik.uni-bonn.de
Fri Jun 1 14:43:39 BST 2007



Reuti wrote:
> On 01.06.2007 at 11:28, Duc Bao Ta wrote:
>
>> Reuti wrote:
>>> On 31.05.2007 at 14:54, Duc Bao Ta wrote:
>>>
>>>> Hi Reuti,
>>>>
>>>> I always try to delete the jobs with qdel and if after that nothing
>>>> happens I try qdel -f.
>>>> In the logs with log_level info I only get these messages on the exec
>>>> node:
>>>>
>>>> 05/31/2007 14:51:26|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>> 05/31/2007 14:51:33|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>>
>>>> now with qdel -f:
>>>> 05/31/2007 14:52:13|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>> 05/31/2007 14:52:53|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>> 05/31/2007 14:53:14|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>>
>>>> I cannot tell if the SIGKILL is applied to the correct pid.
>>>
>>> We can try two things: what if you suspend the job - which processes
>>> are then in status "T"?
>>>
>> Actually, I don't see that the job script is suspended. With htop I
>> can see that the script occasionally changes its status from R to D. This
>> lasts only for a few seconds.
>
> D is uninterruptible sleep, usually waiting on I/O, and it counts as
> running with regard to the load of a machine.
>
>>> Second: can you submit the jobs with -notify and catch the SIGUSR2
>>> with a trap command in the jobscript?
>>>
>> The script looks like this:
>> #---
>>
>> trap 'fatal_error "Job has been terminated by the batch system" "TERM"' SIGTERM
>> trap 'fatal_error "Job has been terminated by the batch system" "INT"' SIGINT
>> trap 'fatal_error "Job has been terminated by the batch system" "QUIT"' SIGQUIT
>> trap 'fatal_error "Job has been terminated by the batch system" "ABRT"' SIGABRT
>> trap 'fatal_error "Job has been terminated by the batch system" "USR1"' SIGUSR1
>> trap 'fatal_error "Job has been terminated by the batch system" "USR2"' SIGUSR2
>>
>> echo "OLA"
>>
>> fatal_error() {
>>         echo "hi $1 $2"
>> }
>>
>> sleep 120  &
>> wait $!
>
> Is the result the same if you put a simple
>
> sleep 120
>
> there? As mentioned, using & in SGE jobs is always a little bit
> unfavorable.
>
>>
>> #---
>>
>> within the 120 seconds I delete the job, but the output is only "OLA",
>> so no signal was seen by the script.
>>
>> I still suspect that my pag script confuses SGE. If you look again at the
>> PGIDs, they change after the PAG command and again after the coshepherd
>> command. The job scripts do not have the same PGID as the shepherds and
>> as the PAG command.
>
> Yes, this might be the source of the behavior; as we don't use it, I
> can't comment on it :-(
>
> -- Reuti
>
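
As a side note on the quoted test script: the "sleep ... &" plus "wait"
pattern matters because a shell only runs a trap handler once the current
foreground command has finished, whereas "wait" returns early when a trapped
signal arrives. Below is a minimal sketch of that pattern that can be run
outside SGE. The background "kill -USR2" is only a stand-in for the batch
system, the variable names caught, status and sleep_pid are made up for
illustration, and the signal is named USR2 (without the SIG prefix) so that
plain /bin/sh accepts it as well.

```shell
#!/bin/sh
# Handler is defined before the trap is installed.
fatal_error() {
    echo "hi $1 $2"
    caught=$2            # remember which signal was seen
}
trap 'fatal_error "Job has been terminated by the batch system" "USR2"' USR2

echo "OLA"

# Stand-in for the batch system: signal this shell after one second.
( sleep 1; kill -USR2 $$ ) &

# Run the payload in the background and wait for it; wait is interrupted
# by the trapped signal and returns a status greater than 128.
sleep 5 &
sleep_pid=$!
status=0
wait "$sleep_pid" || status=$?
echo "wait returned $status"

# Clean up the payload if it is still running.
kill "$sleep_pid" 2>/dev/null || true
```

If the handler line never shows up when the same script runs as a job, the
signal most likely never reached the job script's process at all.
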
I have set the pag_cmd variable to /usr/bin/pagsh instead of my script
and now I can delete the job. If I again look at the process tree for
jobs with and without my pag script I can see a difference:

 PPID   PID  PGID   SID
    1  3267  3267  2480    0 /opt/sge/bin/lx24-x86/sge_execd
--with my pag script:
 3267 30673  3267  2480    0 \_ /bin/sh /opt/sge/util/pag -c exec /opt/sge/bin/lx24-x86/sge_shepherd -bg
30673 30680 30680  2480    0 |   \_ /opt/sge/bin/lx24-x86/sge_shepherd -bg
30680 30700 30680  2480    0 |       \_ /opt/sge/bin/lx24-x86/sge_coshepherd /opt/sge/util/set_token_cmd duc 86400
30680 30718 30718 30718 1025 |       \_ /bin/bash /opt/sge/sunfire/spool/silab03/job_scripts/537
30718 30719 30718 30718 1025 |           \_ sleep 2222222
--with /usr/bin/pagsh:
 3267 31164 31164  2480    0 \_ /opt/sge/bin/lx24-x86/sge_shepherd -bg
31164 31177 31164  2480    0     \_ /opt/sge/bin/lx24-x86/sge_coshepherd /opt/sge/util/set_token_cmd duc 86400
31164 31188 31188 31188 1025     \_ /bin/bash /opt/sge/sunfire/spool/silab03/job_scripts/538
31188 31189 31188 31188 1025         \_ sleep 2222222
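
For anyone who wants to compare the same IDs on their own node, here is a
small sketch that prints PPID, PID, PGID and SID for a single process. It
assumes the Linux procps version of ps, and the helper name show_ids is
made up for illustration.

```shell
#!/bin/sh
# Print PPID, PID, PGID and SID (no header) for the given PID,
# defaulting to the current shell.
show_ids() {
    ps -o ppid= -o pid= -o pgid= -o sid= -p "${1:-$$}"
}

show_ids "$$"
```

Calling show_ids with the PID of the pag process, the shepherd and the job
script in turn gives the values needed to compare the two cases.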

The pag script (PID 30673) seems to have the same PGID as the execd.
Maybe this is the source of my problem. Does anyone know how to write
a pag script that results in the same PGID structure as a job that
uses /usr/bin/pagsh?
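
One direction that might be worth trying, as an untested sketch only: let
the pag script replace itself with the command it is handed by using exec,
so that no extra /bin/sh stays behind between the execd and the shepherd.
The sketch assumes the script is called as pag -c "<command line>", which
is what the process listing suggests. The function name pag_wrapper is
hypothetical, and the site-specific PAG or token setup is deliberately left
out.

```shell
#!/bin/sh
# Hypothetical pag wrapper sketch, assuming an invocation of the form
#   pag -c "<command line>"
pag_wrapper() {
    if [ "$1" != "-c" ] || [ -z "$2" ]; then
        echo "usage: pag -c 'command line'" >&2
        return 2
    fi

    # Site-specific PAG/token setup would go here (not shown).

    # exec replaces the current shell process with the new one, so the
    # command keeps this PID instead of running as a child of the wrapper.
    exec /bin/sh -c "$2"
}

# Only act when arguments were actually passed to the script.
if [ "$#" -gt 0 ]; then
    pag_wrapper "$@"
fi
```

Whether this also yields the same PGID layout as /usr/bin/pagsh would still
need to be checked on a real exec node.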

Duc
