[GE users] qdel does not kill the job when using pag command?

Duc Bao Ta ta at physik.uni-bonn.de
Fri Jun 1 14:00:08 BST 2007



Reuti wrote:
> On 01.06.2007 at 11:28, Duc Bao Ta wrote:
>
>> Reuti wrote:
>>> On 31.05.2007 at 14:54, Duc Bao Ta wrote:
>>>
>>>> Hi Reuti,
>>>>
>>>> I always try to delete the jobs with qdel and if after that nothing
>>>> happens I try qdel -f.
>>>> In the logs with log_level info I only get these messages on the exec
>>>> node:
>>>>
>>>> 05/31/2007 14:51:26|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>> 05/31/2007 14:51:33|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>>
>>>> now with qdel -f:
>>>> 05/31/2007 14:52:13|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>> 05/31/2007 14:52:53|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>> 05/31/2007 14:53:14|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>>
>>>> I cannot tell if the SIGKILL is applied to the correct pid.
>>>
>>> We can try two things: what if you suspend the job - which processes
>>> are then in status "T"?
>>>
>> Actually, I don't see the job script being suspended. With htop I can
>> see that the script occasionally changes its status from R to D, but
>> this lasts only a few seconds.
>
> D means uninterruptible sleep, usually waiting for I/O, and counts as
> running regarding the load of a machine.
>
>>> Second: can you submit the jobs with -notify and catch the SIGUSR2
>>> with a trap command in the jobscript?
>>>
>> The script looks like this:
>> #---
>>
>> trap 'fatal_error "Job has been terminated by the batch system" "TERM"' SIGTERM
>> trap 'fatal_error "Job has been terminated by the batch system" "INT"' SIGINT
>> trap 'fatal_error "Job has been terminated by the batch system" "QUIT"' SIGQUIT
>> trap 'fatal_error "Job has been terminated by the batch system" "ABRT"' SIGABRT
>> trap 'fatal_error "Job has been terminated by the batch system" "USR1"' SIGUSR1
>> trap 'fatal_error "Job has been terminated by the batch system" "USR2"' SIGUSR2
>>
>> echo "OLA"
>>
>> fatal_error() {
>>         echo "hi $1 $2"
>> }
>>
>> sleep 120  &
>> wait $!
>
> Do you get the same result if you put a simple
>
> sleep 120
>
> there? As mentioned, using & in SGE jobs is always a little bit
> unfavorable.
>

Sorry, I read that somewhere, but I forgot to remove the &. Still,
without "&" I get the same behaviour.
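
One way to check the script side in isolation is a small self-test. This is only a sketch and not SGE-specific: the script sends SIGUSR2 to itself, whereas with -notify the signal would come from the batch system, and the names on_usr2, caught and wait_status are made up for this test.

```bash
#!/bin/bash
# Self-test for the trap/wait pattern. The signal is sent by the script
# itself; this stands in for the batch system and is an assumption of
# the sketch.

caught=""
wait_status=0

on_usr2() {
    caught="USR2"
    echo "caught USR2"
}
trap on_usr2 USR2

sleep 5 &
child=$!

# Helper subshell: deliver SIGUSR2 to this shell after one second.
# $$ still expands to the main shell's PID inside the subshell.
( sleep 1; kill -USR2 $$ ) &

# wait is interrupted by the trapped signal and returns a status > 128;
# the handler runs right after wait returns.
wait "$child" || wait_status=$?

# Clean up the background sleep that is still running.
kill "$child" 2>/dev/null || true
wait "$child" 2>/dev/null || true

echo "wait status: $wait_status, caught: ${caught:-nothing}"
```

If this reports the signal when run by hand, that would suggest the trap and the "&"/wait pattern are fine and the signal simply never reaches the job script. With a plain foreground "sleep 120", bash runs the trap only after sleep has returned.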
>>
>> #---
>>
>> within the 120 seconds I delete the job, but the output is only "OLA",
>> so no signal was seen by the script.
>>
>> I still suspect my pag script of confusing SGE. If you look again at the
>> PGIDs, they change after the PAG command and again after the coshepherd
>> command. The job scripts do not have the same PGID as the shepherds or
>> the PAG command.
>
> Yes, this might be the source of the behavior; as we don't use it, I
> can't comment on it :-(
>
My pag script is simply this:
#--
export KRB5CCNAME=/tmp/ticket.`echo $PWD | awk -F / '{ print $NF}'`
/usr/bin/pagsh -c "$2"
#--

I need this to advertise the Kerberos ticket name in the job; $2
contains "shepard -bg" to start the shepherd for the job.
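
Assuming the KILL is addressed to a process group, which the changing PGIDs suggest but the logs do not confirm, a process that ended up in a different group would never see it. The sketch below only illustrates that effect with two stand-in sleeps; job_a, job_b, status_a and b_alive are hypothetical names, and nothing here is SGE code.

```bash
#!/bin/bash
# Sketch: a signal addressed to one process group does not reach a
# process in another group. The two sleeps are stand-ins, not SGE
# processes.

set -m   # job control on: every background job becomes its own process group

sleep 30 &
job_a=$!     # group leader, so its PGID equals its PID
sleep 30 &
job_b=$!

status_a=0
# A negative PID makes kill address the whole process group.
kill -KILL -- "-$job_a"
wait "$job_a" || status_a=$?

# job_b lives in a different group and should be untouched.
if kill -0 "$job_b" 2>/dev/null; then
    b_alive=yes
else
    b_alive=no
fi
echo "job_a exit status: $status_a, job_b alive: $b_alive"

# Clean up the survivor.
kill -KILL -- "-$job_b"
wait "$job_b" || true
```

On the exec node the same question can be asked of the real processes by comparing the PGID of the job script with the one of the shepherd, for example in the PGID column of ps or htop.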

I am out of ideas now,

Duc
> -- Reuti
