[GE users] qdel does not kill the job when using pag command?

Reuti reuti at staff.uni-marburg.de
Fri Jun 1 12:36:10 BST 2007


On 01.06.2007 at 11:28, Duc Bao Ta wrote:

> Reuti wrote:
>> On 31.05.2007 at 14:54, Duc Bao Ta wrote:
>>
>>> Hi Reuti,
>>>
>>> I always try to delete the jobs with qdel and if after that nothing
>>> happens I try qdel -f.
>>> In the logs with log_level info I only get these messages on the exec
>>> node:
>>>
>>> 05/31/2007 14:51:26|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>> 05/31/2007 14:51:33|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>
>>> now with qdel -f:
>>> 05/31/2007 14:52:13|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>> 05/31/2007 14:52:53|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>> 05/31/2007 14:53:14|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>>>
>>> I cannot tell if the SIGKILL is applied to the correct pid.
>>
>> We can try two things: what if you suspend the job - which processes
>> are then in status "T"?
>>
> Actually I don't see that the job script is suspended. With htop I
> can see that the script occasionally changes its status from R to D.
> This lasts only for a few seconds.

D is uninterruptible sleep, usually waiting for I/O, and it counts as
running with regard to the load of a machine.
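
To see which state the processes of a job are in without watching htop, something like the following could be used. It is only a sketch: the helper name show_state is made up for illustration, and it assumes a ps that accepts the -o and -p options (as procps on Linux does). Here it inspects the current shell; on the exec node you would pass the PID of the job script instead.

```shell
#!/bin/bash
# Print PID, process group ID and state of one process.
# show_state is a hypothetical helper name, not part of SGE.
show_state() {
    ps -o pid=,pgid=,stat= -p "$1"
}

# State letters: R running, S sleeping, D uninterruptible sleep
# (usually I/O), T stopped (what a suspended job should show).
state=$(ps -o stat= -p "$$" | tr -d ' ')
echo "this shell is in state: $state"
show_state "$$"
```

After a suspend, the first letter of the STAT column should be T for the processes that actually received the stop signal.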

>> Second: can you submit the jobs with -notify and catch the SIGUSR2
>> with a trap command in the jobscript?
>>
> The script looks like this:
> #---
>
> trap 'fatal_error "Job has been terminated by the batch system" "TERM"' SIGTERM
> trap 'fatal_error "Job has been terminated by the batch system" "INT"' SIGINT
> trap 'fatal_error "Job has been terminated by the batch system" "QUIT"' SIGQUIT
> trap 'fatal_error "Job has been terminated by the batch system" "ABRT"' SIGABRT
> trap 'fatal_error "Job has been terminated by the batch system" "USR1"' SIGUSR1
> trap 'fatal_error "Job has been terminated by the batch system" "USR2"' SIGUSR2
>
> echo "OLA"
>
> fatal_error() {
>         echo "hi $1 $2"
> }
>
> sleep 120  &
> wait $!

Do you get the same result if you put a simple

sleep 120

there instead? As mentioned, using & in SGE jobs is always a little
problematic.
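
To check outside of SGE that the trap itself works, the script can send the signal to itself. The following is a minimal sketch, not a tested SGE job script: the handler name on_notify and the variable names are invented for the illustration, and the self-sent SIGUSR2 only stands in for the warning signal that SGE delivers ahead of SIGKILL when the job was submitted with -notify. It uses the background sleep plus wait from the script above, because bash runs a trap only after a foreground command has finished, whereas the wait builtin returns as soon as a trapped signal arrives.

```shell
#!/bin/bash
caught=none

# Hypothetical handler; it only records which signal arrived.
on_notify() {
    caught=$1
    echo "caught SIG$1"
}
trap 'on_notify USR1' USR1
trap 'on_notify USR2' USR2

# Stand-in for the batch system: signal this script after one second.
( sleep 1; kill -USR2 "$$" ) &

# Stand-in for the real work of the job.
sleep 5 &
work_pid=$!

# wait is interrupted by the trapped signal and returns a status > 128.
status=0
wait "$work_pid" || status=$?
echo "wait returned $status, caught=$caught"

# Clean up the stand-in work process.
kill "$work_pid" 2>/dev/null
wait "$work_pid" 2>/dev/null || true
```

If this prints the "caught" line when run by hand but the same script prints nothing under SGE, the signal is most likely not reaching the script's process at all.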

>
> #---
>
> within the 120 seconds I delete the job, but the output is only "OLA",
> so no signal was seen by the script.
>
> I still suspect that my pag script confuses SGE. If you look again at
> the PGIDs, they change after the PAG command and again after the
> coshepherd command. The job scripts do not have the same PGID as the
> shepherds and the PAG command.

Yes, this might be the source of the behavior; as we don't use it, I
can't comment on it :-(

-- Reuti


> Duc
>
>> -- Reuti
>>
>>
>>> Cheers
>>> Duc
>>>
>>> Reuti wrote:
>>>> On 30.05.2007 at 10:27, Duc Bao Ta wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have read the postings about how qdel kills a job, but which
>>>>> process does it kill, i.e. which process group does it kill?
>>>>> My problem is that qdel does not delete the job; the job remains
>>>>> in "dr" state. When I look at the process tree I can see the
>>>>> following (I hope it is readable):
>>>>>
>>>>> USER  PPID   PID  PGID   SID TPGID STAT   UID COMMAND
>>>>> root     1  3267  3267  2480    -1 S        0 /opt/sge/bin/lx24-x86/sge_execd
>>>>> root  3267  3273  3267  2480    -1 S        0  \_ /bin/sh /opt/sge/util/loadsensor.sh
>>>>> root  3267 19737  3267  2480    -1 S        0  \_ /bin/sh /opt/sge/util/pag -c exec /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>>>> root 19737 19741 19741  2480    -1 S        0  |   \_ /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>>>> root 19741 19754 19741  2480    -1 S        0  |       \_ /opt/sge/bin/lx24-x86/sge_coshepherd /opt/sge/util/set_token_cmd duc 86400
>>>>> duc  19741 19981 19981 19981    -1 SNs   1025  |       \_ /bin/bash /opt/sge/sunfire/spool/silab03/job_scripts/281
>>>>> duc  19981 19983 19981 19981    -1 SN    1025  |           \_ sleep 2222222
>>>>> root  3267 21297  3267  2480    -1 S        0  \_ /bin/sh /opt/sge/util/pag -c exec /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>>>> root 21297 21301 21301  2480    -1 S        0      \_ /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>>>> root 21301 21314 21301  2480    -1 S        0          \_ /opt/sge/bin/lx24-x86/sge_coshepherd /opt/sge/util/set_token_cmd duc 86400
>>>>> duc  21301 21698 21698 21698    -1 SNs   1025          \_ /bin/bash /opt/sge/sunfire/spool/silab03/job_scripts/294
>>>>> duc  21698 21699 21698 21698    -1 SN    1025              \_ sleep 2222222
>>>>>
>>>>> There are two jobs, still running after a forced deletion as a manager
>>>>
>>>> The first job should be killed by sending the signal to -19981, i.e.
>>>> to the process group containing the bash and the sleep. Can you check
>>>> in the messages file of SGE in the spool directory for this node
>>>> whether it was issued? (Maybe the loglevel has to be set to
>>>> "loglevel log_info" in the SGE configuration.)
>>>>
>>>> Did you also first try a qdel without -f?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> user. I am using the set_token_cmd and pag_cmd options to get my
>>>>> kerberos tickets and afs tokens, so I rely on this job execution
>>>>> scheme.
>>>>> Basically
>>>>>
>>>>>
>>>>> When I manually kill the "job_scripts" processes as root (SIGTERM
>>>>> and SIGKILL), the jobs terminate as desired (i.e. the epilog script
>>>>> is executed). When I try to kill the set_token_cmd, nothing happens.
>>>>> When I kill the "sge_shepherd -bg" process, the jobs terminate
>>>>> directly without calling the epilog script.
>>>>> Will the terminate method of the queue help here? Or should I
>>>>> modify the set_token_cmd and pag_cmd scripts?
>>>>>
>>>>>
>>>>> Cheers
>>>>> Duc
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net



