[GE users] qdel does not kill the job when using pag command?

Reuti reuti at staff.uni-marburg.de
Thu May 31 23:39:45 BST 2007


On 31.05.2007 at 14:54, Duc Bao Ta wrote:

> Hi Reuti,
>
> I always try to delete the jobs with qdel first, and if nothing happens
> after that I try qdel -f.
> In the logs with log_level info I only get these messages on the exec node:
>
> 05/31/2007 14:51:26|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
> 05/31/2007 14:51:33|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>
> Now with qdel -f:
> 05/31/2007 14:52:13|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
> 05/31/2007 14:52:53|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
> 05/31/2007 14:53:14|execd|silab03|I|SIGNAL jid: 472 jatask: 1 signal: KILL
>
> I cannot tell if the SIGKILL is applied to the correct pid.

We can try two things. First: what happens if you suspend the job, i.e.
which processes are then in state "T"?
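
A minimal sketch for the suspend test, assuming a Linux exec node with a procps-style ps. It lists the processes in state "T" together with their PGID, so one can see which process group actually received the stop signal. The function names filter_stopped and list_stopped are made up for this example.

```shell
#!/bin/sh
# Input columns (from ps below): PID PGID STAT USER COMMAND...

# Keep only rows whose STAT column starts with "T" (stopped).
# Optional first argument: restrict the output to one user name.
filter_stopped() {
    awk -v user="$1" '$3 ~ /^T/ && (user == "" || $4 == user)'
}

# List stopped processes on this host; the "=" suppresses the header line.
list_stopped() {
    ps -e -o pid= -o pgid= -o stat= -o user= -o args= | filter_stopped "$1"
}
```

After suspending the job (for example with qmod -sj 472), running "list_stopped duc" on silab03 should show the job's bash and sleep if the signal reached their process group. An empty listing would suggest the signal went somewhere else.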

Second: can you submit the jobs with -notify and catch the SIGUSR2  
with a trap command in the jobscript?
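
A jobscript along these lines could be used for the second test. It is only a sketch: the "#$ -notify" line is meant as the embedded equivalent of passing -notify to qsub, and the names on_usr2, run_and_wait and NOTIFIED are invented for this example. With -notify, SGE is documented to send SIGUSR2 some seconds before the SIGKILL, so the trap shows whether the signal arrives at the jobscript at all.

```shell
#!/bin/bash
#$ -notify

NOTIFIED=0

# Handler for the warning signal that precedes SIGKILL under -notify.
on_usr2() {
    NOTIFIED=1
    echo "SIGUSR2 received by pid $$ at $(date '+%m/%d/%Y %H:%M:%S')"
}
trap on_usr2 USR2

# Run a command in the background and wait for it. "wait" returns early
# when a trapped signal arrives, so loop until the child is really gone.
run_and_wait() {
    "$@" &
    child=$!
    while kill -0 "$child" 2>/dev/null; do
        wait "$child"
    done
}

# Start the long-running workload only inside an SGE job
# (JOB_ID is set by SGE in the job's environment).
if [ -n "${JOB_ID:-}" ]; then
    run_and_wait sleep 2222222
fi
```

If the "SIGUSR2 received" line shows up in the job's output file after a qdel, the notification reached the jobscript's process, which would suggest that the later KILL is aimed at the same place. If nothing is logged, the signals are probably going to a different process or process group.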

-- Reuti


> Cheers
> Duc
>
> Reuti wrote:
>> On 30.05.2007 at 10:27, Duc Bao Ta wrote:
>>
>>> Hi,
>>>
>>> I have read the postings about how qdel kills a job, but which process
>>> does it kill, i.e. which process group does it kill?
>>> My problem is that qdel does not delete the job; the job remains in
>>> "dr" state. When I look at the process tree I can see the following (I
>>> hope it is readable):
>>>
>>> USER  PPID   PID  PGID   SID TPGID STAT   UID COMMAND
>>> root     1  3267  3267  2480    -1 S        0 /opt/sge/bin/lx24-x86/sge_execd
>>> root  3267  3273  3267  2480    -1 S        0  \_ /bin/sh /opt/sge/util/loadsensor.sh
>>> root  3267 19737  3267  2480    -1 S        0  \_ /bin/sh /opt/sge/util/pag -c exec /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>> root 19737 19741 19741  2480    -1 S        0  |   \_ /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>> root 19741 19754 19741  2480    -1 S        0  |       \_ /opt/sge/bin/lx24-x86/sge_coshepherd /opt/sge/util/set_token_cmd duc 86400
>>> duc  19741 19981 19981 19981    -1 SNs   1025  |       \_ /bin/bash /opt/sge/sunfire/spool/silab03/job_scripts/281
>>> duc  19981 19983 19981 19981    -1 SN    1025  |           \_ sleep 2222222
>>> root  3267 21297  3267  2480    -1 S        0  \_ /bin/sh /opt/sge/util/pag -c exec /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>> root 21297 21301 21301  2480    -1 S        0      \_ /opt/sge/bin/lx24-x86/sge_shepherd -bg
>>> root 21301 21314 21301  2480    -1 S        0          \_ /opt/sge/bin/lx24-x86/sge_coshepherd /opt/sge/util/set_token_cmd duc 86400
>>> duc  21301 21698 21698 21698    -1 SNs   1025          \_ /bin/bash /opt/sge/sunfire/spool/silab03/job_scripts/294
>>> duc  21698 21699 21698 21698    -1 SN    1025              \_ sleep 2222222
>>>
>>> There are two jobs, still running after a forced deletion as a  
>>> manager
>>
>> The first job should be killed with -19981 (the negative process group
>> ID), which covers the bash and the sleep. Can you check in the SGE
>> messages file in the spool directory for this node whether the signal
>> was issued (you may have to set "loglevel log_info" in the SGE
>> configuration)?
>>
>> Did you also try a qdel without -f first?
>>
>> -- Reuti
>>
>>
>>> user. I am using the set_token_cmd and pag_cmd options to get my
>>> Kerberos tickets and AFS tokens, so I rely on this job execution scheme.
>>> Basically
>>>
>>>
>>> When I manually kill (SIGTERM and SIGKILL) the "job_scripts" processes
>>> as root, the job terminates as desired (i.e. the epilog script is
>>> executed). When I try to kill the set_token_cmd, nothing happens. When I
>>> kill the "sge_shepherd -bg" process, the job terminates immediately
>>> without calling the epilog script.
>>> Will the terminate method of the queue help here? Or should I modify the
>>> set_token_cmd and pag_cmd scripts?
>>>
>>>
>>> Cheers
>>> Duc
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net



