[GE users] problem of qdel and parallel running in SGE

jenny lulh at genomics.org.cn
Fri Aug 20 11:23:21 BST 2010



Yes. Because the node was reinstalled, the execd could not find the jobs in its spool directory on the local disk when it started, and the qmaster did not remove the jobs either.

08/20/2010 16:39:04|  main|compute-0-43|I|starting up GE 6.2u4 (lx26-amd64)
08/20/2010 16:39:25|  main|compute-0-43|E|received task belongs to job 228973 but this job is not here
08/20/2010 16:39:25|  main|compute-0-43|E|received task belongs to job 228973 but this job is not here
08/20/2010 16:39:25|  main|compute-0-43|E|acknowledge for unknown job 228973.1/master
08/20/2010 16:39:25|  main|compute-0-43|E|can't find active jobs directory "active_jobs/228973.1" for reaping job 228973
08/20/2010 16:39:25|  main|compute-0-43|E|ERROR: unlinking "jobs/00/0022/8973.1": No such file or directory
08/20/2010 16:39:25|  main|compute-0-43|E|can not remove file job spool file: jobs/00/0022/8973.1
08/20/2010 16:39:25|  main|compute-0-43|E|can't remove directory "active_jobs/228973.1": opendir(active_jobs/228973.1) failed: No such file or directory


2010-08-20
________________________________
Jenny_Lu
lulh at genomics.org.cn
Tel:075525273811
Mobile:15986782583  62583
________________________________
From: reuti
Sent: 2010-08-20  18:17:54
To: users
Cc:
Subject: Re: [GE users] problem of qdel and parallel running in SGE
On 20.08.2010 at 11:05, jenny wrote:
> I ran into a problem: after reinstalling the compute nodes, the jobs that had been running on them did not disappear from qstat.
>
> # qstat -u "*" | grep 0-29
>  127153 0.85000 yri.yh.170 a  r     08/11/2010 03:23:37 all.q@compute-0-29.local     1
>  127179 0.85000 yri.yh.194 a  r     08/11/2010 03:50:37 all.q@compute-0-29.local     1
>  215868 0.26138 chr2.sh    a  r     08/19/2010 09:08:45 all.q@compute-0-29.local     1
>  216640 0.26069 job1.sh    b  r     08/19/2010 10:54:45 all.q@compute-0-29.local     1
Yep, SGE thinks it's a network problem and waits for the execds. As you reinstalled the nodes, maybe they are not aware that there was something running before.
$ qdel -f 127153
should help. (But this is only a last resort. Normally the execds should check the list of jobs when they come up again on the nodes against the jobs the qmaster knows about.)
-- Reuti
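
If several jobs are stuck on the same reinstalled node, the approach above can be scripted. The following is only a minimal sketch, not something from this thread: the function names jobs_on_host and print_force_deletes are hypothetical, and it assumes that the qstat output shows the queue instance as queue@host without truncation. It only prints the qdel -f commands (a dry run), so they can be reviewed before anything is deleted.

```shell
#!/bin/sh
# Read `qstat -u "*"` output on stdin and print the IDs of the jobs whose
# queue instance (queue@host) is on the given host. Hypothetical helper.
jobs_on_host() {
    awk -v host="$1" '
        $1 ~ /^[0-9]+$/ {                      # skip header and separator lines
            for (i = 2; i <= NF; i++) {
                n = split($i, part, "@")       # look for the queue@host field
                if (n == 2 && part[2] == host) { print $1; break }
            }
        }' | sort -un                          # array tasks repeat the job ID
}

# Dry run: print one "qdel -f <job-ID>" line per stuck job, execute nothing.
print_force_deletes() {
    jobs_on_host "$1" | while read -r id; do
        echo "qdel -f $id"
    done
}

# Possible usage (host name taken from the qstat output quoted above):
#   qstat -u "*" | print_force_deletes compute-0-29.local
```

The host name is compared exactly, so compute-0-2.local does not match compute-0-29.local as a plain grep on "0-29" style patterns might. Since qdel -f is a last resort, reviewing the printed commands before running them seems the safer choice.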
>  2010-08-20
> Jenny_Lu
> lulh at genomics.org.cn
> Tel:075525273811
> Mobile:15986782583  62583
> From: reuti
> Sent: 2010-08-18  15:40:03
> To: users
> Cc:
> Subject: Re: [GE users] problem of qdel and parallel running in SGE
> Hi,
> On 18.08.2010 at 07:05, mrostaee wrote:
> > Sometimes after I qdel a job, processes of that job keep running. In other words, qdel did not complete: not all processes of the deleted job were killed.
> >
> > If I want to detect this situation and fix it automatically, how can I do it?
> >
> > Is there any script for this problem?
> This sounds as if you don't have a so-called Tight Integration of your parallel jobs into SGE:
> http://gridengine.info/2005/09/19/parallel-environments-pes-loose-vs-tight-integration
> There are some Howto's for most common parallel libraries which are also mentioned in the above link:
> http://gridengine.sunsource.net/howto/howto.html#Tight%20Integration%20of%20Parallel%20Libraries
> -- Reuti
> > Thx
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=275091
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=275132
> __________ Information from ESET NOD32 Antivirus, version of virus signature database 5374 (20100817) __________
> The message was checked by ESET NOD32 Antivirus.
> http://www.eset.com
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=275649


