[GE users] qmake: error: executing task of job: failed sending task to execd: can't find connection

craffi dag at sonsorol.org
Wed Nov 10 01:28:06 GMT 2010


Cool to see Illumina pipeline users popping up here ...

We just spent a bunch of time debugging the same issue. There are a ton 
of knobs to fiddle with, but perhaps you are at the same point we were: 
we did a bunch of SGE tuning but still had random, occasional failures 
very close to what you describe.

The solution below was found by Chris Smith of 
http://distributedbio.com/ (hugely humorous if anyone here knows Chris 
and knows the company he used to work for ...)

From Chris Smith's notes, edited only to remove the name of the customer:

> The problem was the wrapper script for qmake. The CWD wasn't being propagated properly, so I changed:
>
> qmake -v PATH -inherit --recursive -- ALIGN=YES
>
> to
>
> qmake -v PATH -cwd -inherit -- ALIGN=YES
>
> and "voila" it worked. :-) What's funny is that it took me all day to try this switch. I got a hint because interactive qmake did work properly.
>
> I also believe the '--recursive' option is a throwback to an older version of the pipeline (i.e. it's not a proper option to qmake). I guess people used to invoke the whole Firecrest/Bustard/Gerald chain with a target named "recursive", so your qmake would be:
>
> qmake -v PATH -inherit -- recursive
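
For anyone patching their own wrapper, here is a minimal sketch of what the fixed invocation might look like inside a shell function. The function name run_qmake and the DRY_RUN variable are made up for illustration and are not part of the pipeline or of SGE; the qmake flags are the ones from Chris Smith's notes above.

```shell
#!/bin/sh
# Hypothetical wrapper sketch. run_qmake and DRY_RUN are illustrative names,
# not taken from the Illumina pipeline or from SGE itself.

# Run qmake with the working directory propagated to the tasks.
# With DRY_RUN=1 the command line is only printed, not executed.
# All arguments are handed to make after the "--" separator.
run_qmake() {
    # -v PATH   export PATH to the remote tasks
    # -cwd      run tasks from the current working directory (the fix above)
    # -inherit  run inside the already-granted parallel job
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo qmake -v PATH -cwd -inherit -- "$@"
    else
        qmake -v PATH -cwd -inherit -- "$@"
    fi
}
```

Calling it as `DRY_RUN=1 run_qmake ALIGN=YES` prints the command line without submitting anything, which is handy for checking that -cwd really made it into the wrapper. A real wrapper would presumably end with `run_qmake "$@"`.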



Hope this helps.


- Chris dag






jbury wrote:
> SGE Version: 6.2u6
> SGE Component: qmake
> Application/Workload: Solexa/Illumina Pipeline
>
> Description:
>
> Solexa uses qmake to farm out hundreds of tasks to the grid.  Every so often a job will encounter a communication error with the execution host's sge_execd daemon.
>
> Below is an example of a failed job's error:
>
> error: executing task of job 7692204 failed: failed sending task to execd at hostname: can't find connection
> qmake[1]: *** [Phasing/s_1_01_phasing.txt] Error 1
> qmake[1]: *** Waiting for unfinished jobs....
>
> NOTE: The accounting file for the job does not indicate a failure, but the exit status is "2".  Also, the messages file does not indicate that any of the job's tasks failed; it actually logs "finished" for all tasks.
>
> [tmp]$ qacct -j 7692204
> ==============================================================
> qname        default.q
> hostname     hostname
> group        solexa
> owner        solexa
> project      9614
> department   defaultdepartment
> jobname      qmake.sh
> jobnumber    7692204
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Mon Nov  8 08:00:11 2010
> start_time   Mon Nov  8 08:00:16 2010
> end_time     Mon Nov  8 09:46:42 2010
> granted_pe   make
> slots        8
> failed       0
> exit_status  2
> ru_wallclock 6386
> ru_utime     11290.803
> ru_stime     1202.871
> ru_maxrss    0
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    209475448
> ru_majflt    147
> ru_nswap     0
> ru_inblock   0
> ru_oublock   0
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     3474688
> ru_nivcsw    299076
> cpu          12493.674
> mem          928.143
> io           133.390
> iow          0.000
> maxvmem      831.389G
> arid         undefined
> [tmp]$
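
The "failed 0" with "exit_status 2" combination in the qacct output above is easy to miss, so here is a rough sketch of a filter that flags it. It only assumes the field names and ordering shown in that output (jobnumber, then failed, then exit_status); the function name flag_silent_failures is made up for illustration.

```shell
#!/bin/sh
# Sketch only: reads `qacct -j`-style output on stdin and prints one line
# for every record where failed is 0 but exit_status is non-zero.
# flag_silent_failures is an illustrative name, not an SGE tool.
flag_silent_failures() {
    awk '
        $1 == "jobnumber"   { job = $2 }      # remember the current job id
        $1 == "failed"      { failed = $2 }   # SGE-level failure code
        $1 == "exit_status" {
            # exit_status follows failed within each record
            if (failed == 0 && $2 != 0)
                print "job " job ": failed=" failed " exit_status=" $2
        }
    '
}
```

Usage would be something like `qacct -j 7692204 | flag_silent_failures`. For the record above it should report the job, since SGE saw no failure but qmake itself exited with 2.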
>
> 11/08/2010 08:00:26|worker|master_hostname|I|task 1.hostname1 at hostname1 of job 7692204.1 finished
> 11/08/2010 08:00:26|worker|master_hostname|I|task 1.hostname2 at hostname2 of job 7692204.1 finished
> 				.
> 				.
> 11/08/2010 09:46:40|worker|master_hostname|I|task 903.hostname3 at hostname3 of job 7692204.1 finished
> 11/08/2010 09:46:44|worker|master_hostname|I|removing trigger to terminate job 7692204.1
> 11/08/2010 09:46:44|worker|master_hostname|I|job 7692204.1 finished on host hostname
>
>
> The messages log on the execution host contains multiple instances of the following messages regarding the job id:
>
> 11/08/2010 08:42:01|  main|hostname|W|reaping job "7692204" ptf complains: Job does not exist
> 				.
> 				.
> 11/08/2010 09:27:53|  main|hostname|W|reaping job "7692204" ptf complains: Job does not exist
> 11/08/2010 09:46:42|  main|hostname|I|SIGNAL jid: 7692204 jatask: 1 signal: KILL
>
>
> Any feedback, suggestions on how to track this down, etc. would be greatly appreciated.  So far, running the job a second time results in success.
>
> Thanks for your time.
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294385
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294406



