[GE users] qmake: error: executing task of job: failed sending task to execd: can't find connection

jbury jbury at jcvi.org
Tue Nov 9 21:48:15 GMT 2010

SGE Version: 6.2u6
SGE Component: qmake
Application/Workload: Solexa/Illumina Pipeline


Solexa uses qmake to farm out hundreds of tasks to the grid.  Every so often a job encounters a communication error with the execution host's sge_execd daemon.

Below is an example of the error from a failed job:

error: executing task of job 7692204 failed: failed sending task to execd at hostname: can't find connection
qmake[1]: *** [Phasing/s_1_01_phasing.txt] Error 1
qmake[1]: *** Waiting for unfinished jobs....

NOTE: The accounting record for the job does not indicate a failure ("failed" is 0), but the exit status is "2".  The messages file does not show that any of the job's tasks failed either; it logs "finished" for all tasks.

[tmp]$ qacct -j 7692204
qname        default.q
hostname     hostname
group        solexa
owner        solexa
project      9614
department   defaultdepartment
jobname      qmake.sh
jobnumber    7692204
taskid       undefined
account      sge
priority     0
qsub_time    Mon Nov  8 08:00:11 2010
start_time   Mon Nov  8 08:00:16 2010
end_time     Mon Nov  8 09:46:42 2010
granted_pe   make
slots        8
failed       0
exit_status  2
ru_wallclock 6386
ru_utime     11290.803
ru_stime     1202.871
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    209475448
ru_majflt    147
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     3474688
ru_nivcsw    299076
cpu          12493.674
mem          928.143
io           133.390
iow          0.000
maxvmem      831.389G
arid         undefined
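
Since "failed 0" together with a non-zero exit_status is easy to miss by eye, a small script can flag such records. The following is only a rough sketch: it assumes the plain "key value" layout shown in the qacct output above, and the helper names parse_qacct and silent_failure are made up for illustration, not part of SGE.

```python
def parse_qacct(text):
    """Parse the 'key value' lines printed by `qacct -j` into a dict of strings."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:          # skips blank and single-token lines
            record[parts[0]] = parts[1].strip()
    return record


def silent_failure(record):
    """True when SGE reports failed == 0 but the job exited non-zero."""
    # Assumption: the first token of the 'failed' field is the numeric code.
    failed = record.get("failed", "0").split()[0]
    return failed == "0" and record.get("exit_status", "0") != "0"


# A few fields copied from the accounting record above.
sample = """\
qname        default.q
jobname      qmake.sh
jobnumber    7692204
failed       0
exit_status  2
"""

record = parse_qacct(sample)
print(record["jobnumber"], "silent failure:", silent_failure(record))
```

In practice the text would come from running qacct -j with the job number and capturing its output, rather than from a pasted string.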

11/08/2010 08:00:26|worker|master_hostname|I|task 1.hostname1 at hostname1 of job 7692204.1 finished
11/08/2010 08:00:26|worker|master_hostname|I|task 1.hostname2 at hostname2 of job 7692204.1 finished
11/08/2010 09:46:40|worker|master_hostname|I|task 903.hostname3 at hostname3 of job 7692204.1 finished
11/08/2010 09:46:44|worker|master_hostname|I|removing trigger to terminate job 7692204.1
11/08/2010 09:46:44|worker|master_hostname|I|job 7692204.1 finished on host hostname

The messages log on the execution host contains several of the following messages for this job ID:

11/08/2010 08:42:01|  main|hostname|W|reaping job "7692204" ptf complains: Job does not exist
11/08/2010 09:27:53|  main|hostname|W|reaping job "7692204" ptf complains: Job does not exist
11/08/2010 09:46:42|  main|hostname|I|SIGNAL jid: 7692204 jatask: 1 signal: KILL
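
To pull these entries out of a large messages file, something like the sketch below could be used. It assumes the pipe-separated layout seen in the lines above (timestamp, thread, host, level, text); the function names are hypothetical, and treating W, E and C as the levels of interest is my own assumption.

```python
import re


def parse_message_line(line):
    """Split one messages-file line into its five pipe-separated fields."""
    fields = line.rstrip("\n").split("|", 4)
    if len(fields) != 5:
        return None  # not in the expected layout
    timestamp, thread, host, level, text = (f.strip() for f in fields)
    return {"timestamp": timestamp, "thread": thread,
            "host": host, "level": level, "text": text}


def warnings_for_job(lines, job_id, levels=("W", "E", "C")):
    """Return parsed entries that mention job_id at one of the given levels."""
    pattern = re.compile(r"\b%s\b" % re.escape(job_id))
    hits = []
    for line in lines:
        entry = parse_message_line(line)
        if entry and entry["level"] in levels and pattern.search(entry["text"]):
            hits.append(entry)
    return hits


# The three execd log lines quoted above.
log_lines = [
    '11/08/2010 08:42:01|  main|hostname|W|reaping job "7692204" ptf complains: Job does not exist',
    '11/08/2010 09:27:53|  main|hostname|W|reaping job "7692204" ptf complains: Job does not exist',
    '11/08/2010 09:46:42|  main|hostname|I|SIGNAL jid: 7692204 jatask: 1 signal: KILL',
]

for entry in warnings_for_job(log_lines, "7692204"):
    print(entry["timestamp"], entry["level"], entry["text"])
```

Only the two "ptf complains" warnings are printed; the KILL signal line is level I and is skipped unless "I" is added to levels.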

Any feedback or suggestions on how to track this down would be greatly appreciated.  So far, running the job a second time has resulted in success.

Thanks for your time.

