[GE users] Qmake and Error 129 (HUP)

jteer teerj at mail.nih.gov
Thu Mar 4 13:47:05 GMT 2010


We have been using qmake to manage a data analysis pipeline, and have been quite satisfied.  We generally run qmake as:
qmake -- -j50
which allows scheduling of individual jobs as well as using unique SGE_RREQ values for each command.  Recently, we have been getting the following error:
qmake: *** [<file>] Error 129
qmake: *** Waiting for unfinished jobs....
Probably as expected, other concurrent jobs continue to run until finished, but no new commands are executed, and qmake must be rerun in order to complete the analysis.  Interestingly, it doesn't look like any error occurs.  Our commands are generally like:
script && echo `date` $$JOB_ID > logfile
The script appears to complete, and the logfile is always created.  The qacct readout for the job includes:
failed       100 : assumedly after job
exit_status  129                 
and the execd message indicates:
<date> <time>|worker|<host>|W|job <jobid> failed on host <host> assumedly after job because: job <jobid> died through signal HUP (1)
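For what it's worth, the exit_status of 129 is consistent with the usual shell convention of reporting 128 plus the signal number when a process is killed by a signal, which matches the HUP (1) in the execd message.  A minimal sketch of that arithmetic (the variable names here are only illustrative, not SGE settings):

```shell
# Decode an exit status above 128 into the signal that ended the job.
# "status", "sig", and "name" are illustrative names, not SGE variables.
status=129                      # exit_status as reported by qacct
if [ "$status" -gt 128 ]; then
    sig=$((status - 128))       # 129 - 128 = 1
    name=$(kill -l "$sig")      # translate the signal number to its name
    echo "exit status $status = signal $sig ($name)"
fi
```

This only confirms that the two reports agree with each other; it says nothing about where the HUP comes from.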
These errors are becoming more troubling, as we get 2-5 per analysis, and we generally want to fire off 5-10 analyses recursively.  The key is that this should generally happen automatically unless there is a real error.  I can't figure out what causes these errors, but it's making this automation strategy infeasible.  Most of the data does live on several NFS servers, but this problem is only recent, so it is not necessarily an NFS issue (although possibly an NFS settings issue?).  We're running CentOS 4.7 and SGE 6.2.  Any ideas? Thanks!

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=247062
