[GE users] qmake job and errors: pid: No such file or directory
heywood at cshl.edu
Wed Jan 23 15:21:00 GMT 2008
I saw Sean's message and intended to respond, but "lost" it in my SGE
mailbox. We also run the Solexa pipeline, with virtually the same cluster
configuration (lx24-amd64, although on Opterons). I've hit a similar (same?) problem.
These errors occur every now and then, and on some random node. Re-running
the pipeline usually succeeds, but a failed run can be expensive in terms of
time and resources.
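
Since a plain re-run usually succeeds, one way to automate it is a small retry wrapper. This is only a sketch: the function name run_with_retries, the attempt count and the delay are illustrative assumptions, not part of the Solexa pipeline or of SGE, and it does nothing about the underlying cause.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=3, delay_seconds=0.0):
    """Run cmd (an argv list) until it exits 0 or max_attempts is used up.

    Returns (exit_status, attempts_used). The name and defaults are
    hypothetical; tune them for your own pipeline.
    """
    status = 1
    for attempt in range(1, max_attempts + 1):
        status = subprocess.call(cmd)
        if status == 0:
            return status, attempt
        if attempt < max_attempts:
            # Give a flaky node or mount a moment before trying again.
            time.sleep(delay_seconds)
    return status, max_attempts


# Hypothetical usage with the qmake invocation quoted later in this thread:
# run_with_retries(
#     ["qmake", "-pe", "Test", "12-16", "-cwd", "-v", "PATH", "--", "-j", "16"],
#     max_attempts=3, delay_seconds=30)
```

Because make only rebuilds targets that are missing or out of date, a retry should normally resume from the failed step rather than start over, assuming the failed step did not leave a partial output file behind.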
I managed to reduce, but not eliminate, the number of these errors by
mounting NFS via tcp instead of udp, and by adding options to ssh in the SGE
rsh_command /usr/bin/ssh -o ConnectionAttempts=5 -o \
On 1/21/08 2:42 PM, "Chris Dagdigian" <dag at sonsorol.org> wrote:
> Did you ever solve your Solexa pipeline and SGE issues? I did some
> work in December on a lab environment that needed to spin up its
> cluster and storage resources in order to handle the arrival of a 2nd
> Solexa instrument. I did the storage, cluster and SGE work though
> without looking too closely at the Solexa toolset.
> The issue we had with Solexa was that the pipeline was built on qmake
> and seemed biased towards synchronous use by a single person on a
> dedicated system -- no easy way to batch submit a job via qsub and let
> it pend asynchronously until resources are available (something that
> will be needed in a multi-user, multi-instrument environment). I think
> people have worked around this by now using qsub wrappers over the
> qmake commands but I'm not sure.
> It is very cool (and possibly not known much within the SGE community)
> how so much of the "next generation DNA sequencing" business is being
> built on top of Grid Engine. Solexa uses 'qmake' under the hood for
> their runs and indications are that the new Helicos instruments are
> also going to have analytical workflows that run off of Grid Engine.
> This is going to be big in 2008 - I've been thinking about setting up
> a wiki or mailing list specifically for "lab instruments that require
> Grid Engine" so I'm on the hunt for people interested in the topic.
> On a slightly related side note -- if you are going to be in the
> Boston area in late April we are organizing a 1-day workshop on "next-
> gen sequencing" with particular focus on the data handling and
> migration problems that smaller labs are beginning to get bitten with.
> I won't spam the details here but we've got the info posted up on
> now. We are trying to get as many sequencing and IT types as we can
> into the same room for some practical "what the heck do we do with
> terabyte capable lab instruments" talks.
> Creating a "DNA Sequencers shipping with SGE" article and summary is
> also on the personal to-do list for gridengine.info as well.
> On Dec 30, 2007, at 5:19 PM, Sean Davis wrote:
>> On Dec 30, 2007 10:29 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
>> Am 29.12.2007 um 15:05 schrieb Sean Davis:
>>> On Dec 29, 2007 7:58 AM, Reuti < reuti at staff.uni-marburg.de> wrote:
>>> Am 29.12.2007 um 00:11 schrieb Sean Davis:
>>>> On Dec 28, 2007 5:16 PM, Reuti < reuti at staff.uni-marburg.de> wrote:
>>>> Am 28.12.2007 um 22:15 schrieb Sean Davis:
>>>>> I am trying to run qmake on the Solexa analysis pipeline (probably
>>>>> not important, but....). When I run this command, I get the
>>>>> following error:
>>>>> can't open file /tmp/285.1.all.q/pid: No such file or directory
>>>>> I have the shepherd trace available, also. Does this problem ring
>>>>> any bells for anyone?
>>>>> As for details of our setup, we have several linux boxes running
>>>>> the lx24-amd64 binaries (though they are intel machines). All are
>>>>> using ssh for communication. None has a firewall enabled. They
>>>>> are using shared home directories, but /tmp, etc., are local to
>>>>> machines. Qlogin, qrsh, and qsh seem to be working. We have
>>>>> openmpi installed, also.
>>>>> I know the question is pretty vague. I am pretty new to SGE, so
>>>>> debugging these issues is pretty new also. Any guidance is
>>>> usually the 'pid' file should go to a spool directory like:
>>>> or better:
>>>> and not the local tmp directory for the job. Where is the spool
>>>> directory for the node: local or on the NFS server? Best would be to
>>>> have it local on all nodes:
>>>> Thanks, Reuti and John.
>>>> The /tmp directory is world read and write, just to make certain.
>>>> How can I set the location of the pid file? Is there a convenient
>>>> way to check where the local spooling for each node is located?
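
One possible way to check the spool location per node is to read the execd_spool_dir parameter from the output of "qconf -sconf" (global) or "qconf -sconf <host>" (host-local override, if one exists). The sketch below assumes that output is plain "name value" lines; the function names are made up for illustration, and the sample text in the comments is hypothetical, not taken from this cluster.

```python
import subprocess


def find_execd_spool_dir(sconf_text):
    """Return the execd_spool_dir value from `qconf -sconf` style text, or None."""
    for line in sconf_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == "execd_spool_dir":
            return fields[1].strip()
    return None


def read_sge_config(host=None):
    """Run `qconf -sconf [host]` and return its output as text.

    Requires qconf on PATH and a working SGE environment.
    """
    cmd = ["qconf", "-sconf"]
    if host:
        cmd.append(host)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical usage:
#   spool = (find_execd_spool_dir(read_sge_config("pressa"))
#            or find_execd_spool_dir(read_sge_config()))
# A host-local configuration may not set execd_spool_dir at all, in which
# case the global value applies, hence the fallback.
```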
>>>> Here is a bit of the shepherd trace, as it seems like the local
>>>> spool directory is used, but also the tmp directory on the qmaster:
>>>> 12/28/2007 16:05:54 [10020:25961]: setting environment
>>>> 12/28/2007 16:05:54 [10020:25961]: Initializing error file
>>>> 12/28/2007 16:05:54 [10020:25958]: forked "job" with pid 25961
>>>> 12/28/2007 16:05:54 [10020:25958]: child: job - pid: 25961
>>>> 12/28/2007 16:05:54 [10020:25961]: switching to intermediate/
>>>> target user
>>>> 12/28/2007 16:05:54 [10005:25961]: closing all filedescriptors
>>>> 12/28/2007 16:05:54 [10005:25961]: further messages are in "error"
>>>> and "trace"
>>>> 12/28/2007 16:05:54 [0:25961]: now running with uid=0, euid=0
>>>> 12/28/2007 16:05:54 [0:25961]: start qlogin
>>>> 12/28/2007 16:05:54 [0:25961]: calling qlogin_starter(/var/spool/sge
>>>> /local/pressa/active_jobs/285.1, /usr/sbin/sshd -i);
>>>> 12/28/2007 16:05:54 [0:25961]: uid = 0, euid = 0, gid = 0, egid = 0
>>>> 12/28/2007 16:05:54 [0:25961]: using sfd 1
>>>> 12/28/2007 16:05:54 [0:25961]: bound to port 54613
>>>> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - data = 0:54613:/usr/
>>>> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - address =
>>>> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - host = shakespeare,
>>>> port = 44242
>>>> 12/28/2007 16:05:54 [0:25961]: waiting for connection.
>>>> 12/28/2007 16:06:54 [0:25961]: nobody connected to the socket
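
The notable thing in the trace above is the one-minute pause between "waiting for connection." and "nobody connected to the socket". A rough sketch for pulling such pauses out of a longer trace follows. It assumes each entry looks like "MM/DD/YYYY HH:MM:SS [uid:pid]: message" (the bracketed pair appears to be uid and pid, judging from the lines above), strips mail quoting, and joins wrapped lines onto the previous entry. The function names are illustrative only.

```python
import re
from datetime import datetime

# Assumed entry layout: date, time, [uid:pid], then the message text.
TRACE_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: (.*)$"
)


def parse_trace(text):
    """Parse shepherd trace text into a list of dicts."""
    entries = []
    for raw in text.splitlines():
        # Drop mail quoting ('>' and spaces) and trailing whitespace.
        line = raw.lstrip("> ").rstrip()
        match = TRACE_RE.match(line)
        if match:
            stamp, uid, pid, message = match.groups()
            entries.append({
                "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
                "uid": int(uid),
                "pid": int(pid),
                "message": message,
            })
        elif entries and line:
            # A wrapped continuation line belongs to the previous entry.
            entries[-1]["message"] += " " + line
    return entries


def largest_gap(entries):
    """Return (seconds, earlier_entry, later_entry) for the longest pause, or None."""
    best = None
    for earlier, later in zip(entries, entries[1:]):
        seconds = (later["time"] - earlier["time"]).total_seconds()
        if best is None or seconds > best[0]:
            best = (seconds, earlier, later)
    return best
```

Fed the trace above, largest_gap should point at the entry pair around the socket wait, which is where the job stalls before giving up.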
>>> What are the protections of your /var/spool/sge/local/pressa/
>>> active_jobs and who is the owner? Who is the admin user of your SGE
>>> installation (the one who owns /usr/local/sge)?
>>> root owns /usr/local/sge--this could be a problem, since the
>>> partition is mounted with root-squash. However, sgeadmin is the
>>> admin user (the user under which sge_execd, sge_qmaster, etc.
>>> run). As for /var/spool/sge/local/pressa/active_jobs, the owner is
>>> sgeadmin; everyone can r-x, owner can rwx.
>>> I have changed the ownership of /usr/local/sge to sgeadmin, but the
>>> problem seems to remain.
>> Are these serial or parallel jobs? With root_squash I see a problem:
>> what are the settings in $SGE_ROOT/utilbin/lx24-amd64 for:
>> -rwsr-xr-x 1 root root 32K Oct 20 2006 rlogin
>> -rwsr-xr-x 1 root root 22K Oct 20 2006 rsh
>> -rwsr-xr-x 1 root root 23K Oct 20 2006 testsuidroot
>> They must have a setuid as they must run as root.
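
The setuid requirement can be checked on every node with a few lines of script instead of reading ls output by eye. This is a sketch only: the helper names are the three listed above, the function name check_helpers is made up, and the directory is whatever $SGE_ROOT/utilbin/lx24-amd64 resolves to on your installation.

```python
import os
import stat


def check_helpers(utilbin_dir, names=("rlogin", "rsh", "testsuidroot")):
    """Report the setuid bit and owner uid for each helper binary.

    Returns {name: (has_setuid_bit, owner_uid)}, or {name: None} when the
    file is missing or cannot be inspected.
    """
    report = {}
    for name in names:
        path = os.path.join(utilbin_dir, name)
        try:
            info = os.stat(path)
        except OSError:
            report[name] = None
            continue
        report[name] = (bool(info.st_mode & stat.S_ISUID), info.st_uid)
    return report


def all_setuid_root(report):
    """True only if every helper exists, has the setuid bit, and is owned by uid 0."""
    return all(entry is not None and entry[0] and entry[1] == 0
               for entry in report.values())


# Hypothetical usage:
#   report = check_helpers(os.path.join(os.environ["SGE_ROOT"], "utilbin", "lx24-amd64"))
#   print(report, all_setuid_root(report))
```

Note that with root_squash on the NFS export, files created by root on a client typically end up owned by an unprivileged user, so the owner check matters as much as the mode bits.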
>> I have fixed these and the root_squash issue. The jobs are qmake
>> jobs that look like:
>> qmake -pe Test 12-16 -cwd -v PATH -- -j 16
>> I continue to get error messages related to not finding /tmp/<jobid>/
>> pid. I don't know whether this is a symptom or a cause.
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net