[GE users] qmake job and errors: pid: No such file or directory

Heywood, Todd heywood at cshl.edu
Wed Jan 23 15:21:00 GMT 2008


Hi,

I saw Sean's message and intended to respond, but "lost" it in my SGE
mailbox. We also run the Solexa pipeline, with virtually the same cluster
configuration (lx24-amd64, although on Opterons). I've hit a similar (same?)
issue:

http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=20496

http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=20509

These errors occur every now and then, and on some random node. Re-running
the pipeline usually succeeds, but a failed run can be expensive in terms of
time and resources.

I managed to reduce, but not eliminate, the number of these errors by
mounting NFS via tcp instead of udp, and by adding options to ssh in the SGE
config:

rsh_command                  /usr/bin/ssh -o ConnectionAttempts=5 -o \
                             ConnectTimeout=60
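
For what it's worth, the NFS side of that was just forcing tcp in the
mount options; a rough sketch of the corresponding /etc/fstab entry
(server name, export and mount point are placeholders):

nfsserver:/export/data   /data   nfs   proto=tcp,hard,intr   0 0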

Todd Heywood



On 1/21/08 2:42 PM, "Chris Dagdigian" <dag at sonsorol.org> wrote:

> 
> Sean,
> 
> Did you ever solve your Solexa pipeline and SGE issues? I did some
> work in December on a lab environment that needed to spin up its
> cluster and storage resources in order to handle the arrival of a 2nd
> Solexa instrument. I did the storage, cluster and SGE work though
> without looking too closely at the Solexa toolset.
> 
> The issue we had with Solexa was that the pipeline was built on qmake
> and seemed biased towards synchronous use by a single person on a
> dedicated system -- no easy way to batch submit a job via qsub and let
> it pend asynchronously until resources are available (something that
> will be needed in a multi-user, multi-instrument environment). I think
> people have since worked around this by wrapping the qmake commands in
> qsub submit scripts, but I'm not sure.
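> 
> A rough sketch of what such a wrapper might look like (the PE name, job
> name, paths and make target below are my own assumptions), submitted
> once with "qsub solexa_qmake.sh" and left to pend like any other batch
> job:
> 
> #!/bin/sh
> #$ -N solexa_qmake
> #$ -cwd
> #$ -v PATH
> #$ -pe make 12-16
> cd /path/to/run_folder        # placeholder for the Solexa run folder
> # -inherit tells qmake to reuse the slots already granted to this job
> # rather than doing its own dynamic allocation.
> qmake -inherit -- -j $NSLOTS all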
> 
> It is very cool (and possibly not widely known within the SGE community)
> how so much of the "next generation DNA sequencing" business is being
> built on top of Grid Engine. Solexa uses 'qmake' under the hood for
> their runs and indications are that the new Helicos instruments are
> also going to have analytical workflows that run off of Grid Engine.
> 
> This is going to be big in 2008 - I've been thinking about setting up
> a wiki or mailing list specifically for "lab instruments that require
> Grid Engine" so I'm on the hunt for people interested in the topic.
> 
> On a slightly related side note -- if you are going to be in the
> Boston area in late April, we are organizing a 1-day workshop on "next-
> gen sequencing" with particular focus on the data handling and
> migration problems that smaller labs are beginning to be bitten by.
> I won't spam the details here, but we've got the info posted up on
> http://blog.bioteam.net now.  We are trying to get as many sequencing
> and IT types as we can
> into the same room for some practical "what the heck do we do with
> terabyte capable lab instruments" talks.
> 
> Creating a "DNA Sequencers shipping with SGE" article and summary is
> also on the personal to-do list for gridengine.info.
> 
> Regards,
> Chris
> 
> 
> 
> 
> On Dec 30, 2007, at 5:19 PM, Sean Davis wrote:
> 
>> 
>> 
>> On Dec 30, 2007 10:29 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
>> On 29.12.2007 at 15:05, Sean Davis wrote:
>> 
>>> On Dec 29, 2007 7:58 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
>>> On 29.12.2007 at 00:11, Sean Davis wrote:
>>> 
>>>> On Dec 28, 2007 5:16 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>> Hi,
>>>> 
>>>> On 28.12.2007 at 22:15, Sean Davis wrote:
>>>> 
>>>>> I am trying to run qmake on the Solexa analysis pipeline (probably
>>>>> not important, but....).  When I run this command, I get the
>>>>> following error:
>>>>> 
>>>>> can't open file /tmp/285.1.all.q/pid: No such file or directory
>>>>> 
>>>>> I have the shepherd trace available, also.  Does this problem ring
>>>>> any bells for anyone?
>>>>> 
>>>>> As for details of our setup, we have several linux boxes running
>>>>> the lx24-amd64 binaries (though they are intel machines).  All are
>>>>> using ssh for communication.  None has a firewall enabled.  They
>>>>> are using shared home directories, but /tmp, etc., are local to the
>>>>> machines.  Qlogin, qrsh, and qsh seem to be working.  We have
>>>>> openmpi installed, also.
>>>>> 
>>>>> I know the question is pretty vague.  I am new to SGE, so
>>>>> debugging these issues is new to me as well.  Any guidance is
>>>>> appreciated.
>>>> 
>>>> usually the 'pid' file should go to a spool directory like:
>>>> 
>>>> $SGE_ROOT/spool/sge/<node>/active_jobs/<job_id>.<task_id>
>>>> 
>>>> or better:
>>>> 
>>>> /var/spool/sge/<node>/active_jobs/<job_id>.<task_id>
>>>> 
>>>> and not to the job's local tmp directory.  Where is the spool
>>>> directory for the node: local, or on the NFS server?  It would be
>>>> best to have it local on all nodes:
>>>> 
>>>> http://gridengine.sunsource.net/howto/nfsreduce.html
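>>>> 
>>>> A quick way to see where a node actually spools is to check the
>>>> global value and any host-local override (the second command only
>>>> shows something if a local configuration exists for that node):
>>>> 
>>>> qconf -sconf | grep execd_spool_dir
>>>> qconf -sconf <node> | grep execd_spool_dir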
>>>> 
>>>> Thanks, Reuti and John.
>>>> 
>>>> The /tmp directory is world readable and writable, just to be certain.
>>>> 
>>>> How can I set the location of the pid file?  Is there a convenient
>>>> way to check where the local spooling for each node is located?
>>>> 
>>>> Here is a bit of the shepherd trace, as it seems like the local
>>>> spool directory is used, but also the tmp directory on the qmaster:
>>>> 
>>>> 12/28/2007 16:05:54 [10020:25961]: setting environment
>>>> 12/28/2007 16:05:54 [10020:25961]: Initializing error file
>>>> 12/28/2007 16:05:54 [10020:25958]: forked "job" with pid 25961
>>>> 12/28/2007 16:05:54 [10020:25958]: child: job - pid: 25961
>>>> 12/28/2007 16:05:54 [10020:25961]: switching to intermediate/
>>>> target user
>>>> 12/28/2007 16:05:54 [10005:25961]: closing all filedescriptors
>>>> 12/28/2007 16:05:54 [10005:25961]: further messages are in "error"
>>>> and "trace"
>>>> 12/28/2007 16:05:54 [0:25961]: now running with uid=0, euid=0
>>>> 12/28/2007 16:05:54 [0:25961]: start qlogin
>>>> 12/28/2007 16:05:54 [0:25961]: calling qlogin_starter(/var/spool/sge
>>>> /local/pressa/active_jobs/285.1, /usr/sbin/sshd -i);
>>>> 12/28/2007 16:05:54 [0:25961]: uid = 0, euid = 0, gid = 0, egid = 0
>>>> 12/28/2007 16:05:54 [0:25961]: using sfd 1
>>>> 12/28/2007 16:05:54 [0:25961]: bound to port 54613
>>>> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - data = 0:54613:/usr/
>>>> local/sge/utilbin/lx24-amd64:/var/spool/sge/local/pressa/
>>>> active_jobs/285.1:pressa
>>>> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - address =
>>>> shakespeare:44242
>>>> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - host = shakespeare,
>>>> port = 44242
>>>> 12/28/2007 16:05:54 [0:25961]: waiting for connection.
>>>> 12/28/2007 16:06:54 [0:25961]: nobody connected to the socket
>>> 
>>> What are the permissions on your /var/spool/sge/local/pressa/
>>> active_jobs and who is the owner? Who is the admin user of your SGE
>>> installation (the one who owns /usr/local/sge)?
>>> 
>>> Reuti,
>>> 
>>> root owns /usr/local/sge--this could be a problem, since the
>>> partition is mounted with root-squash.  However, sgeadmin is the
>>> admin user (the user under which sge_execd, sge_qmaster, etc.
>>> run).  As for /var/spool/sge/local/pressa/active_jobs, the owner is
>>> sgeadmin; everyone can r-x, owner can rwx.
>>> 
>>> I have changed the ownership of /usr/local/sge to sgeadmin, but the
>>> problem seems to remain.
>> 
>> 
>> Are these serial or parallel jobs? With root_squash I see a problem:
>> what are the settings in $SGE_ROOT/utilbin/lx24-amd64 for:
>> 
>> -rwsr-xr-x  1 root root  32K Oct 20  2006 rlogin
>> -rwsr-xr-x  1 root root  22K Oct 20  2006 rsh
>> -rwsr-xr-x  1 root root  23K Oct 20  2006 testsuidroot
>> 
>> They must have the setuid bit set, as they must run as root.
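>> 
>> If the setuid bit has been lost, restoring it should look roughly like
>> this (as root, and on the NFS server itself if $SGE_ROOT is exported
>> with root_squash, since a squashed client root cannot chown to root):
>> 
>> cd $SGE_ROOT/utilbin/lx24-amd64
>> chown root rlogin rsh testsuidroot
>> chmod 4755 rlogin rsh testsuidroot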
>> 
>> I have fixed these and the root_squash issue.  The jobs are qmake
>> jobs that look like:
>> 
>> qmake -pe Test 12-16 -cwd -v PATH -- -j 16
>> 
>> I continue to get error messages related to not finding /tmp/<jobid>/
>> pid.  I don't know whether this is a symptom or a cause.
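>> 
>> For reference, my reading of that command line (the "Test" PE is ours):
>> 
>>   -pe Test 12-16   request the parallel environment "Test" with 12-16 slots
>>   -cwd             run in the current working directory
>>   -v PATH          export my PATH into the job
>>   --               everything after this is passed to gmake
>>   -j 16            let gmake run up to 16 rules in parallel
>> 
>> In this dynamic allocation mode qmake farms the individual make rules
>> out over qrsh to the granted slots, which seems to match the
>> qlogin_starter / write_to_qrsh activity in the shepherd trace above.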
>> 
>> Sean
>> 
>> 
>> 
>> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



