[GE users] qmake job and errors: pid: No such file or directory

Sean Davis sdavis2 at mail.nih.gov
Sun Dec 30 22:19:40 GMT 2007


On Dec 30, 2007 10:29 AM, Reuti <reuti at staff.uni-marburg.de> wrote:

> On 29.12.2007 at 15:05, Sean Davis wrote:
>
> On Dec 29, 2007 7:58 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
>
> > On 29.12.2007 at 00:11, Sean Davis wrote:
> >
> > On Dec 28, 2007 5:16 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> >
> > > Hi,
> > >
> > > On 28.12.2007 at 22:15, Sean Davis wrote:
> > >
> > > > I am trying to run qmake on the Solexa analysis pipeline (probably
> > > > not important, but....).  When I run this command, I get the
> > > > following error:
> > > >
> > > > can't open file /tmp/285.1.all.q/pid: No such file or directory
> > > >
> > > > I have the shepherd trace available, also.  Does this problem ring
> > > > any bells for anyone?
> > > >
> > > > As for details of our setup, we have several linux boxes running
> > > > the lx24-amd64 binaries (though they are intel machines).  All are
> > > > using ssh for communication.  None has a firewall enabled.  They
> > > > are using shared home directories, but /tmp, etc., are local to the
> > > > machines.  Qlogin, qrsh, and qsh seem to be working.  We have
> > > > openmpi installed, also.
> > > >
> > > > I know the question is vague.  I am new to SGE, so debugging
> > > > these issues is new to me as well.  Any guidance is
> > > > appreciated.
> > >
> > > usually the 'pid' file should go to a spool directory like:
> > >
> > > $SGE_ROOT/spool/sge/<node>/active_jobs/<job_id>.<task_id>
> > >
> > > or better:
> > >
> > > /var/spool/sge/<node>/active_jobs/<job_id>.<task_id>
> > >
> > > and not the local tmp directory for the job. Where is the spool
> > > directory for the node: local or on the NFS server? Best would be to
> > > have it local on all nodes:
> > >
> > > http://gridengine.sunsource.net/howto/nfsreduce.html
> > >
> >
> > Thanks, Reuti and John.
> >
> > The /tmp directory is world read and write, just to make certain.
> >
> > How can I set the location of the pid file?  Is there a convenient way
> > to check where the local spooling for each node is located?
> >
> > Here is a bit of the shepherd trace, as it seems like the local spool
> > directory is used, but also the tmp directory on the qmaster:
> >
> > 12/28/2007 16:05:54 [10020:25961]: setting environment
> > 12/28/2007 16:05:54 [10020:25961]: Initializing error file
> > 12/28/2007 16:05:54 [10020:25958]: forked "job" with pid 25961
> > 12/28/2007 16:05:54 [10020:25958]: child: job - pid: 25961
> > 12/28/2007 16:05:54 [10020:25961]: switching to intermediate/target user
> > 12/28/2007 16:05:54 [10005:25961]: closing all filedescriptors
> > 12/28/2007 16:05:54 [10005:25961]: further messages are in "error" and "trace"
> > 12/28/2007 16:05:54 [0:25961]: now running with uid=0, euid=0
> > 12/28/2007 16:05:54 [0:25961]: start qlogin
> > 12/28/2007 16:05:54 [0:25961]: calling qlogin_starter(/var/spool/sge/local/pressa/active_jobs/285.1,
> > /usr/sbin/sshd -i);
> > 12/28/2007 16:05:54 [0:25961]: uid = 0, euid = 0, gid = 0, egid = 0
> > 12/28/2007 16:05:54 [0:25961]: using sfd 1
> > 12/28/2007 16:05:54 [0:25961]: bound to port 54613
> > 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - data =
> > 0:54613:/usr/local/sge/utilbin/lx24-amd64:/var/spool/sge/local/pressa/active_jobs/285.1:pressa
> > 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - address = shakespeare:44242
> > 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - host = shakespeare, port = 44242
> > 12/28/2007 16:05:54 [0:25961]: waiting for connection.
> > 12/28/2007 16:06:54 [0:25961]: nobody connected to the socket
> >
> >
> > What are the permissions on /var/spool/sge/local/pressa/active_jobs,
> > and who is the owner? Who is the admin user of your SGE installation (the
> > one who owns /usr/local/sge)?
> >
>
> Reuti,
>
> root owns /usr/local/sge. This could be a problem, since the partition is
> mounted with root_squash.  However, sgeadmin is the admin user (the user
> under which sge_execd, sge_qmaster, etc. run).  As for
> /var/spool/sge/local/pressa/active_jobs, the owner is sgeadmin; the owner
> has rwx and everyone else has r-x.
>
> I have changed the ownership of /usr/local/sge to sgeadmin, but the
> problem seems to remain.
>
>
> Are these serial or parallel jobs? With root_squash I see a problem: what
> are the settings in $SGE_ROOT/utilbin/lx24-amd64 for:
>
> -rwsr-xr-x  1 root root  32K Oct 20  2006 rlogin
> -rwsr-xr-x  1 root root  22K Oct 20  2006 rsh
> -rwsr-xr-x  1 root root  23K Oct 20  2006 testsuidroot
>
> They must have the setuid bit set, as they must run as root.
>
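
For anyone checking the same thing, here is a minimal sketch of a setuid
check on the three helpers listed above.  The function names check_suid and
check_utilbin are made up for illustration and are not part of SGE.  The
sketch only tests the setuid bit; ownership by root still has to be
confirmed with ls -l.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with SGE).

# Report whether one file has the setuid bit set.
# Returns 0 if set, 1 if not set, 2 if the file is missing.
check_suid() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "missing: $f"
        return 2
    fi
    if [ -u "$f" ]; then
        echo "setuid: $f"
        return 0
    fi
    echo "NOT setuid: $f"
    return 1
}

# Check the three helpers named in the thread inside one utilbin directory.
# The return status is the number of files that failed the check.
check_utilbin() {
    dir=$1
    bad=0
    for name in rlogin rsh testsuidroot; do
        check_suid "$dir/$name" || bad=$((bad + 1))
    done
    return "$bad"
}
```

On this installation the call would presumably be
check_utilbin "$SGE_ROOT/utilbin/lx24-amd64", run on each execution host,
since a mount option can make the result differ from host to host.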

I have fixed these and the root_squash issue.  The jobs are qmake jobs that
look like:

qmake -pe Test 12-16 -cwd -v PATH -- -j 16

I continue to get error messages related to not finding /tmp/<jobid>/pid.  I
don't know whether this is a symptom or a cause.
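
To see which directories are involved, a rough sketch along these lines may
help.  It assumes the usual "name value" layout of qconf output; the helper
names conf_value and expected_job_tmpdir are invented for illustration, and
the <tmpdir>/<job_id>.<task_id>.<queue> pattern is inferred from the error
message earlier in this thread rather than taken from the documentation.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with SGE).

# Read "name value" lines on stdin and print the value of one attribute.
conf_value() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Build the per-job temporary directory name, assuming the pattern
# <tmpdir>/<job_id>.<task_id>.<queue> seen in the error message.
expected_job_tmpdir() {
    echo "$1/$2.$3.$4"
}

# Intended use on a host with the SGE client tools installed:
#   qconf -sconf pressa | conf_value execd_spool_dir
#   qconf -sq all.q     | conf_value tmpdir
```

If the host's local configuration does not set execd_spool_dir, the value
comes from the global configuration (qconf -sconf global).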

Sean


