[GE users] qmake job and errors: pid: No such file or directory

Reuti reuti at staff.uni-marburg.de
Sat Dec 29 12:58:29 GMT 2007


On 29.12.2007 at 00:11, Sean Davis wrote:

> On Dec 28, 2007 5:16 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
>
> On 28.12.2007 at 22:15, Sean Davis wrote:
>
> > I am trying to run qmake on the Solexa analysis pipeline (probably
> > not important, but....).  When I run this command, I get the
> > following error:
> >
> > can't open file /tmp/285.1.all.q/pid: No such file or directory
> >
> > I have the shepherd trace available, also.  Does this problem ring
> > any bells for anyone?
> >
> > As for details of our setup, we have several linux boxes running
> > the lx24-amd64 binaries (though they are intel machines).  All are
> > using ssh for communication.  None has a firewall enabled.  They
> > are using shared home directories, but /tmp, etc., are local to the
> > machines.  Qlogin, qrsh, and qsh seem to be working.  We have
> > openmpi installed, also.
> >
> > I know the question is pretty vague.  I am pretty new to SGE, so
> > debugging these issues is pretty new also.  Any guidance is
> > appreciated.
>
> usually the 'pid' file should go to a spool directory like:
>
> $SGE_ROOT/spool/sge/<node>/active_jobs/<job_id>.<task_id>
>
> or better:
>
> /var/spool/sge/<node>/active_jobs/<job_id>.<task_id>
>
> and not the local tmp directory for the job. Where is the spool
> directory for the node: local or on the NFS server? Best would be to
> have it local on all nodes:
>
> http://gridengine.sunsource.net/howto/nfsreduce.html
>
> Thanks, Reuti and John.
>
> The /tmp directory is world read and write, just to make certain.
>
> How can I set the location of the pid file?  Is there a convenient  
> way to check where the local spooling for each node is located?
>
> Here is a bit of the shepherd trace, as it seems like the local  
> spool directory is used, but also the tmp directory on the qmaster:
>
> 12/28/2007 16:05:54 [10020:25961]: setting environment
> 12/28/2007 16:05:54 [10020:25961]: Initializing error file
> 12/28/2007 16:05:54 [10020:25958]: forked "job" with pid 25961
> 12/28/2007 16:05:54 [10020:25958]: child: job - pid: 25961
> 12/28/2007 16:05:54 [10020:25961]: switching to intermediate/target user
> 12/28/2007 16:05:54 [10005:25961]: closing all filedescriptors
> 12/28/2007 16:05:54 [10005:25961]: further messages are in "error" and "trace"
> 12/28/2007 16:05:54 [0:25961]: now running with uid=0, euid=0
> 12/28/2007 16:05:54 [0:25961]: start qlogin
> 12/28/2007 16:05:54 [0:25961]: calling qlogin_starter(/var/spool/sge/local/pressa/active_jobs/285.1, /usr/sbin/sshd -i);
> 12/28/2007 16:05:54 [0:25961]: uid = 0, euid = 0, gid = 0, egid = 0
> 12/28/2007 16:05:54 [0:25961]: using sfd 1
> 12/28/2007 16:05:54 [0:25961]: bound to port 54613
> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - data = 0:54613:/usr/local/sge/utilbin/lx24-amd64:/var/spool/sge/local/pressa/active_jobs/285.1:pressa
> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - address = shakespeare:44242
> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - host = shakespeare, port = 44242
> 12/28/2007 16:05:54 [0:25961]: waiting for connection.
> 12/28/2007 16:06:54 [0:25961]: nobody connected to the socket

What are the permissions on your /var/spool/sge/local/pressa/active_jobs
directory, and who owns it? Who is the admin user of your SGE
installation (the one who owns /usr/local/sge)?
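
A quick way to collect both answers in one go is a short script along
these lines. It is only a sketch: the two paths are the ones that appear
in the trace and in the question above, and the helper name
describe_path is made up for illustration. Run it on the execution
host (pressa) and adjust the paths if your installation differs.

```python
import grp
import os
import pwd
import stat


def describe_path(path):
    """Return (mode string, owner name, group name), similar to `ls -ld`.

    describe_path is a hypothetical helper, not part of SGE.
    """
    st = os.stat(path)
    mode = stat.filemode(st.st_mode)  # e.g. "drwxr-xr-x"
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid without a passwd entry
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)  # gid without a group entry
    return mode, owner, group


if __name__ == "__main__":
    # Assumed paths, copied from the trace; change them for your setup.
    for path in ("/var/spool/sge/local/pressa/active_jobs", "/usr/local/sge"):
        try:
            print(path, *describe_path(path))
        except OSError as exc:
            print(path, "cannot stat:", exc)
```

The output of a plain `ls -ld` on the same two directories would serve
just as well.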

-- Reuti


> 12/28/2007 16:06:54 [0:25961]: forked "job" with pid 0
> 12/28/2007 16:06:54 [0:25961]: child: job - pid: 0
> 12/28/2007 16:06:54 [0:25961]: wait3 returned -1
> 12/28/2007 16:06:54 [0:25961]: can't open file /tmp/285.1.all.q/pid: No such file or directory
> 12/28/2007 16:06:54 [0:25961]: write_to_qrsh - data = 1:can't open file /tmp/285.1.all.q/pid: No such file or directory
> 12/28/2007 16:06:54 [0:25961]: write_to_qrsh - address = shakespeare
> 12/28/2007 16:06:54 [0:25961]: illegal value for qrsh_control_port: " shakespeare". Should be host:port
> 12/28/2007 16:06:54 [10020:25958]: wait3 returned 25961 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
> 12/28/2007 16:06:54 [10020:25958]: job exited with exit status 11
> 12/28/2007 16:06:54 [10020:25958]: reaped "job" with pid 25961
> 12/28/2007 16:06:54 [10020:25958]: job exited not due to signal
> 12/28/2007 16:06:54 [10020:25958]: job exited with status 11
> 12/28/2007 16:06:54 [0:25958]: can't open file /tmp/285.1.all.q/pid: No such file or directory
> 12/28/2007 16:06:54 [0:25958]: write_to_qrsh - data = 1:can't open file /tmp/285.1.all.q/pid: No such file or directory
> 12/28/2007 16:06:54 [0:25958]: write_to_qrsh - address = shakespeare:44242
> 12/28/2007 16:06:54 [0:25958]: write_to_qrsh - host = shakespeare, port = 44242
>
> Any other ideas?
>
> Thanks again,
> Sean
>
>



