[GE users] qmake job and errors: pid: No such file or directory

Reuti reuti at staff.uni-marburg.de
Sun Dec 30 15:29:45 GMT 2007


On 29.12.2007 at 15:05, Sean Davis wrote:

> On Dec 29, 2007 7:58 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> On 29.12.2007 at 00:11, Sean Davis wrote:
>
>> On Dec 28, 2007 5:16 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
>> Hi,
>>
>> On 28.12.2007 at 22:15, Sean Davis wrote:
>>
>> > I am trying to run qmake on the Solexa analysis pipeline (probably
>> > not important, but....).  When I run this command, I get the
>> > following error:
>> >
>> > can't open file /tmp/285.1.all.q/pid: No such file or directory
>> >
>> > I have the shepherd trace available, also.  Does this problem ring
>> > any bells for anyone?
>> >
>> > As for details of our setup, we have several Linux boxes running
>> > the lx24-amd64 binaries (though they are Intel machines).  All are
>> > using ssh for communication.  None has a firewall enabled.  They
>> > are using shared home directories, but /tmp, etc., are local to the
>> > machines.  Qlogin, qrsh, and qsh seem to be working.  We have
>> > openmpi installed, also.
>> >
>> > I know the question is pretty vague.  I am pretty new to SGE, so
>> > debugging these issues is pretty new also.  Any guidance is
>> > appreciated.
>>
>> usually the 'pid' file should go to a spool directory like:
>>
>> $SGE_ROOT/spool/sge/<node>/active_jobs/<job_id>.<task_id>
>>
>> or better:
>>
>> /var/spool/sge/<node>/active_jobs/<job_id>.<task_id>
>>
>> and not the local tmp directory for the job. Where is the spool
>> directory for the node: local or on the NFS server? Best would be to
>> have it local on all nodes:
>>
>> http://gridengine.sunsource.net/howto/nfsreduce.html
>>
>> Thanks, Reuti and John.
>>
>> The /tmp directory is world read and write, just to make certain.
>>
>> How can I set the location of the pid file?  Is there a convenient  
>> way to check where the local spooling for each node is located?
>>
>> Here is a bit of the shepherd trace, as it seems like the local  
>> spool directory is used, but also the tmp directory on the qmaster:
>>
>> 12/28/2007 16:05:54 [10020:25961]: setting environment
>> 12/28/2007 16:05:54 [10020:25961]: Initializing error file
>> 12/28/2007 16:05:54 [10020:25958]: forked "job" with pid 25961
>> 12/28/2007 16:05:54 [10020:25958]: child: job - pid: 25961
>> 12/28/2007 16:05:54 [10020:25961]: switching to intermediate/target user
>> 12/28/2007 16:05:54 [10005:25961]: closing all filedescriptors
>> 12/28/2007 16:05:54 [10005:25961]: further messages are in "error" and "trace"
>> 12/28/2007 16:05:54 [0:25961]: now running with uid=0, euid=0
>> 12/28/2007 16:05:54 [0:25961]: start qlogin
>> 12/28/2007 16:05:54 [0:25961]: calling qlogin_starter(/var/spool/sge/local/pressa/active_jobs/285.1, /usr/sbin/sshd -i);
>> 12/28/2007 16:05:54 [0:25961]: uid = 0, euid = 0, gid = 0, egid = 0
>> 12/28/2007 16:05:54 [0:25961]: using sfd 1
>> 12/28/2007 16:05:54 [0:25961]: bound to port 54613
>> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - data = 0:54613:/usr/local/sge/utilbin/lx24-amd64:/var/spool/sge/local/pressa/active_jobs/285.1:pressa
>> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - address = shakespeare:44242
>> 12/28/2007 16:05:54 [0:25961]: write_to_qrsh - host = shakespeare, port = 44242
>> 12/28/2007 16:05:54 [0:25961]: waiting for connection.
>> 12/28/2007 16:06:54 [0:25961]: nobody connected to the socket
>
> What are the permissions on your
> /var/spool/sge/local/pressa/active_jobs directory, and who is the
> owner? Who is the admin user of your SGE installation (the one who
> owns /usr/local/sge)?
>
> Reuti,
>
> root owns /usr/local/sge--this could be a problem, since the  
> partition is mounted with root-squash.  However, sgeadmin is the  
> admin user (the user under which sge_execd, sge_qmaster, etc.  
> run).  As for /var/spool/sge/local/pressa/active_jobs, the owner is  
> sgeadmin; everyone can r-x, owner can rwx.
>
> I have changed the ownership of /usr/local/sge to sgeadmin, but the  
> problem seems to remain.


Are these serial or parallel jobs? With root_squash I see a problem:
what are the owner and permissions of the following files in
$SGE_ROOT/utilbin/lx24-amd64? They should look like this:

-rwsr-xr-x  1 root root  32K Oct 20  2006 rlogin
-rwsr-xr-x  1 root root  22K Oct 20  2006 rsh
-rwsr-xr-x  1 root root  23K Oct 20  2006 testsuidroot

They must be owned by root and have the setuid bit set, since they
have to run as root.
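
To check this on each node without reading the ls output by eye, something like the following could be used. It is only a minimal sketch: the function name check_setuid_root is made up for illustration, and the fallback path /usr/local/sge is just the install location mentioned earlier in this thread, used when $SGE_ROOT is not set.

```python
import os
import stat


def check_setuid_root(path):
    """Return a list of problems for one helper binary (empty list means OK).

    Hypothetical helper, not part of SGE.
    """
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return ["missing"]

    problems = []
    if st.st_uid != 0:
        # The setuid bit only gives root rights if root owns the file.
        problems.append("not owned by root")
    if not st.st_mode & stat.S_ISUID:
        # Corresponds to the 's' in '-rwsr-xr-x'.
        problems.append("setuid bit not set")
    return problems


if __name__ == "__main__":
    # Assumption: fall back to the install path from this thread.
    sge_root = os.environ.get("SGE_ROOT", "/usr/local/sge")
    utilbin = os.path.join(sge_root, "utilbin", "lx24-amd64")
    for name in ("rlogin", "rsh", "testsuidroot"):
        path = os.path.join(utilbin, name)
        problems = check_setuid_root(path)
        print(path, "OK" if not problems else ", ".join(problems))
```

Run it on the execution host itself, so the result reflects how that node sees the (possibly NFS-mounted) files.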

-- Reuti


