[GE users] Plz help with strange shepherd message

Viktor Oudovenko udo at physics.rutgers.edu
Tue May 27 20:57:10 BST 2008


Hi Reuti,

Thank you for replying!

> > Root can write in any place. This is for sure.
> 
> if it's NFS mounted, there might be a root_squash in place in
> the /etc/exports on the file server.

The spool is a local directory.
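
One way to double-check this on a node is to look at the mount source that backs the spool path. This is only a sketch: the helper name mount_source is made up for illustration, and it relies on the POSIX output format of df -P. An NFS mount would normally show up as server:/export, a local disk as a device name.

```shell
# mount_source DIR: print the device or export that backs DIR.
# (hypothetical helper name; uses the portable "df -P" output format)
mount_source() {
    # The second line of "df -P" describes the filesystem holding DIR;
    # its first column is the mount source.
    df -P "$1" | awk 'NR == 2 { print $1 }'
}
```

Usage on this installation would be: mount_source /opt/SGE/spool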
 
> > The problem is that in the directory
> > /opt/SGE/spool/sub04n157/active_jobs/186117.1
> > there is a trace file which belongs to the user, but in the
> > subdirectory 1.sub04n157 the trace file (full path
> > /opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace)
> > belongs to root.
> 
> Having it local would be more convenient and lowers the 
> network traffic. It's mounted in your installation right now?

The spool is always local, and I mount the default directory only at job
submission or job end.
Right now we are talking only about local files.
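
Since everything involved is local, the ownership mismatch can be listed directly on the execution host. A minimal sketch, assuming a find that supports -user (the helper name owned_by is only illustrative):

```shell
# owned_by OWNER DIR: print the regular files under DIR owned by OWNER.
# OWNER may be a user name or a numeric uid. (hypothetical helper name)
owned_by() {
    find "$2" -type f -user "$1" -print
}
```

Usage on one of the nodes from this thread would be: owned_by root /opt/SGE/spool/sub04n157/active_jobs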

> > And shepherd.XXXX belongs to a user, so it is natural that the user
> > cannot write to a file which belongs to root.
> > The problem is: why does the system try to do it?
> >
> > OK. To be clearer, here is an example from another job where the
> > permissions can be clearly seen:
> >
> > [15:14:39]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1>ls -al
> > total 32
> > drwxr-xr-x 3 sgeadmin sge  320 2008-05-27 08:58 .
> > drwxr-xr-x 3 sgeadmin sge   72 2008-05-27 08:58 ..
> > drwxr-xr-x 2 sgeadmin sge  256 2008-05-27 08:58 1.sub04n178
> > -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
> > -rw-r--r-- 1 sgeadmin sge 1793 2008-05-27 08:58 config
> > -rw-r--r-- 1 sgeadmin sge 1577 2008-05-27 08:58 environment
> > -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 error
> > -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 exit_status
> > -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
> > -rw-r--r-- 1 sgeadmin sge 1240 2008-05-27 08:58 pe_hostfile
> > -rw-r--r-- 1 sgeadmin sge    4 2008-05-27 08:58 pid
> > -rw-r--r-- 1 camjayi  sge 4116 2008-05-27 08:58 trace
> >
> > [15:14:43]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1>ls -l 1.sub04n178/
> > total 24
> > -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
> > -rw-r--r-- 1 sgeadmin sge 1891 2008-05-27 08:58 config
> > -rw-r--r-- 1 sgeadmin sge 1845 2008-05-27 08:58 environment
> > -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 error
> > -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 exit_status
> > -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
> > -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 pid
> > -rw-r--r-- 1 root     sge 2665 2008-05-27 08:58 trace
> > 
> > [15:14:51]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1>
> 
> So, it's a parallel job, and the local qrsh ends up in root. 
> The qrsh is the default configuration or configured to be ssh?

Yes, these are parallel jobs. qrsh belongs to root, and I use RSH for
communication:

[15:56:30]udo at sub04n157:~>cd /opt/SGE/
[15:56:35]udo at sub04n157:/opt/SGE>find . -name qrsh
./bin/lx24-x86/qrsh
[15:56:41]udo at sub04n157:/opt/SGE>ls -l ./bin/lx24-x86/qrsh
lrwxrwxrwx 1 root bin 3 2007-04-23 10:56 ./bin/lx24-x86/qrsh -> qsh
[15:56:46]udo at sub04n157:/opt/SGE>ls -l ./bin/lx24-x86/
total 21476
-rwxr-xr-x 1 root root 1277114 2005-05-24 04:22 qacct
-rwxr-xr-x 1 root root 1143438 2005-05-24 04:22 qalter
-rwxr-xr-x 1 root root 1326482 2005-05-24 04:22 qconf
-rwxr-xr-x 1 root root  798911 2005-05-24 04:22 qdel
lrwxrwxrwx 1 root bin        6 2007-04-23 10:56 qhold -> qalter
-rwxr-xr-x 1 root root 1300303 2005-05-24 04:22 qhost
lrwxrwxrwx 1 root bin        3 2007-04-23 10:56 qlogin -> qsh
-rwxr-xr-x 1 root root  433760 2005-05-24 04:22 qmake
-rwxr-xr-x 1 root root  813382 2005-05-24 04:22 qmod
-rwxr-xr-x 1 root root 1872990 2005-05-24 04:22 qmon
-rwxr-xr-x 1 root root  749708 2005-05-24 04:22 qping
lrwxrwxrwx 1 root bin        6 2007-04-23 10:56 qresub -> qalter
lrwxrwxrwx 1 root bin        6 2007-04-23 10:56 qrls -> qalter
lrwxrwxrwx 1 root bin        3 2007-04-23 10:56 qrsh -> qsh
lrwxrwxrwx 1 root bin        5 2007-04-23 10:56 qselect -> qstat
-rwxr-xr-x 1 root root 1200723 2005-05-24 04:22 qsh
-rwxr-xr-x 1 root root 1373553 2005-05-24 04:22 qstat
-rwxr-xr-x 1 root root 1257588 2005-05-24 04:22 qsub
-rwxr-xr-x 1 root root 1923977 2005-05-24 04:22 qtcsh
-rwxr-xr-x 1 root root  161613 2005-05-24 04:22 sge_coshepherd
-rwxr-xr-x 1 root root 1389783 2005-05-24 04:22 sge_execd
-rwxr-xr-x 1 root root 1795500 2005-05-24 04:22 sge_qmaster
-rwxr-xr-x 1 root root 1499987 2005-05-24 04:22 sge_schedd
-rwxr-xr-x 1 root root  805330 2005-05-24 04:22 sge_shadowd
-rwxr-xr-x 1 root root  804987 2005-05-24 04:22 sge_shepherd
[15:56:50]udo at sub04n157:/opt/SGE>
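
The listing above shows who owns the binaries. Whether qrsh uses the default startup method or ssh is a cluster configuration setting, which could be checked with something like the sketch below. It assumes the usual parameter names from sge_conf (rsh_command, rsh_daemon, rlogin_command, rlogin_daemon, qlogin_command, qlogin_daemon); the helper name remote_startup_settings is made up for illustration. If nothing is printed, these parameters are presumably not set in the configuration that was read.

```shell
# remote_startup_settings: filter a cluster configuration read from stdin
# and keep only the lines that select the rsh/rlogin/qlogin startup method.
# (hypothetical helper name; parameter names assumed from sge_conf)
remote_startup_settings() {
    grep -E '^(rsh|rlogin|qlogin)_(command|daemon)[[:space:]]'
}
```

Usage: qconf -sconf | remote_startup_settings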

Regards,
v


> -- Reuti
> 
> >
> >
> > So, as you see, in the active_jobs directory trace belongs to the
> > user. That is fine. But in the subdirectory (1.sub04n178 in this
> > example) trace is owned by root.
> >
> > And it is general behavior in the system.
> >
> > Regards,
> > v
> >
> >
> >> -----Original Message-----
> >> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
> >> Sent: Tuesday, May 27, 2008 14:47
> >> To: users at gridengine.sunsource.net
> >> Subject: Re: [GE users] Plz help with strange shepherd message
> >>
> >> Check that the host where the file is generated has permission to
> >> write to the /opt/SGE/spool/sub04n157/active_jobs directory as
> >> root.
> >>
> >> Daniel
> >>
> >> Viktor Oudovenko wrote:
> >>> Hi,
> >>>
> >>> Recently I was playing with job suspension and wrote
> >>> suspend/resume scripts, and from time to time (very often it is OK)
> >>> for parallel jobs I see that every minute a file shepherd.XXXX,
> >>> where XXXX is a number, is generated in the /tmp directory. Please
> >>> see below the usual content of one of those files.
> >>> Please let me know what might cause this kind of behavior.
> >>>
> >>> shepherd.30448
> >>> ::::::::::::::
> >>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC: open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace) failed: Permission denied
> >>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC: open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace) failed: Permission denied
> >>>
> >>> Thank you very much for your help,
> >>> Vic
> >>> P.S. shepherd.XXXX is owned by the user who runs the job.
> >>>
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>>
> >>>





