[GE users] Plz help with strange shepherd message

Viktor Oudovenko udo at physics.rutgers.edu
Tue May 27 21:14:27 BST 2008


Daniel,

Thank you very much for your detailed answer.
I have never tried to compile the SGE code.
I am going to upgrade from 6.0u4 to 6.1u4, hoping that this problem goes away.
I'd say that this problem showed up only when I started to play with
suspend/resume stuff.
Before that everything was fine.

I can provide the following information:

I have my own suspend/resume scripts.

They usually look like this (you might get an idea of what could be
wrong).

::::::::::::::
sge_resume.sh
::::::::::::::
for file in sub04n149 sub04n151 sub04n152 
do
  /usr/bin/rsh $file "ps axuf | grep -v grep | grep -v sge_suspend \
    | grep -v sge_shepherd | grep -v job_scripts | grep 186328 \
    | awk '{print \$2}' | xargs kill -CONT"
done
exit 0
#
::::::::::::::
sge_suspend.sh
::::::::::::::
for file in sub04n149 sub04n151 sub04n152 
do
  /usr/bin/rsh $file "ps axuf | grep -v grep | grep -v sge_suspend \
    | grep -v sge_shepherd | grep -v job_scripts | grep 186328 \
    | awk '{print \$2}' | xargs kill -STOP"
done
exit 0
#
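
Since the two scripts differ only in the signal they send, they could be
folded into one script that takes the signal as an argument. What follows is
only a sketch under assumptions: the function names build_remote_cmd and
signal_job are made up for illustration, and the job id and host list are
passed as arguments instead of being hard-coded. The grep chain itself is kept
as in the scripts above.

```shell
#!/bin/sh
# Sketch only: build_remote_cmd and signal_job are hypothetical names.

# build_remote_cmd SIGNAL JOB_ID
# Print the pipeline that should run on each node. \$2 stays literal so
# that awk on the remote side sees it; $1 and $2 are expanded here.
build_remote_cmd() {
    printf '%s' "ps axuf | grep -v grep | grep -v sge_suspend | grep -v sge_shepherd | grep -v job_scripts | grep $2 | awk '{print \$2}' | xargs kill -$1"
}

# signal_job SIGNAL JOB_ID HOST...
# Run the pipeline on every host given after the job id.
signal_job() {
    sig=$1
    job=$2
    shift 2
    for host in "$@"; do
        /usr/bin/rsh "$host" "$(build_remote_cmd "$sig" "$job")"
    done
}
```

With this, suspending would be "signal_job STOP 186328 sub04n149 sub04n151
sub04n152" and resuming the same call with CONT.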


Before, I had only rsh instead of /usr/bin/rsh, and the problem was that the
suspend script suspended itself.
Then I put in /usr/bin/rsh and thought that the problem was gone, but then I
saw it again.
And as I said, it does not always show up.
It might also have something to do with a qmaster restart.
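
One way to make it harder for such a script to hit itself might be to match
the job id against the command column only and to exclude both helper scripts
by name. A rough sketch under these assumptions: select_job_pids is a
hypothetical helper, it expects "PID COMMAND" lines such as those printed by
"ps -eo pid=,args=", and excluding sge_resume is an addition to the original
exclusion list.

```shell
#!/bin/sh
# Sketch only: select_job_pids is a hypothetical helper name.

# select_job_pids JOB_ID
# Read "PID COMMAND..." lines on stdin and print the PIDs whose command
# line contains JOB_ID, skipping the helper processes.
select_job_pids() {
    awk -v id="$1" '
        # Skip the helpers that the grep -v chain excluded
        # (sge_resume is added here as an assumption).
        /grep|sge_suspend|sge_resume|sge_shepherd|job_scripts/ { next }
        {
            # Drop the leading PID so that a PID which happens to
            # contain the job id does not produce a false match.
            cmd = $0
            sub(/^[ \t]*[0-9]+[ \t]+/, "", cmd)
            if (index(cmd, id) > 0) print $1
        }
    '
}

# Possible use on one node:
#   ps -eo pid=,args= | select_job_pids 186328 | xargs kill -STOP
```

Whether this avoids the self-suspension depends on what actually matched the
job id in the failing runs, which is not known here.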

Regards,
v

 

> -----Original Message-----
> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
> Sent: Tuesday, May 27, 2008 16:03
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Plz help with strange shepherd message
> 
> I just had a peek at the source code, and the trace file 
> creation works like this:  If the file doesn't exist yet, 
> create it as root, and then if the job owner isn't root, 
> chown the file to the job owner and seteuid to the job owner; 
> if the file does exist, just open it.  The error message 
> you're seeing comes from the code segment that opens an 
> existing file.  The odd thing is that the shepherd should be 
> running as root at that point, so it shouldn't be having a 
> problem opening the file.
> 
> Do you have the option to compile your own shepherd with 
> debugging information added?
> 
> Daniel
> 
> 
> Viktor Oudovenko wrote:
> > Yes!
> > Everything is fine with users.
> > Moreover, in the example I gave below everything runs fine.
> > I noticed the problematic behavior even under my own account when I was
> > logged in to the machine and looked at the case.
> > v
> >
> >
> >
> >   
> >> -----Original Message-----
> >> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
> >> Sent: Tuesday, May 27, 2008 15:40
> >> To: users at gridengine.sunsource.net
> >> Subject: Re: [GE users] Plz help with strange shepherd message
> >>
> >> Does the given user exist on that machine?
> >>
> >> Daniel
> >>
> >> Viktor Oudovenko wrote:
> >>> Daniel,
> >>>
> >>> Root can write anywhere, that is for sure.
> >>> The problem is that in the directory
> >>> /opt/SGE/spool/sub04n157/active_jobs/186117.1
> >>> there is a trace file which belongs to the user, but in the
> >>> subdirectory 1.sub04n157 (so the full path is
> >>> /opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/) the trace
> >>> file belongs to root.
> >>> And shepherd.XXXX belongs to a user, so it is natural that the user
> >>> cannot write to a file which belongs to root.
> >>> The question is why the system tries to do it.
> >>>
> >>> OK. To be clearer, here is an example from another job where the
> >>> permissions can be seen clearly:
> >>>
> >>>
> >>> [15:14:39]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1> ls -al
> >>> total 32
> >>> drwxr-xr-x 3 sgeadmin sge  320 2008-05-27 08:58 .
> >>> drwxr-xr-x 3 sgeadmin sge   72 2008-05-27 08:58 ..
> >>> drwxr-xr-x 2 sgeadmin sge  256 2008-05-27 08:58 1.sub04n178
> >>> -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
> >>> -rw-r--r-- 1 sgeadmin sge 1793 2008-05-27 08:58 config
> >>> -rw-r--r-- 1 sgeadmin sge 1577 2008-05-27 08:58 environment
> >>> -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 error
> >>> -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 exit_status
> >>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
> >>> -rw-r--r-- 1 sgeadmin sge 1240 2008-05-27 08:58 pe_hostfile
> >>> -rw-r--r-- 1 sgeadmin sge    4 2008-05-27 08:58 pid
> >>> -rw-r--r-- 1 camjayi  sge 4116 2008-05-27 08:58 trace
> >>>
> >>> [15:14:43]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1> ls -l 1.sub04n178/
> >>> total 24
> >>> -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
> >>> -rw-r--r-- 1 sgeadmin sge 1891 2008-05-27 08:58 config
> >>> -rw-r--r-- 1 sgeadmin sge 1845 2008-05-27 08:58 environment
> >>> -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 error
> >>> -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 exit_status
> >>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
> >>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 pid
> >>> -rw-r--r-- 1 root     sge 2665 2008-05-27 08:58 trace
> >>>
> >>> [15:14:51]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1>
> >>>
> >>> So, as you can see, in the job directory under active_jobs trace
> >>> belongs to the user. That is fine. But in the subdirectory
> >>> (1.sub04n178 in this example) trace is owned by root.
> >>>
> >>> And this is the general behavior on the system.
> >>>
> >>> Regards,
> >>> v
> >>>
> >>>
> >>>   
> >>>       
> >>>> -----Original Message-----
> >>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
> >>>> Sent: Tuesday, May 27, 2008 14:47
> >>>> To: users at gridengine.sunsource.net
> >>>> Subject: Re: [GE users] Plz help with strange shepherd message
> >>>>
> >>>> Check that the host where the file is generated has permission to
> >>>> write to the /opt/SGE/spool/sub04n157/active_jobs directory as
> >>>> root.
> >>>>
> >>>> Daniel
> >>>>
> >>>> Viktor Oudovenko wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Recently I was playing with job suspension and wrote suspend/resume
> >>>>> scripts, and from time to time (very often it is OK) for parallel
> >>>>> jobs I see that every minute a file shepherd.XXXX, where XXXX is a
> >>>>> number, is generated in the /tmp directory. Please see below the
> >>>>> usual content of one of those files.
> >>>>> Please let me know what might cause this kind of behavior.
> >>>>>
> >>>>> shepherd.30448
> >>>>> ::::::::::::::
> >>>>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC: open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace) failed: Permission denied
> >>>>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC: open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace) failed: Permission denied
> >>>>>
> >>>>> Thank you very much for your help,
> >>>>> Vic
> >>>>>
> >>>>> P.S. shepherd.XXXX has the permissions of the user who runs the job.
> >>>>>
> >>>>>
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net

