[GE users] Plz help with strange shepherd message

Viktor Oudovenko udo at physics.rutgers.edu
Tue May 27 21:47:31 BST 2008


It does not look similar.
In my case everything is local.
But why does the system need 2 trace files?

I think the problem is that at some point the system gets confused and,
instead of writing to the first trace file, which belongs to the user, it
starts to write to the second one in the subdirectory, which belongs to root.
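
To see the mismatch quickly for a given job, something like the following rough sketch could be used. It assumes GNU find (for -printf) and the directory layout shown in the listings quoted below; the function name trace_owners is only an illustrative choice, not anything that ships with SGE.

```shell
#!/bin/sh
# Sketch: print "<owner> <path>" for each file named "trace" found in one
# active_jobs entry and in its immediate subdirectories.
# Assumption: GNU find is available (-maxdepth and -printf are GNU extensions).
trace_owners() {
    job_dir=$1    # e.g. /opt/SGE/spool/sub04n178/active_jobs/186328.1
    find "$job_dir" -maxdepth 2 -type f -name trace -printf '%u %p\n' | sort -k2
}

# When called with a directory argument, report on that directory.
if [ -n "${1:-}" ]; then
    trace_owners "$1"
fi
```

If the two lines printed show different owners (the job owner for one, root for the other), that is the situation I am describing.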

Regards,
v 

> -----Original Message-----
> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
> Sent: Tuesday, May 27, 2008 16:39
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Plz help with strange shepherd message
> 
> Check out issue 1752:
> 
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1752
> 
> Daniel
> 
> Viktor Oudovenko wrote:
> > Daniel,
> >
> > Thank you very much for your detailed answer.
> > I never tried to compile the SGE code.
> > I am going to update 6.0u4 to 6.1u4, hoping that this problem is gone.
> > I'd say that this problem showed up only when I started to play with
> > suspend/resume stuff.
> > Before everything was fine.
> >
> > I can provide the following information:
> >
> > I have my own suspend/resume scripts.
> >
> > They usually look like this (you might get an idea of what could
> > be wrong).
> >
> > ::::::::::::::
> > sge_resume.sh
> > ::::::::::::::
> > for file in sub04n149 sub04n151 sub04n152
> > do
> >   /usr/bin/rsh $file "ps axuf|grep -v grep|grep -v sge_suspend|grep -v sge_shepherd|grep -v job_scripts|grep 186328 | awk '{print \$2}' | xargs kill -CONT "
> > done
> > exit 0
> > #
> > ::::::::::::::
> > sge_suspend.sh
> > ::::::::::::::
> > for file in sub04n149 sub04n151 sub04n152
> > do
> >   /usr/bin/rsh $file "ps axuf|grep -v grep|grep -v sge_suspend|grep -v sge_shepherd|grep -v job_scripts|grep 186328 | awk '{print \$2}' | xargs kill -STOP "
> > done
> > exit 0
> > #
> >
> >
> > Before, I had only rsh instead of /usr/bin/rsh, and the problem was that
> > the suspend script suspended itself.
> > Then I put in /usr/bin/rsh and thought that the problem was gone, but then
> > I discovered it again.
> > And as I said, it does not always show up.
> > It might also have something to do with a qmaster restart.
> >
> > Regards,
> > v
> >
> >  
> >
> >   
> >> -----Original Message-----
> >> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
> >> Sent: Tuesday, May 27, 2008 16:03
> >> To: users at gridengine.sunsource.net
> >> Subject: Re: [GE users] Plz help with strange shepherd message
> >>
> >> I just had a peek at the source code, and the trace file 
> >> creation works like this:  If the file doesn't exist yet, 
> >> create it as root, and then if the job owner isn't root, 
> >> chown the file to the job owner and seteuid to the job owner; 
> >> if the file does exist, just open it.  The error message 
> >> you're seeing comes from the code segment that opens an 
> >> existing file.  The odd thing is that the shepherd should be 
> >> running as root at that point, so it shouldn't be having a 
> >> problem opening the file.
> >>
> >> Do you have the option to compile your own shepherd with 
> >> debugging information added?
> >>
> >> Daniel
> >>
> >>
> >> Viktor Oudovenko wrote:
> >>     
> >>> Yes!
> >>> Everything is fine with users.
> >>> Moreover, in the example I gave below everything runs fine.
> >>> I noticed the problematic behavior even under my account when I was
> >>> logged in to the machine and looked at the case.
> >>> v   
> >>>
> >>>
> >>>
> >>>   
> >>>       
> >>>> -----Original Message-----
> >>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
> >>>> Sent: Tuesday, May 27, 2008 15:40
> >>>> To: users at gridengine.sunsource.net
> >>>> Subject: Re: [GE users] Plz help with strange shepherd message
> >>>>
> >>>> Does the given user exist on that machine?
> >>>>
> >>>> Daniel
> >>>>
> >>>> Viktor Oudovenko wrote:
> >>>>     
> >>>>         
> >>>>> Daniel,
> >>>>>
> >>>>> Root can write in any place. This is for sure.
> >>>>> The problem is that in the directory
> >>>>> /opt/SGE/spool/sub04n157/active_jobs/186117.1
> >>>>> there is a trace file which belongs to the user, but in the subdirectory
> >>>>> 1.sub04n157 the trace file (full path
> >>>>> /opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace)
> >>>>> belongs to root.
> >>>>> And shepherd.XXXX belongs to a user, so it is natural that the user
> >>>>> cannot write to a file which belongs to root.
> >>>>> The question is why the system tries to do so.
> >>>>>
> >>>>> OK. To be more clear, here is an example from another job where the
> >>>>> permissions can be clearly seen:
> >>>>>
> >>>>>
> >>>>> [15:14:39]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
> >>>>> ls -al
> >>>>> total 32
> >>>>> drwxr-xr-x 3 sgeadmin sge  320 2008-05-27 08:58 .
> >>>>> drwxr-xr-x 3 sgeadmin sge   72 2008-05-27 08:58 ..
> >>>>> drwxr-xr-x 2 sgeadmin sge  256 2008-05-27 08:58 1.sub04n178
> >>>>> -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
> >>>>> -rw-r--r-- 1 sgeadmin sge 1793 2008-05-27 08:58 config
> >>>>> -rw-r--r-- 1 sgeadmin sge 1577 2008-05-27 08:58 environment
> >>>>> -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 error
> >>>>> -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 exit_status
> >>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
> >>>>> -rw-r--r-- 1 sgeadmin sge 1240 2008-05-27 08:58 pe_hostfile
> >>>>> -rw-r--r-- 1 sgeadmin sge    4 2008-05-27 08:58 pid
> >>>>> -rw-r--r-- 1 camjayi  sge 4116 2008-05-27 08:58 trace
> >>>>>
> >>>>>
> >>>>> [15:14:43]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
> >>>>> ls -l 1.sub04n178/
> >>>>> total 24
> >>>>> -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
> >>>>> -rw-r--r-- 1 sgeadmin sge 1891 2008-05-27 08:58 config
> >>>>> -rw-r--r-- 1 sgeadmin sge 1845 2008-05-27 08:58 environment
> >>>>> -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 error
> >>>>> -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 exit_status
> >>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
> >>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 pid
> >>>>> -rw-r--r-- 1 root     sge 2665 2008-05-27 08:58 trace
> >>>>>
> >>>>> [15:14:51]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
> >>>>>
> >>>>> So, as you see, in the active_jobs directory trace belongs to the user.
> >>>>> That is fine. But in the subdirectory (1.sub04n178 in this example)
> >>>>> trace is owned by root.
> >>>>>
> >>>>> And this is the general behavior in the system.
> >>>>>
> >>>>> Regards,
> >>>>> v
> >>>>>
> >>>>>
> >>>>>   
> >>>>>       
> >>>>>           
> >>>>>> -----Original Message-----
> >>>>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
> >>>>>> Sent: Tuesday, May 27, 2008 14:47
> >>>>>> To: users at gridengine.sunsource.net
> >>>>>> Subject: Re: [GE users] Plz help with strange shepherd message
> >>>>>>
> >>>>>> Check that the host where the file is generated has permission to
> >>>>>> write to the /opt/SGE/spool/sub04n157/active_jobs directory as
> >>>>>> root.
> >>>>>>
> >>>>>> Daniel
> >>>>>>
> >>>>>> Viktor Oudovenko wrote:
> >>>>>>     
> >>>>>>         
> >>>>>>             
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Recently I was playing with job suspension and wrote
> >>>>>>> suspend/resume scripts, and from time to time (very often it is OK)
> >>>>>>> for parallel jobs I see that in the /tmp directory a file
> >>>>>>> shepherd.XXXX, where XXXX is a number, is generated every minute.
> >>>>>>> Please see below the usual content of one of those files.
> >>>>>>> Please let me know what might cause this kind of behavior.
> >>>>>>>
> >>>>>>> shepherd.30448
> >>>>>>> ::::::::::::::
> >>>>>>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC:
> >>>>>>> open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace)
> >>>>>>> failed: Permission denied
> >>>>>>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC:
> >>>>>>> open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace)
> >>>>>>> failed: Permission denied
> >>>>>>>
> >>>>>>> Thank you very much for your help,
> >>>>>>> Vic
> >>>>>>> P.S. shepherd.XXXX has user permission (the user who runs the job).


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



