[GE users] Plz help with strange shepherd message

Viktor Oudovenko udo at physics.rutgers.edu
Tue May 27 22:42:30 BST 2008


Thank you very much, Reuti,

Yes, I run MPICH1 (I also have an MPICH2 installation, but MPICH1 is still
easier to manage).
And you are right, renicing jobs is much easier than suspending them, which is
most probably what I'll do.
But I was surprised that all jobs survived many suspend/resume
cycles.
So MPI jobs in my case are very robust. I do not know why, but it is
an experimental fact.

Right now I am running a few tests under my account, and the problem I mentioned
has not shown up.

But at some point it might. Pretty strange behavior. To avoid it, last
night I restarted qmaster,
and that did help to resolve the situation for some time.
Anyway, thank you very much to everybody who tried to help.
I understand that it is a rather unusual problem.

Once more, many thanks to everybody involved for the very prompt replies!
With kind regards,
v
 

> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de] 
> Sent: Tuesday, May 27, 2008 16:51
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Plz help with strange shepherd message
> 
> On 27.05.2008 at 22:14, Viktor Oudovenko wrote:
> 
> > Daniel,
> >
> > Thank you very much for your detailed answer.
> > I have never tried to compile the SGE code.
> > I am going to upgrade from 6.0u4 to 6.1u4, hoping that this problem
> > goes away.
> > I'd say that this problem showed up only when I started to play with
> > suspend/resume.
> > Before that, everything was fine.
> 
> I would never dare to suspend a parallel application unless
> it's supported out of the box by a special suspend command from the
> parallel library. It seems you are using MPICH1 and
> might face some race conditions while suspending the tasks on
> all the nodes: it's not done at the same time for all
> processes, so some messages might get lost.
> 
> If you have enough memory in the machines, it's far
> easier to give the parallel queue a priority (i.e. nice
> value) of 19, and normal jobs a 0. This would keep the
> parallel jobs running at a low rate unless they are the only
> ones on the nodes, as nice values only matter when there
> are more processes running than cores available.
> 
> -- Reuti
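
A minimal sketch of the nice-value approach described above, assuming a parallel queue named parallel.q and a serial queue named serial.q (both names are hypothetical). It relies on the queue "priority" attribute, which Grid Engine applies as the nice value of jobs started in that queue, and on "qconf -mattr" to change a single attribute. The QCONF variable is only a hook added here so the commands can be printed instead of executed.

```shell
#!/bin/sh
# Sketch only: the queue names are hypothetical placeholders.
# QCONF can be overridden (e.g. QCONF=echo) to print the command instead of
# running it; this hook is an addition for illustration.

# set_queue_nice QUEUE NICE
# Set the "priority" attribute (nice value, -20..19) of a cluster queue.
set_queue_nice() {
    ${QCONF:-qconf} -mattr queue priority "$2" "$1"
}

# Dry run: show what would be executed.
QCONF=echo
set_queue_nice parallel.q 19   # parallel jobs run at the lowest priority
set_queue_nice serial.q 0      # normal jobs keep the default
```

Unsetting QCONF (or setting it to the full path of qconf) would apply the change for real; this needs manager rights on the cluster.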
> 
> 
> > I can provide the following information:
> >
> > I have my own suspend/resume scripts.
> >
> > They usually look like this (you might get an idea of what could
> > be wrong).
> >
> > ::::::::::::::
> > sge_resume.sh
> > ::::::::::::::
> > # Resume: send SIGCONT to every process whose ps line mentions 186328.
> > for file in sub04n149 sub04n151 sub04n152; do
> >   /usr/bin/rsh $file "ps axuf | grep -v grep | grep -v sge_suspend \
> >     | grep -v sge_shepherd | grep -v job_scripts | grep 186328 \
> >     | awk '{print \$2}' | xargs kill -CONT"
> > done
> > exit 0
> > #
> > ::::::::::::::
> > sge_suspend.sh
> > ::::::::::::::
> > # Suspend: send SIGSTOP to every process whose ps line mentions 186328.
> > for file in sub04n149 sub04n151 sub04n152; do
> >   /usr/bin/rsh $file "ps axuf | grep -v grep | grep -v sge_suspend \
> >     | grep -v sge_shepherd | grep -v job_scripts | grep 186328 \
> >     | awk '{print \$2}' | xargs kill -STOP"
> > done
> > exit 0
> > #
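
The host names in the scripts above are hard-coded. As a rough sketch, they could instead be read from the job's pe_hostfile, assuming the usual format in which the first column of each line is the host name; the function name is hypothetical.

```shell
#!/bin/sh
# job_hosts FILE
# Print each host of a parallel job once, taken from the first column of a
# pe_hostfile (one line per host: host slots queue processor-range).
job_hosts() {
    awk 'NF > 0 { print $1 }' "$1" | sort -u
}
```

A loop such as `for file in $(job_hosts /opt/SGE/spool/sub04n178/active_jobs/186328.1/pe_hostfile); do ...; done` would then replace the fixed list.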
> >
> >
> > Before, I had only rsh instead of /usr/bin/rsh, and the problem was that
> > the suspend script suspended itself.
> > Then I switched to /usr/bin/rsh and thought the problem was gone, but
> > then I saw it again.
> > And as I said, it does not always show up.
> > It might also have something to do with a qmaster restart.
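
The grep chain in the scripts matches the job id anywhere in the ps line, so it can also hit a PID or another column that happens to contain the same digits. A sketch of a narrower filter, assuming ps is asked for just the PID and the arguments (ps -eo pid,args); the function name and the skip list are hypothetical additions:

```shell
#!/bin/sh
# select_job_pids JOB_ID [SKIP_PIDS]
# Read "PID ARGS..." lines on stdin and print the PIDs whose ARGS mention
# JOB_ID. SKIP_PIDS is a comma-separated list of PIDs that are never printed.
select_job_pids() {
    awk -v id="$1" -v skip="$2" '
        BEGIN {
            n = split(skip, s, ",")
            for (i = 1; i <= n; i++) omit[s[i]] = 1
        }
        # Never select the helper processes themselves.
        ($1 in omit) { next }
        /grep|sge_suspend|sge_resume|sge_shepherd|job_scripts/ { next }
        {
            # Match only against the command line, not the PID column.
            args = $0
            sub(/^[ \t]*[0-9]+[ \t]+/, "", args)
            if (index(args, id) > 0) print $1
        }
    '
}
```

On a node this could be used as `ps -eo pid,args | select_job_pids 186328 "$$,$PPID" | xargs kill -STOP`, so that the calling shell and its parent are never stopped.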
> >
> > Regards,
> > v
> >
> >
> >
> >> -----Original Message-----
> >> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
> >> Sent: Tuesday, May 27, 2008 16:03
> >> To: users at gridengine.sunsource.net
> >> Subject: Re: [GE users] Plz help with strange shepherd message
> >>
> >> I just had a peek at the source code, and the trace file creation
> >> works like this: If the file doesn't exist yet, create it as root,
> >> and then if the job owner isn't root, chown the file to the job owner
> >> and seteuid to the job owner; if the file does exist, just open it.
> >> The error message you're seeing comes from the code segment that
> >> opens an existing file. The odd thing is that the shepherd should be
> >> running as root at that point, so it shouldn't be having a problem
> >> opening the file.
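
To see which trace files are affected, the owners can be listed for a whole job directory at once. A minimal sketch, assuming the active_jobs layout quoted further down in this thread and paths without spaces; the function name is hypothetical.

```shell
#!/bin/sh
# trace_owners DIR
# Print "owner path" for every file named "trace" below DIR, so a trace file
# owned by root next to one owned by the job owner stands out.
# Assumes the paths contain no whitespace.
trace_owners() {
    find "$1" -type f -name trace -exec ls -ld {} \; |
        awk '{ print $3, $NF }'
}
```

For example, `trace_owners /opt/SGE/spool/sub04n157/active_jobs/186117.1` would print one line per trace file in that job directory and its subdirectories.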
> >>
> >> Do you have the option to compile your own shepherd with debugging 
> >> information added?
> >>
> >> Daniel
> >>
> >>
> >> Viktor Oudovenko wrote:
> >>> Yes!
> >>> Everything is fine with users.
> >>> Moreover, in the example I gave below everything runs fine.
> >>> I noticed the problematic behavior even under my own account when I
> >>> was logged in to the machine and looked at the case.
> >>> v
> >>>
> >>>
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
> >>>> Sent: Tuesday, May 27, 2008 15:40
> >>>> To: users at gridengine.sunsource.net
> >>>> Subject: Re: [GE users] Plz help with strange shepherd message
> >>>>
> >>>> Does the given user exist on that machine?
> >>>>
> >>>> Daniel
> >>>>
> >>>> Viktor Oudovenko wrote:
> >>>>
> >>>>> Daniel,
> >>>>>
> >>>>> Root can write anywhere, that is for sure.
> >>>>> The problem is that in the directory
> >>>>> /opt/SGE/spool/sub04n157/active_jobs/186117.1
> >>>>> there is a trace file that belongs to the user, but in the subdirectory
> >>>>> 1.sub04n157 the trace file (full path
> >>>>> /opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace)
> >>>>> belongs to root.
> >>>>> And shepherd.XXXX belongs to the user, so it is natural that the user
> >>>>> cannot write to a file that belongs to root.
> >>>>> The question is why the system tries to do that.
> >>>>>
> >>>>> OK. To be clearer, here is an example from another job where the
> >>>>> permissions can be seen clearly:
> >>>>>
> >>>>> [15:14:39]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
> >>>>> ls -al
> >>>>> total 32
> >>>>> drwxr-xr-x 3 sgeadmin sge  320 2008-05-27 08:58 .
> >>>>> drwxr-xr-x 3 sgeadmin sge   72 2008-05-27 08:58 ..
> >>>>> drwxr-xr-x 2 sgeadmin sge  256 2008-05-27 08:58 1.sub04n178
> >>>>> -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
> >>>>> -rw-r--r-- 1 sgeadmin sge 1793 2008-05-27 08:58 config
> >>>>> -rw-r--r-- 1 sgeadmin sge 1577 2008-05-27 08:58 environment
> >>>>> -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 error
> >>>>> -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 exit_status
> >>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
> >>>>> -rw-r--r-- 1 sgeadmin sge 1240 2008-05-27 08:58 pe_hostfile
> >>>>> -rw-r--r-- 1 sgeadmin sge    4 2008-05-27 08:58 pid
> >>>>> -rw-r--r-- 1 camjayi  sge 4116 2008-05-27 08:58 trace
> >>>>>
> >>>>> [15:14:43]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
> >>>>> ls -l 1.sub04n178/
> >>>>> total 24
> >>>>> -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
> >>>>> -rw-r--r-- 1 sgeadmin sge 1891 2008-05-27 08:58 config
> >>>>> -rw-r--r-- 1 sgeadmin sge 1845 2008-05-27 08:58 environment
> >>>>> -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 error
> >>>>> -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 exit_status
> >>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
> >>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 pid
> >>>>> -rw-r--r-- 1 root     sge 2665 2008-05-27 08:58 trace
> >>>>>
> >>>>> [15:14:51]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
> >>>>>
> >>>>> So, as you can see, in the active_jobs directory trace belongs to the
> >>>>> user, which is fine. But in the subdirectory (1.sub04n178 in this
> >>>>> example) trace is owned by root.
> >>>>>
> >>>>> And this is the general behavior on the system.
> >>>>>
> >>>>> Regards,
> >>>>> v
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
> >>>>>> Sent: Tuesday, May 27, 2008 14:47
> >>>>>> To: users at gridengine.sunsource.net
> >>>>>> Subject: Re: [GE users] Plz help with strange shepherd message
> >>>>>>
> >>>>>> Check that the host where the file is generated has permission to
> >>>>>> write to the /opt/SGE/spool/sub04n157/active_jobs directory as
> >>>>>> root.
> >>>>>>
> >>>>>> Daniel
> >>>>>>
> >>>>>> Viktor Oudovenko wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Recently I was playing with job suspension and wrote
> >>>>>>> suspend/resume scripts. From time to time (very often it is OK),
> >>>>>>> for parallel jobs I see that a file shepherd.XXXX, where XXXX is a
> >>>>>>> number, is generated in the /tmp directory every minute. Please see
> >>>>>>> below the usual content of one of those files.
> >>>>>>>
> >>>>>>> Please let me know what might cause this kind of behavior.
> >>>>>>>
> >>>>>>> shepherd.30448
> >>>>>>> ::::::::::::::
> >>>>>>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC: open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace) failed: Permission denied
> >>>>>>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC: open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace) failed: Permission denied
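
When many shepherd.XXXX files pile up in /tmp, it can help to summarize which paths fail and why. A minimal sketch, assuming each PANIC entry sits on a single line in the file (the wrapping seen in mail is probably an artifact of quoting) and has the form "PANIC: open(PATH) failed: REASON"; the function name is hypothetical.

```shell
#!/bin/sh
# panic_summary FILE...
# Count identical "open(PATH) failed: REASON" PANIC entries across the files
# and print them as "COUNT REASON: PATH".
panic_summary() {
    sed -n \
        -e 's/[[:space:]]*$//' \
        -e 's/.*PANIC: open(\(.*\)) failed: \(.*\)$/\2: \1/p' "$@" |
        sort | uniq -c
}
```

Called as `panic_summary /tmp/shepherd.*`, it would show at a glance whether all failures point at the same trace file.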
> >>>>>>>
> >>>>>>> Thank you very much for your help,
> >>>>>>> Vic
> >>>>>>>
> >>>>>>> P.S. The shepherd.XXXX files are owned by the user who runs the job.
> >>>>>>>
> >>>>>>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net





