[GE users] Plz help with strange shepherd message

Reuti reuti at staff.uni-marburg.de
Tue May 27 21:50:36 BST 2008


On 27.05.2008 at 22:14, Viktor Oudovenko wrote:

> Daniel,
>
> Thank you very much for your detailed answer.
> I never tried to compile the SGE code.
> I am going to upgrade from 6.0u4 to 6.1u4, hoping that this fixes the problem.
> I'd say that the problem showed up only when I started to play with
> suspend/resume.
> Before that, everything was fine.

I would never dare to suspend a parallel application unless the
parallel library supports it out of the box with a dedicated suspend
command. It seems you are using MPICH1, so you might face race
conditions while suspending the tasks on all the nodes: the processes
are not stopped at the same time, so some messages might get lost.

If you have enough memory in the machines, it's far easier to give
the parallel queue a priority (i.e. nice value) of 19 and normal jobs
a value of 0. The parallel jobs would then run at a low rate unless
they are the only ones on the nodes, because nice values only matter
when more processes are running than cores are available.
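
As a rough sketch of the nice-value approach, the `priority` attribute of a queue could be changed with `qconf -mattr`. The queue names `parallel.q` and `serial.q` are hypothetical placeholders, and the helper below only prints the commands so they can be reviewed before being run by hand as an SGE manager.

```shell
#!/bin/sh
# Print the qconf command that would set the nice value (the "priority"
# attribute of the queue configuration) for one queue.
# Nothing is executed here; review the output, then run it by hand.
set_queue_nice() {
    queue=$1   # queue name (hypothetical, e.g. parallel.q)
    nice=$2    # nice value, 19 = lowest CPU priority
    echo "qconf -mattr queue priority $nice $queue"
}

set_queue_nice parallel.q 19   # parallel jobs yield to everything else
set_queue_nice serial.q 0      # normal jobs keep the default nice value
```

The resulting setting can be checked afterwards with `qconf -sq parallel.q`.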

-- Reuti


> I can provide the following information:
>
> I have my own suspend/resume scripts.
>
> They usually look like this (you might get an idea of what could be
> wrong).
>
> ::::::::::::::
> sge_resume.sh
> ::::::::::::::
> # Send SIGCONT to all processes of job 186328 on each listed node.
> for file in sub04n149 sub04n151 sub04n152
> do
>   /usr/bin/rsh $file "ps axuf | grep -v grep | grep -v sge_suspend | \
>     grep -v sge_shepherd | grep -v job_scripts | grep 186328 | \
>     awk '{print \$2}' | xargs kill -CONT"
> done
> exit 0
> #
> ::::::::::::::
> sge_suspend.sh
> ::::::::::::::
> # Send SIGSTOP to all processes of job 186328 on each listed node.
> for file in sub04n149 sub04n151 sub04n152
> do
>   /usr/bin/rsh $file "ps axuf | grep -v grep | grep -v sge_suspend | \
>     grep -v sge_shepherd | grep -v job_scripts | grep 186328 | \
>     awk '{print \$2}' | xargs kill -STOP"
> done
> exit 0
> #
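
The filter chain shared by both scripts could be pulled into one function so that the job ID is not hard-coded in two places. This is only a sketch, assuming (as the `grep 186328` in the scripts implies) that the job ID appears in the `ps` line of every process to be signalled. The function name `job_pids` and the sample `ps` lines are made up for illustration.

```shell
#!/bin/sh
# Read `ps axuf`-style lines on stdin and print the PIDs (column 2) of
# the processes whose line mentions the given job ID. The same helper
# processes as in the original scripts are excluded.
job_pids() {
    job_id=$1
    grep -v -e grep -e sge_suspend -e sge_shepherd -e job_scripts |
        grep -- "$job_id" |
        awk '{print $2}'
}

# Example with invented ps output instead of a live node; only the
# first line survives the filters, so this prints 4711.
printf '%s\n' \
    'camjayi  4711  0.0  0.1 ./solver /tmp/186328.1.q' \
    'root     4712  0.0  0.0 sge_shepherd-186328 -bg' \
    'camjayi  4713  0.0  0.0 /bin/sh job_scripts/186328' |
    job_pids 186328
```

On a node, the function would sit between `ps axuf` and `xargs kill -STOP` (or `-CONT`). Note that a plain `grep` on the job ID also matches any unrelated line that happens to contain the same digits, for example as a PID.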
>
>
> Before, I had only rsh instead of /usr/bin/rsh, and the problem was that
> the suspend script suspended itself.
> Then I switched to /usr/bin/rsh and thought the problem was gone, but
> then I ran into it again.
> As I said, it does not always show up.
> It might also have something to do with a qmaster restart.
>
> Regards,
> v
>
>
>
>> -----Original Message-----
>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
>> Sent: Tuesday, May 27, 2008 16:03
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Plz help with strange shepherd message
>>
>> I just had a peek at the source code, and the trace file
>> creation works like this:  If the file doesn't exist yet,
>> create it as root, and then if the job owner isn't root,
>> chown the file to the job owner and seteuid to the job owner;
>> if the file does exist, just open it.  The error message
>> you're seeing comes from the code segment that opens an
>> existing file.  The odd thing is that the shepherd should be
>> running as root at that point, so it shouldn't be having a
>> problem opening the file.
>>
>> Do you have the option to compile your own shepherd with
>> debugging information added?
>>
>> Daniel
>>
>>
>> Viktor Oudovenko wrote:
>>> Yes!
>>> Everything is fine with users.
>>> Moreover, in the example I gave below everything runs fine.
>>> I noticed problematic behavior even under my account when I was logged
>>> in to the machine and looked at the case.
>>> v
>>>
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
>>>> Sent: Tuesday, May 27, 2008 15:40
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] Plz help with strange shepherd message
>>>>
>>>> Does the given user exist on that machine?
>>>>
>>>> Daniel
>>>>
>>>> Viktor Oudovenko wrote:
>>>>
>>>>> Daniel,
>>>>>
>>>>> Root can write anywhere, that is for sure.
>>>>> The problem is that in the directory
>>>>> /opt/SGE/spool/sub04n157/active_jobs/186117.1
>>>>> there is a trace file which belongs to the user, but in the subdirectory
>>>>> 1.sub04n157 the trace file (full path
>>>>> /opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace)
>>>>> belongs to root.
>>>>> And shepherd.XXXX belongs to the user, so it is natural that the user
>>>>> cannot write to a file which belongs to root.
>>>>> The question is: why does the system try to do it?
>>>>>
>>>>> OK. To be clearer, here is an example from another job in which the
>>>>> permissions can be seen clearly:
>>>>>
>>>>>
>>>>> [15:14:39]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
>>>>> ls -al
>>>>> total 32
>>>>> drwxr-xr-x 3 sgeadmin sge  320 2008-05-27 08:58 .
>>>>> drwxr-xr-x 3 sgeadmin sge   72 2008-05-27 08:58 ..
>>>>> drwxr-xr-x 2 sgeadmin sge  256 2008-05-27 08:58 1.sub04n178
>>>>> -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
>>>>> -rw-r--r-- 1 sgeadmin sge 1793 2008-05-27 08:58 config
>>>>> -rw-r--r-- 1 sgeadmin sge 1577 2008-05-27 08:58 environment
>>>>> -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 error
>>>>> -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 exit_status
>>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
>>>>> -rw-r--r-- 1 sgeadmin sge 1240 2008-05-27 08:58 pe_hostfile
>>>>> -rw-r--r-- 1 sgeadmin sge    4 2008-05-27 08:58 pid
>>>>> -rw-r--r-- 1 camjayi  sge 4116 2008-05-27 08:58 trace
>>>>>
>>>>> [15:14:43]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
>>>>> ls -l 1.sub04n178/
>>>>> total 24
>>>>> -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
>>>>> -rw-r--r-- 1 sgeadmin sge 1891 2008-05-27 08:58 config
>>>>> -rw-r--r-- 1 sgeadmin sge 1845 2008-05-27 08:58 environment
>>>>> -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 error
>>>>> -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 exit_status
>>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
>>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 pid
>>>>> -rw-r--r-- 1 root     sge 2665 2008-05-27 08:58 trace
>>>>>
>>>>> [15:14:51]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
>>>>>
>>>>> So, as you see, in the active_jobs directory trace belongs to the user.
>>>>> That is fine. But in the subdirectory, in this example 1.sub04n178,
>>>>> trace is root owned.
>>>>>
>>>>> And it is general behavior in the system.
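
To check this ownership pattern on other nodes, something like the following could list every trace file below a job's spool directory that is not owned by the job owner. It is only a sketch: the function name is invented, and the path and user in the usage line are taken from the listing above.

```shell
#!/bin/sh
# List all files named "trace" below a directory that are NOT owned by
# the given user, together with their owner and permissions.
traces_not_owned_by() {
    dir=$1     # e.g. an active_jobs/<job>.<task> spool directory
    owner=$2   # the job owner (user name or numeric UID)
    find "$dir" -type f -name trace ! -user "$owner" -exec ls -l {} +
}

# Usage on the execution host (path and user from the example above):
# traces_not_owned_by /opt/SGE/spool/sub04n178/active_jobs/186328.1 camjayi
```

An empty result means every trace file belongs to the given user. A root-owned trace in the task subdirectory, as in the listing, would show up as one `ls -l` line.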
>>>>>
>>>>> Regards,
>>>>> v
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
>>>>>> Sent: Tuesday, May 27, 2008 14:47
>>>>>> To: users at gridengine.sunsource.net
>>>>>> Subject: Re: [GE users] Plz help with strange shepherd message
>>>>>>
>>>>>> Check that the host where the file is generated has permission to
>>>>>> write to the /opt/SGE/spool/sub04n157/active_jobs directory as
>>>>>> root.
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> Viktor Oudovenko wrote:
>>>>>>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Recently I was playing with job suspension and wrote
>>>>>>> suspend/resume scripts. From time to time (very often it is OK),
>>>>>>> for parallel jobs, I see that one file shepherd.XXXX, where XXXX is
>>>>>>> a number, is generated in the /tmp directory every minute. Please
>>>>>>> see below the usual content of one of those files.
>>>>>>>
>>>>>>> Please let me know what might cause this kind of behavior.
>>>>>>>
>>>>>>> shepherd.30448
>>>>>>> ::::::::::::::
>>>>>>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC: open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace) failed: Permission denied
>>>>>>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC: open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace) failed: Permission denied
>>>>>>>
>>>>>>> Thank you very much for your help,
>>>>>>> Vic
>>>>>>>
>>>>>>> P.S. shepherd.XXXX is owned by the user who runs the job.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net



