[GE users] Plz help with strange shepherd message

Daniel Templeton Dan.Templeton at Sun.COM
Tue May 27 21:39:25 BST 2008



Check out issue 1752:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1752

Daniel

Viktor Oudovenko wrote:
> Daniel,
>
> Thank you very much for your detailed answer.
> I have never tried to compile the SGE code.
> I am going to update from 6.0u4 to 6.1u4, hoping that this problem goes away.
> I'd say that this problem showed up only when I started to play with
> suspend/resume stuff.
> Before that everything was fine.
>
> I can provide the following information:
>
> I have my own suspend/resume scripts.
>
> They usually look like this (you might get an idea of what could be
> wrong).
>
> ::::::::::::::
> sge_resume.sh
> ::::::::::::::
> for file in sub04n149 sub04n151 sub04n152 
> do
>   /usr/bin/rsh $file "ps axuf|grep -v grep|grep -v sge_suspend|grep -v sge_shepherd|grep -v job_scripts|grep 186328 | awk '{print \$2}' | xargs kill -CONT "
> done
> exit 0
> #
> ::::::::::::::
> sge_suspend.sh
> ::::::::::::::
> for file in sub04n149 sub04n151 sub04n152 
> do
>   /usr/bin/rsh $file "ps axuf|grep -v grep|grep -v sge_suspend|grep -v sge_shepherd|grep -v job_scripts|grep 186328 | awk '{print \$2}' | xargs kill -STOP "
> done
> exit 0
> #
>
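The selection pipeline in the two scripts above could be pulled into one function and tried against a saved ps listing before any signal is sent. This is only a sketch: select_job_pids is a made-up name, sge_resume is added to the exclusions as an assumption, and the job id is still matched as a plain substring, so it can also hit unrelated numbers in the listing.

```shell
#!/bin/sh
# Hypothetical helper, not part of SGE.
# Reads "ps axuf"-style lines on stdin and prints the PID (column 2) of every
# line that mentions the job id given as $1. Lines belonging to grep, the
# suspend/resume scripts, the shepherd, or the job_scripts directory are
# skipped, as in the original pipeline (sge_resume is an added assumption).
select_job_pids() {
    grep -v -e grep -e sge_suspend -e sge_resume -e sge_shepherd -e job_scripts |
        grep -F -- "$1" |
        awk '{ print $2 }'
}
```

On each host the output could then be fed to xargs kill -STOP or xargs kill -CONT, as the original scripts do. Running the function on a captured listing first shows exactly which PIDs would be signalled.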
>
> Before, I had only rsh instead of /usr/bin/rsh, and the problem was that the
> suspend script suspended itself.
> Then I put in /usr/bin/rsh and thought that the problem was gone, but then I
> discovered it again.
> And as I said, it does not always show up.
> It might also have something to do with a qmaster restart.
>
> Regards,
> v
>
>  
>
>   
>> -----Original Message-----
>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
>> Sent: Tuesday, May 27, 2008 16:03
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Plz help with strange shepherd message
>>
>> I just had a peek at the source code, and the trace file 
>> creation works like this:  If the file doesn't exist yet, 
>> create it as root, and then if the job owner isn't root, 
>> chown the file to the job owner and seteuid to the job owner; 
>> if the file does exist, just open it.  The error message 
>> you're seeing comes from the code segment that opens an 
>> existing file.  The odd thing is that the shepherd should be 
>> running as root at that point, so it shouldn't be having a 
>> problem opening the file.
>>
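The control flow described in the message above can be paraphrased in shell. This is a rough sketch of the description only, not the shepherd's C code: open_trace and its two arguments are made-up names, and the seteuid step has no shell equivalent, so it appears only as a comment.

```shell
#!/bin/sh
# Paraphrase of the described flow; illustrative names throughout.
# Usage: open_trace TRACE_FILE JOB_OWNER
open_trace() {
    trace=$1
    owner=$2
    if [ ! -e "$trace" ]; then
        # New file: created by the caller (root, in the shepherd's case).
        printf '' > "$trace" || return 1
        if [ "$owner" != "root" ]; then
            # Hand the file over to the job owner. The real code is then
            # said to seteuid() to that owner, which a shell function
            # cannot do.
            chown "$owner" "$trace" || return 1
        fi
    else
        # Existing file: just open it for appending. According to the
        # message, this is the branch the PANIC comes from.
        printf '' >> "$trace" || return 1
    fi
}
```

If the second branch is reached by a process that is no longer root, and the existing file is owned by root with mode 0644, the open fails with "Permission denied", which would match the symptom reported in this thread.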
>> Do you have the option to compile your own shepherd with 
>> debugging information added?
>>
>> Daniel
>>
>>
>> Viktor Oudovenko wrote:
>>     
>>> Yes!
>>> Everything is fine with the users.
>>> Moreover, in the example I gave below everything runs fine.
>>> I noticed the problematic behavior even under my own account when I was
>>> logged in to the machine and looked at the case.
>>> v
>>>
>>>
>>>
>>>   
>>>       
>>>> -----Original Message-----
>>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
>>>> Sent: Tuesday, May 27, 2008 15:40
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] Plz help with strange shepherd message
>>>>
>>>> Does the given user exist on that machine?
>>>>
>>>> Daniel
>>>>
>>>> Viktor Oudovenko wrote:
>>>>     
>>>>         
>>>>> Daniel,
>>>>>
>>>>> Root can write anywhere, that is for sure.
>>>>> The problem is that in the directory
>>>>> /opt/SGE/spool/sub04n157/active_jobs/186117.1
>>>>> there is a trace file which belongs to the user, but in the subdirectory
>>>>> 1.sub04n157 (so the full path is
>>>>> /opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace) the
>>>>> trace file belongs to root.
>>>>> And shepherd.XXXX belongs to a user, so it is natural that the user
>>>>> cannot write to a file which belongs to root.
>>>>> The question is why the system tries to do it.
>>>>>
>>>>> OK. To be clearer, here is an example from another job where the
>>>>> permissions can be clearly seen:
>>>>>
>>>>> [15:14:39]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
>>>>> ls -al
>>>>> total 32
>>>>> drwxr-xr-x 3 sgeadmin sge  320 2008-05-27 08:58 .
>>>>> drwxr-xr-x 3 sgeadmin sge   72 2008-05-27 08:58 ..
>>>>> drwxr-xr-x 2 sgeadmin sge  256 2008-05-27 08:58 1.sub04n178
>>>>> -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
>>>>> -rw-r--r-- 1 sgeadmin sge 1793 2008-05-27 08:58 config
>>>>> -rw-r--r-- 1 sgeadmin sge 1577 2008-05-27 08:58 environment
>>>>> -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 error
>>>>> -rw-r--r-- 1 camjayi  sge    0 2008-05-27 08:58 exit_status
>>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
>>>>> -rw-r--r-- 1 sgeadmin sge 1240 2008-05-27 08:58 pe_hostfile
>>>>> -rw-r--r-- 1 sgeadmin sge    4 2008-05-27 08:58 pid
>>>>> -rw-r--r-- 1 camjayi  sge 4116 2008-05-27 08:58 trace
>>>>>
>>>>> [15:14:43]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
>>>>> ls -l 1.sub04n178/
>>>>> total 24
>>>>> -rw-r--r-- 1 sgeadmin sge    6 2008-05-27 08:58 addgrpid
>>>>> -rw-r--r-- 1 sgeadmin sge 1891 2008-05-27 08:58 config
>>>>> -rw-r--r-- 1 sgeadmin sge 1845 2008-05-27 08:58 environment
>>>>> -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 error
>>>>> -rw-r--r-- 1 root     sge    0 2008-05-27 08:58 exit_status
>>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 job_pid
>>>>> -rw-r--r-- 1 sgeadmin sge    5 2008-05-27 08:58 pid
>>>>> -rw-r--r-- 1 root     sge 2665 2008-05-27 08:58 trace
>>>>>
>>>>> [15:14:51]udo at sub04n178:/opt/SGE/spool/sub04n178/active_jobs/186328.1
>>>>>
>>>>> So, as you can see, in the active_jobs directory trace belongs to the
>>>>> user, which is fine. But in the subdirectory (1.sub04n178 in this
>>>>> example) trace is owned by root.
>>>>>
>>>>> And this is the general behavior in the system.
>>>>>
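The ownership pattern in the two listings above could be checked across a whole spool directory with find. A small sketch, assuming a helper name list_foreign_traces that is not part of SGE:

```shell
#!/bin/sh
# Hypothetical helper, not part of SGE.
# Usage: list_foreign_traces SPOOL_DIR USER
# Prints every file named "trace" below SPOOL_DIR that is NOT owned by USER
# (USER may be a login name or a numeric uid).
list_foreign_traces() {
    find "$1" -type f -name trace ! -user "$2" -print
}
```

With the listings shown here, running list_foreign_traces /opt/SGE/spool/sub04n178/active_jobs/186328.1 camjayi should print only the 1.sub04n178/trace file, since that one is owned by root.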
>>>>> Regards,
>>>>> v
>>>>>
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>> -----Original Message-----
>>>>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
>>>>>> Sent: Tuesday, May 27, 2008 14:47
>>>>>> To: users at gridengine.sunsource.net
>>>>>> Subject: Re: [GE users] Plz help with strange shepherd message
>>>>>>
>>>>>> Check that the host where the file is generated has permission to
>>>>>> write to the /opt/SGE/spool/sub04n157/active_jobs directory as root.
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> Viktor Oudovenko wrote:
>>>>>>     
>>>>>>         
>>>>>>             
>>>>>>> Hi,
>>>>>>>
>>>>>>> Recently I was playing with job suspension and wrote suspend/resume
>>>>>>> scripts. From time to time (very often it is OK), for parallel jobs I
>>>>>>> see that a file shepherd.XXXX, where XXXX is a number, is generated in
>>>>>>> the /tmp directory every minute. Plz see below the usual content of one
>>>>>>> of those files.
>>>>>>> Plz let me know what might cause this kind of behavior.
>>>>>>>
>>>>>>> shepherd.30448
>>>>>>> ::::::::::::::
>>>>>>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC: open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace) failed: Permission denied
>>>>>>> 05/27/2008 02:48:11 [37394:37394 30448]: PANIC: open(/opt/SGE/spool/sub04n157/active_jobs/186117.1/1.sub04n157/trace) failed: Permission denied
>>>>>>>
>>>>>>> Thank you very much for your help,
>>>>>>> Vic
>>>>>>>
>>>>>>> P.S. shepherd.XXXX has user permissions (the user who runs the job).
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
