[GE users] sge_shepherd eating up lots of cpu

brs brs at usf.edu
Thu Nov 20 17:58:44 GMT 2008


Oops... The pattern seems to be old jobs that no longer exist in the
queue. I'm not sure how those shepherds are still hanging around. It's
no longer an issue for us, but if it's of interest to anyone else here,
let me know if you need more info.
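
For anyone who wants to check for the same thing, here is a rough sketch
of how leftover shepherds could be spotted. It assumes the shepherd's
command line carries the job ID as "sge_shepherd-<jobid>" (worth
confirming with `ps -eo pid,args` on your own hosts, since top only shows
"sge_shepherd"). The function name find_orphans and the file paths in the
usage comments are illustrative, not part of Grid Engine:

```shell
#!/bin/sh
# Sketch: report sge_shepherd processes whose job ID is not in the list
# of currently known jobs.
# ASSUMPTION: the process args look like "sge_shepherd-<jobid> ...".

# find_orphans PS_FILE JOBS_FILE
#   PS_FILE:   lines of "PID ARGS" (for example from: ps -eo pid=,args=)
#   JOBS_FILE: one active job ID per line
# Prints "PID JOBID" for every shepherd whose job is not listed.
find_orphans() {
    awk '
        # First file argument: remember every active job ID.
        FILENAME == ARGV[1] { active[$1] = 1; next }
        # Second file argument: match shepherd processes by args.
        $2 ~ /(^|\/)sge_shepherd-[0-9]+$/ {
            job = $2
            sub(/^.*sge_shepherd-/, "", job)
            if (!(job in active)) print $1, job
        }
    ' "$2" "$1"
}

# Hypothetical usage on an execd host (paths are placeholders):
#   ps -eo pid=,args= > /tmp/ps.txt
#   qstat -u '*' | awk 'NR > 2 { print $1 }' > /tmp/jobs.txt
#   find_orphans /tmp/ps.txt /tmp/jobs.txt
```

The two inputs are passed as files so the matching can be tried on saved
output first. Anything it reports should be checked by hand before being
killed.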

-Brian

brs wrote:
> Hi, all,
>
> I've seen, in some instances, sge_shepherd using lots of CPU time:
>
> Output from 'top'
> ----
> 10757 root      16   0 83060 2092 1676 R  162  0.0  40665:51 sge_shepherd
> 12422 root      16   0 83056 2084 1676 R  152  0.0  39756:57 sge_shepherd
>  8700 root      16   0 83052 2080 1676 R  150  0.0  40704:49 sge_shepherd
>
> I attached strace to one of the processes and saw lots of this:
>
> ----
> strace -f -p <pid>
> ...
> [pid 12427] futex(0x51beed0, FUTEX_WAKE, 1) = 0
> [pid 12427] futex(0x51be3d0, FUTEX_WAKE, 1 <unfinished ...>
> [pid 12422] futex(0x51be3d0, FUTEX_WAIT, 2, NULL <unfinished ...>
> [pid 12427] <... futex resumed> )       = 0
> [pid 12422] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 12427] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> [pid 12422] futex(0x51be3d0, FUTEX_WAKE, 1 <unfinished ...>
> [pid 12427] <... clock_gettime resumed> {1227202263, 839309000}) = 0
> [pid 12422] <... futex resumed> )       = 0
> [pid 12427] futex(0x51bef34, FUTEX_WAIT, 1156052373, {0, 999985000} <unfinished ...>
> [pid 12422] futex(0x51bef34, FUTEX_WAKE_OP, 1, 1, 0x51bef30, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0} <unfinished ...>
> [pid 12427] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 12422] <... futex resumed> )       = 0
> [pid 12427] futex(0x51beed0, FUTEX_WAIT, 2, NULL <unfinished ...>
> [pid 12422] futex(0x51beed0, FUTEX_WAKE, 1 <unfinished ...>
> [pid 12427] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 12422] <... futex resumed> )       = 0
> [pid 12427] futex(0x51beed0, FUTEX_WAKE, 1) = 0
> [pid 12427] futex(0x51be3d0, FUTEX_WAKE, 1 <unfinished ...>
> [pid 12422] futex(0x51be3d0, FUTEX_WAIT, 2, NULL <unfinished ...>
> [pid 12427] <... futex resumed> )       = 0
> [pid 12422] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 12427] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> [pid 12422] futex(0x51be3d0, FUTEX_WAKE, 1 <unfinished ...>
> [pid 12427] <... clock_gettime resumed> {1227202263, 839502000}) = 0
> [pid 12422] <... futex resumed> )       = 0
> [pid 12427] futex(0x51bef34, FUTEX_WAIT, 1156052375, {0, 999981000} <unfinished ...>
> [pid 12422] futex(0x51bef34, FUTEX_WAKE_OP, 1, 1, 0x51bef30, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0} <unfinished ...>
> [pid 12427] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> ...
> Anyone have any clues for me?   I'll keep trying to diagnose here.  The 
> nodes have 8 slots, 1/cpu.  This particular execd host was running 6 
> SLAVE tasks for several parallel jobs.
>
> I'm on 6.2, but the update level is indeterminate (it seems the update
> number no longer shows up in the version strings). I downloaded this
> version from Sun's download site.
>
> -Brian Smith
>
>
>   
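
To put numbers on the futex ping-pong in the trace quoted above, a small
tally per thread can help. This is only a sketch: it assumes the trace was
captured from strace's stderr (for example `strace -f -p <pid> 2> FILE`),
so that lines keep the "[pid N]" prefix shown above. The function name
tally_futex is made up for illustration:

```shell
#!/bin/sh
# Sketch: per-thread count of futex calls started and EAGAIN results in a
# saved "strace -f" log.
# ASSUMPTION: each line starts with "[pid N]" as in the trace above.

# tally_futex LOGFILE
# Prints one line per thread: "<pid> futex=<count> eagain=<count>"
tally_futex() {
    awk '
        $1 == "[pid" {
            pid = $2
            sub(/\]$/, "", pid)                  # "12427]" becomes "12427"
            seen[pid] = 1
            if ($3 ~ /^futex\(/) calls[pid]++    # a futex call was started
            if ($0 ~ /EAGAIN/)   eagain[pid]++   # a call returned EAGAIN
        }
        END {
            for (pid in seen)
                printf "%s futex=%d eagain=%d\n", pid, calls[pid], eagain[pid]
        }
    ' "$1" | sort -n
}

# Hypothetical usage (the log path is a placeholder):
#   tally_futex /tmp/shepherd.strace
```

A high EAGAIN count relative to futex calls over a short capture would be
consistent with two threads spinning on each other rather than sleeping.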


-- 
Brian Smith
HPC Systems Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. LIB618
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89252
