[GE users] sge_shepherd eating up lots of cpu

brs brs at usf.edu
Thu Nov 20 17:46:48 GMT 2008


Hi, all,

I've seen, in some instances, sge_shepherd using lots of CPU time:

Output from 'top'
----
10757 root      16   0 83060 2092 1676 R  162  0.0  40665:51 sge_shepherd
12422 root      16   0 83056 2084 1676 R  152  0.0  39756:57 sge_shepherd
 8700 root      16   0 83052 2080 1676 R  150  0.0  40704:49 sge_shepherd
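
Since each process above is shown at well over 100% CPU, more than one thread per process is likely busy. A minimal sketch for listing per-thread CPU time on Linux follows, assuming the /proc/<pid>/task/<tid>/stat layout described in proc(5). The function names parse_task_stat and thread_cpu are made up for this illustration and are not part of Grid Engine.

```python
import os
import sys


def parse_task_stat(stat_line):
    """Return (tid, comm, state, utime_ticks, stime_ticks) from one
    /proc/<pid>/task/<tid>/stat line (format described in proc(5))."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split around the LAST ')'.
    head, _, tail = stat_line.rpartition(")")
    tid_text, _, comm = head.partition("(")
    fields = tail.split()
    # tail starts at field 3 (state); utime is field 14, stime is field 15.
    return int(tid_text), comm, fields[0], int(fields[11]), int(fields[12])


def thread_cpu(pid):
    """Yield parsed stat tuples for every thread of pid (Linux only)."""
    task_dir = "/proc/%d/task" % pid
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as handle:
                yield parse_task_stat(handle.read())
        except OSError:
            continue  # thread exited between listdir() and open()


# Only runs when a pid is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    ticks_per_sec = os.sysconf("SC_CLK_TCK")
    for tid, comm, state, utime, stime in thread_cpu(int(sys.argv[1])):
        print("%6d %-16s %s user=%.1fs sys=%.1fs" % (
            tid, comm, state, utime / ticks_per_sec, stime / ticks_per_sec))
```

Run it with the pid of one of the sge_shepherd processes as the only argument. Running it twice a few seconds apart and comparing the numbers shows which threads are accumulating CPU time.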

I attached strace to one of the processes and saw lots of this:

----
strace -f -p <pid>
...
[pid 12427] futex(0x51beed0, FUTEX_WAKE, 1) = 0
[pid 12427] futex(0x51be3d0, FUTEX_WAKE, 1 <unfinished ...>
[pid 12422] futex(0x51be3d0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 12427] <... futex resumed> )       = 0
[pid 12422] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 12427] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 12422] futex(0x51be3d0, FUTEX_WAKE, 1 <unfinished ...>
[pid 12427] <... clock_gettime resumed> {1227202263, 839309000}) = 0
[pid 12422] <... futex resumed> )       = 0
[pid 12427] futex(0x51bef34, FUTEX_WAIT, 1156052373, {0, 999985000} <unfinished ...>
[pid 12422] futex(0x51bef34, FUTEX_WAKE_OP, 1, 1, 0x51bef30, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0} <unfinished ...>
[pid 12427] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 12422] <... futex resumed> )       = 0
[pid 12427] futex(0x51beed0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 12422] futex(0x51beed0, FUTEX_WAKE, 1 <unfinished ...>
[pid 12427] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 12422] <... futex resumed> )       = 0
[pid 12427] futex(0x51beed0, FUTEX_WAKE, 1) = 0
[pid 12427] futex(0x51be3d0, FUTEX_WAKE, 1 <unfinished ...>
[pid 12422] futex(0x51be3d0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 12427] <... futex resumed> )       = 0
[pid 12422] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 12427] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 12422] futex(0x51be3d0, FUTEX_WAKE, 1 <unfinished ...>
[pid 12427] <... clock_gettime resumed> {1227202263, 839502000}) = 0
[pid 12422] <... futex resumed> )       = 0
[pid 12427] futex(0x51bef34, FUTEX_WAIT, 1156052375, {0, 999981000} <unfinished ...>
[pid 12422] futex(0x51bef34, FUTEX_WAKE_OP, 1, 1, 0x51bef30, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0} <unfinished ...>
[pid 12427] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
...

Does anyone have any clues for me? I'll keep trying to diagnose it here. The
nodes have 8 slots, one per CPU. This particular execd host was running 6
SLAVE tasks for several parallel jobs.
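
To put numbers on the futex pattern in the trace above, completed calls can be tallied per thread and outcome. The sketch below is a rough illustration, not a definitive tool. It assumes the trace has been saved to a file with one event per line (for example with strace's -o option, so nothing is wrapped by a mailer), and the names LINE_RE and summarize are made up for this example.

```python
import re
import sys
from collections import Counter

# Matches completed calls in "strace -f" output, either in the
# "[pid N] call(...) = ret" form or the "N call(...) = ret" form
# written by "strace -f -o file". "<unfinished ...>" lines carry
# no result, so they do not match and are skipped.
LINE_RE = re.compile(
    r"^(?:\[pid\s+(?P<pid1>\d+)\]|(?P<pid2>\d+))\s+"
    r"(?:<\.\.\. (?P<resumed>\w+) resumed>|(?P<call>\w+)\()"
    r".*\s=\s+(?P<ret>\S+)(?:\s+(?P<errno>E[A-Z0-9]+))?"
)


def summarize(lines):
    """Count completed syscalls keyed by (pid, syscall, outcome),
    where outcome is the errno name or "ok" if none was reported."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        pid = int(match.group("pid1") or match.group("pid2"))
        syscall = match.group("call") or match.group("resumed")
        outcome = match.group("errno") or "ok"
        counts[(pid, syscall, outcome)] += 1
    return counts


# Only runs when a trace file is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for (pid, syscall, outcome), n in summarize(handle).most_common():
            print("%8d  pid %-6d %-16s %s" % (n, pid, syscall, outcome))
```

A large and steadily growing count of futex calls ending in EAGAIN for the same two thread ids would match what the excerpt shows: the waits return immediately instead of blocking.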

I'm on 6.2, update level indeterminate (it seems the update no longer shows
up in the version strings). I downloaded this version from Sun's download site.

-Brian Smith


-- 
Brian Smith
HPC Systems Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. LIB618
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89250
