[GE users] Problem with commd communications

Craig Tierney ctierney at hpti.com
Mon Jun 14 21:41:01 BST 2004

I am having a problem with communications to my qmaster
and it results in jobs not being able to run.  I reported
this a while back but got distracted.  I have gathered up 
more information about the problem.

At times my system gets into a state where jobs do not start.
They go into the 't' state, but never make it to the 'r' state.

System config:
SGE v5.3p1
Intel Xeon server (2 GB RAM, 2.2 GHz dual processor, Fast Ethernet)
~ 800 clients (775 Xeon, 12 Opteron, 12 Itanium)
     The Opteron systems are running the 32-bit binaries
     The Itanium systems are running v5.3p3

Things I have noticed:

- When the system is no longer able to run jobs, the load of
  commd on the qmaster host stays around 100% the whole time.
  Normally the load of commd is below 10%.

- Qmaster starts to report the messages:

failed to deliver job 2401711.1 to queue "g0299.q"

- When running strace on commd, I see numerous errors where
a call to read or write fails with "Resource temporarily
unavailable".

- When running sgecommdcntl I see several problems.  First,
I see the message "Resource temporarily unavailable" numerous
times.  Generally they are for writes, but I have examples
where the error happens during reads.  Here is an example:

write2fd: message status=12 S_ACK_THEN_PROLOG
send ack: 0
write prolog: already written=0
write returned 14
written prolog
write message: len=133  already written=0
write returned 133 
can read fd=4
readfromfd(4, 0x9657a20, 0, 535)
message status=10
readfromfd: messagestatus=10
read ackchar=0
can write fd=6
write2fd: message status=14 S_WRITE
write message: len=1915795  already written=790608
write returned 104256 
write returned -1 Resource temporarily unavailable

I don't know if these messages correlate exactly to the ones that
I saw from strace.  
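
For reference, "Resource temporarily unavailable" is the standard text for
EAGAIN, which a non-blocking descriptor returns when the kernel buffer is
full (on write) or empty (on read).  The sketch below is not commd code; it
only reproduces the same pattern as the trace above (some writes succeed,
then one returns -1) using a local socket pair whose peer never reads.  The
function name and chunk size are made up for illustration.

```python
import errno
import os
import socket


def fill_until_eagain(chunk_size=65536):
    """Write to a non-blocking socket until the kernel refuses more data.

    Returns (bytes_accepted, errno_of_the_failed_write).
    """
    reader, writer = socket.socketpair()
    writer.setblocking(False)
    payload = b"x" * chunk_size
    total = 0
    try:
        while True:
            # The peer never reads, so the send buffer eventually fills
            # and send() raises instead of blocking.
            total += writer.send(payload)
    except BlockingIOError as exc:
        return total, exc.errno
    finally:
        reader.close()
        writer.close()


if __name__ == "__main__":
    sent, err = fill_until_eagain()
    print("accepted", sent, "bytes, then:", os.strerror(err))
```

Seen this way, an occasional EAGAIN on a large message is expected when the
receiver is slow to drain its socket; it is the sustained rate of them that
may point to a problem.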

Eventually the messages change.  Here is the beginning of
the change:

process received message portsec=0 commdport=535 fromfd=4 tofd=-1
found sender host g0255
* send message
message to send: to=(g0061 qstat 1626) from=(g0255 qmaster 1) tag=2
len=6840 mid=49986
can write fd=4
write2fd: message status=17 S_WRITE_ACK_SND
send ack: 0
target enrolled on this host
message waiting for receiver
can write fd=5
write2fd: message status=15 S_WRITE_ACK
send ack: 0
rescheduling message mid=48269
target enrolled on this host
message waiting for receiver

At this point the last three messages will continue to repeat
with different 'mid' values.  I will get roughly 900 of these
and then sgecommdcntl will exit.  When things are not working
properly, sgecommdcntl will exit within 30 seconds with the messages
above.  When the system is working correctly, I never see this
type of message.
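
To compare a good run against a bad one, the two symptoms above can be
counted from a saved trace.  This is a rough sketch that assumes the
sgecommdcntl output was captured to a text file and that the lines look
exactly like the excerpts quoted in this message; the function name and the
counter keys are invented for the example.

```python
import re
from collections import Counter

EAGAIN_TEXT = "Resource temporarily unavailable"
RESCHEDULE_RE = re.compile(r"^rescheduling message mid=(\d+)")


def summarize_trace(lines):
    """Count EAGAIN failures and rescheduled messages in trace lines.

    Returns (counts, mids): a Counter of events and the set of distinct
    message ids that were rescheduled.
    """
    counts = Counter()
    mids = set()
    for raw in lines:
        line = raw.strip()
        if EAGAIN_TEXT in line:
            # First word names the operation, e.g. "write" in
            # "write returned -1 Resource temporarily unavailable".
            counts["eagain_" + line.split()[0]] += 1
        match = RESCHEDULE_RE.match(line)
        if match:
            counts["rescheduled"] += 1
            mids.add(int(match.group(1)))
    return counts, mids


if __name__ == "__main__":
    import sys

    # Usage: python summarize_trace.py saved_trace.txt
    with open(sys.argv[1]) as handle:
        event_counts, rescheduled_mids = summarize_trace(handle)
    print(dict(event_counts), len(rescheduled_mids), "distinct mids")
```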

I suspect this could be a load issue on the server (lots
of jobs, lots of calls to qsub and qstat for monitoring jobs),
but I cannot pin it down to any one job or user.

