[GE users] Problem with commd communications

Bernard Li bli at bcgsc.ca
Mon Jun 14 22:33:18 BST 2004


Hi Craig:

Have you considered upgrading to the latest patch level (5.3p6)? There
have probably been a lot of bug fixes since your version.

Also, do the queues that you were having problems with give back any
error messages in their spool directories?

Cheers,

Bernard 

-----Original Message-----
From: Craig Tierney [mailto:ctierney at hpti.com] 
Sent: Monday, June 14, 2004 13:41
To: users at gridengine.sunsource.net
Subject: [GE users] Problem with commd communications

I am having a problem with communications to my qmaster and it results
in jobs not being able to run.  I reported this a while back but got
distracted.  I have gathered up more information about the problem.

At times the system gets into a state where jobs do not start.  They go
into the 't' state, but never make it to the 'r' state.


System config:
SGE v5.3p1
Intel Xeon server (2 GB RAM, 2.2 GHz dual processor, Fast Ethernet)
~800 clients (775 Xeon, 12 Opteron, 12 Itanium)
     The Opteron systems are running the 32-bit binaries
     The Itanium systems are running v5.3p3

Things I have noticed:

- When the system is no longer able to run jobs, the load on
  commd on qmaster stays around 100% the whole time.  Normally
  the load on commd is below 10%.

- Qmaster starts to report messages such as:

failed to deliver job 2401711.1 to queue "g0299.q"

- When running strace on commd, I see numerous errors where a call to
read or write fails with "Resource temporarily unavailable"

- When running sgecommdcntl I see several problems.  First, I see the
message "Resource temporarily unavailable" numerous times.  Generally
it is for writes, but I have examples where the error happens during
reads. Here is an example:

write2fd: message status=12 S_ACK_THEN_PROLOG
send ack: 0
write prolog: already written=0
write returned 14
written prolog
write message: len=133  already written=0
write returned 133
can read fd=4
readfromfd(4, 0x9657a20, 0, 535) message status=10
readfromfd: messagestatus=10
read ackchar=0
can write fd=6
write2fd: message status=14 S_WRITE
write message: len=1915795  already written=790608
write returned 104256
write returned -1 Resource temporarily unavailable


I don't know if these messages correlate exactly to the ones that I saw
from strace.  
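For reference, "Resource temporarily unavailable" is the error text for
EAGAIN, which a non-blocking socket returns when the kernel send buffer
is full because the receiver is not draining it fast enough.  The sketch
below reproduces that condition in Python.  It is only an illustration
and not commd's code; the function name fill_until_eagain and the chunk
size are made up for the example.

```python
import errno
import socket


def fill_until_eagain(chunk_size=65536):
    """Write to a non-blocking socket whose peer never reads.

    Returns (bytes_written, errno_seen) once the kernel buffer is full.
    """
    # A connected pair of local sockets; 'reader' is deliberately never read.
    writer, reader = socket.socketpair()
    writer.setblocking(False)
    chunk = b"x" * chunk_size
    written = 0
    try:
        while True:
            try:
                # send() may accept fewer bytes than requested (partial write).
                written += writer.send(chunk)
            except BlockingIOError as exc:
                # The buffer is full: this is the EAGAIN case seen in the trace.
                return written, exc.errno
    finally:
        writer.close()
        reader.close()


if __name__ == "__main__":
    total, err = fill_until_eagain()
    print("accepted", total, "bytes before", errno.errorcode[err])
```

A single EAGAIN on a non-blocking socket is normal and just means "try
again later"; what stands out in the trace above is how often it occurs.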

Eventually the messages change.  Here is the beginning of the change:

process received message portsec=0 commdport=535 fromfd=4 tofd=-1
found sender host g0255
* send message
message to send: to=(g0061 qstat 1626) from=(g0255 qmaster 1) tag=2
len=6840 mid=49986
can write fd=4
write2fd: message status=17 S_WRITE_ACK_SND
send ack: 0
target enrolled on this host
message waiting for receiver
can write fd=5
write2fd: message status=15 S_WRITE_ACK
send ack: 0
rescheduling message mid=48269
target enrolled on this host
message waiting for receiver


At this point the last three messages continue to repeat with
different 'mid' values.  I get roughly 900 of these and then
sgecommdcntl exits.  When things are not working properly,
sgecommdcntl exits within 30 seconds with the messages above.  When
the system is working correctly, I never see this type of message.
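To compare a good period with a bad one, the trace output could be
tallied with something like the sketch below.  It assumes the
sgecommdcntl output has been saved as text and that its lines look like
the excerpts above; the function name summarize_trace is hypothetical
and nothing here is part of SGE.

```python
import re
import sys
from collections import Counter

EAGAIN_TEXT = "Resource temporarily unavailable"
RESCHEDULE_RE = re.compile(r"rescheduling message mid=(\d+)")
STATUS_RE = re.compile(r"write2fd: message status=\d+ (S_\w+)")


def summarize_trace(lines):
    """Count EAGAIN errors, rescheduled message ids and write2fd states."""
    eagain = 0
    rescheduled = set()
    states = Counter()
    for line in lines:
        if EAGAIN_TEXT in line:
            eagain += 1
        match = RESCHEDULE_RE.search(line)
        if match:
            rescheduled.add(int(match.group(1)))
        match = STATUS_RE.search(line)
        if match:
            states[match.group(1)] += 1
    return {
        "eagain": eagain,
        "rescheduled_mids": len(rescheduled),
        "states": dict(states),
    }


if __name__ == "__main__":
    # Reads the saved trace from standard input.
    print(summarize_trace(sys.stdin))
```

Running it over one capture taken while jobs start normally and one
taken while they are stuck in 't' would show whether the rescheduling
count is really zero in the good case.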

I suspect this could be a load issue on the server (lots of jobs, lots
of calls to qsub and qstat for monitoring jobs), but I cannot pin it
down to any one job or user.

Thanks,
Craig


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net







