[GE users] Problem with commd communications

Yogesh Chaudhary yogesh.chaudhary at amd.com
Mon Jun 14 22:47:54 BST 2004


Hi,

We have been having a similar problem with commd.

Here are a few things we do to solve it:

Stop the execution host daemons, then stop the master host daemons.  Wait
until the commd communication stops completely; check with
netstat -a | grep commd | wc -l.

Then start the master host daemons and start the execution hosts.
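
Scripted, the sequence looks roughly like this (a sketch only: the qconf
kill switches are standard, but the startup script name/path and the commd
handling depend on how your 5.3 cell was installed, so treat those lines as
placeholders):

    #!/bin/sh
    # Shut down in order: execution daemons first, then scheduler and qmaster.
    qconf -ke all        # ask every execution host to stop its sge_execd
    qconf -ks            # stop the scheduler
    qconf -km            # stop sge_qmaster
    # If your procedure also stops sge_commd on the master, kill its pid here.

    # Wait until commd has drained all of its connections.
    while [ `netstat -a | grep commd | wc -l` -gt 0 ]; do
        sleep 5
    done

    # Restart: master host first, then the execution hosts.
    # (Placeholder paths -- use whatever startup script your install created.)
    # $SGE_ROOT/default/common/rcsge                     # on the master host
    # rsh <exec_host> $SGE_ROOT/default/common/rcsge     # on each exec host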

Once we found a user who was running qstat every second, and this hosed
commd.

I have also seen this when a user deletes a large number of jobs.
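
In both cases one quick way to spot the offending client is to see which
host is holding the most open connections into the qmaster's commd port.
On Linux something like this works (535 is the commd port shown in Craig's
trace below; substitute whatever COMMD_PORT / sge_commd port your cell uses):

    # Count established connections to the commd port, grouped by peer host.
    netstat -tn | awk '$4 ~ /:535$/ && $6 == "ESTABLISHED" {print $5}' \
        | cut -d: -f1 | sort | uniq -c | sort -rn | head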

But we still hit this problem once in a while.

Thanks, Yogesh

  On Mon, 14 Jun 2004, Craig Tierney wrote:

> On Mon, 2004-06-14 at 15:33, Bernard Li wrote:
>> Hi Craig:
>>
>> Have you considered upgrading to the latest patch level (5.3p6)? There
>> have probably been a lot of bug fixes since your version.
>
> Unless I missed something, none of the bug fixes listed in
> the change-logs addressed problems like this.
>
> I would like to upgrade, and I have thought about it.  That won't
> guarantee a fix, though, and could cause other problems.  We have been
> running SGE since September of 2002, but this only started becoming a
> problem in the last 4 months.  I would rather have confidence that the
> new software is going to fix the bug before just swapping it in and
> finding new things to worry about.
>
>
>>
>> Also, do the queues that you were having problems with give back any
>> error messages in their spool directories?
>
> I did not find any messages in the spool directories that
> provide any information.
>
>
>
>>
>> Cheers,
>>
>> Bernard
>>
>> -----Original Message-----
>> From: Craig Tierney [mailto:ctierney at hpti.com]
>> Sent: Monday, June 14, 2004 13:41
>> To: users at gridengine.sunsource.net
>> Subject: [GE users] Problem with commd communications
>>
>> I am having a problem with communications to my qmaster and it results
>> in jobs not being able to run.  I reported this a while back but got
>> distracted.  I have gathered up more information about the problem.
>>
>> At times my system gets into a state where jobs do not start.  They go
>> into the 't' state, but never make it to the 'r' state.
>>
>>
>> System config:
>> SGE v5.3p1
>> Intel Xeon server (2 GB RAM, 2.2 GHz dual processor, fast ethernet)
>> ~800 clients (775 Xeon, 12 Opteron, 12 Itanium)
>>      The Opteron systems are running the 32-bit binaries
>>      The Itanium systems are running v5.3p3
>>
>> Things I have noticed:
>>
>> - When the system is no longer able to run jobs, the load on
>>   commd on qmaster is around 100% the whole time.  Generally
>>   the load on commd is < 10%.
>>
>> - Qmaster starts to report messages like:
>>
>> failed to deliver job 2401711.1 to queue "g0299.q"
>>
>> - When running strace on commd, I see numerous errors where a call to
>> read or write fails with "Resource temporarily unavailable"
>>
>> - When running sgecommdcntl I see several problems.  First, I see the
>> message "Resource temporarily unavailable" numerous times.  Generally
>> they are for writes, but I have examples where the error happens during
>> reads. Here is an example:
>>
>> write2fd: message status=12 S_ACK_THEN_PROLOG
>> send ack: 0
>> write prolog: already written=0
>> write returned 14
>> written prolog
>> write message: len=133  already written=0
>> write returned 133
>> can read fd=4
>> readfromfd(4, 0x9657a20, 0, 535)
>> message status=10
>> readfromfd: messagestatus=10
>> read ackchar=0
>> can write fd=6
>> write2fd: message status=14 S_WRITE
>> write message: len=1915795  already written=790608
>> write returned 104256
>> write returned -1 Resource temporarily unavailable
>>
>>
>> I don't know if these messages correlate exactly to the ones that I saw
>> from strace.
>>
>> Eventually the messages change.  Here is the beginning of the change:
>>
>> process received message portsec=0 commdport=535 fromfd=4 tofd=-1
>> found sender host g0255
>> * send message
>> message to send: to=(g0061 qstat 1626) from=(g0255 qmaster 1) tag=2 len=6840 mid=49986
>> can write fd=4
>> write2fd: message status=17 S_WRITE_ACK_SND
>> send ack: 0
>> target enrolled on this host
>> message waiting for receiver
>> can write fd=5
>> write2fd: message status=15 S_WRITE_ACK
>> send ack: 0
>> rescheduling message mid=48269
>> target enrolled on this host
>> message waiting for receiver
>>
>>
>> At this point the last three messages will continue to repeat with
>> different 'mid' values.  I will get roughly 900 of these and then
>> sgecommdcntl will exit.  When things are not working properly,
>> sgecommdcntl will exit within 30 seconds with the messages above.  When
>> the system is working correctly, I never see this type of message.
>>
>> I suspect this could be a load issue on the server (lots of jobs, lots
>> of calls to qsub and qstat for monitoring jobs), but I cannot pin it
>> down to any one job or user.
>>
>> Thanks,
>> Craig
>>
>>
>
>

--------------------------------------------------------------------------
Yogesh Chaudhary
Advanced Micro Devices, Inc. (PCS)
9500 Arboretum Blvd., Suite 400                       Phone:  512.602.5422
Austin, TX 78759                                       Fax: 512.602.5051



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



