[GE users] Problem of SGE

reuti reuti at staff.uni-marburg.de
Fri Jul 9 16:49:35 BST 2010


On 09.07.2010 at 17:25, kdoman wrote:

> Hi Reuti -
> This is so odd! I don't recall my queues ever going into the error state
> until recently, and the only thing I implemented recently was the
> MPICH2 integration following your method.
> 
> The errors are very random. One of my clusters has around 2000 serial
> jobs right now, and last night almost 20% of the nodes ended up with
> their queues in the error state. I ran "qmod -c" to clear the errors, and
> this morning some of the nodes were in error again.

Do you mean it's failing even when there are no MPICH2 jobs at all? All you did was specify some entries, which won't affect the network communication at all.
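
Since the execd log quoted below complains that a host name can't be resolved, a quick resolution check on each node may help narrow things down. The following is only a minimal sketch: `check_hosts` is a hypothetical helper (not part of SGE), it only asks the system resolver for an IPv4 address, and the host names in the usage comment are taken from the log excerpt, so substitute your own.

```python
import socket


def check_hosts(hostnames):
    """Map each host name to its resolved IPv4 address, or None on failure.

    Hypothetical helper for illustration; it uses whatever the local
    resolver is configured with (hosts file, NIS, DNS).
    """
    results = {}
    for name in hostnames:
        try:
            results[name] = socket.gethostbyname(name)
        except OSError:
            # socket.gaierror and socket.herror are both subclasses of OSError
            results[name] = None
    return results


# Example usage (host names assumed from the log excerpt):
#   for name, addr in check_hosts(["cluster.local", "compute-0-0"]).items():
#       print(name, addr if addr else "cannot be resolved")
```

Running this on the qmaster and on a few of the affected nodes, and comparing the results, should show whether the lookups are consistent across machines.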

-- Reuti


> K.
> 
> On Mon, Jul 5, 2010 at 4:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>> Hi,
>> 
>> On 04.07.2010 at 17:08, gqc606 wrote:
>> 
>>> I installed SGE and MPICH2 on my computers and integrated them following this page: <http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html>
>>> At first it worked well and everything was all right. But thirty hours later, I found some error messages in this file on one of my compute nodes:
>>> 
>>> [root at compute-0-0 ~]# cat /opt/gridengine/default/spool/compute-0-0/messages
>>> 06/25/2010 22:26:28| main|compute-0-0|E|can't send asynchronous message to commproc (qmaster:1) on host "cluster.local": can't resolve host name
>>> 06/25/2010 22:26:52| main|compute-0-0|E|commlib error: got select error (Connection reset by peer)
>> 
>> This doesn't look like it is connected to the MPICH2 setup, but rather like a NIS problem. Can all hostnames be resolved on all machines? Is the spool directory on a shared filesystem, or is it local on each machine?
>> 
>> Are only MPICH2 jobs affected?
>> 
>> -- Reuti
>> 
>> 
>>> 
>>> I am confused and don't know how to solve this problem. Can anyone give me some advice? Thanks!
>>> 
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266025
>>> 
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>> 
>> 
>> 
> 
> 
> 
> 
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266915

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


