[GE users] Problem of SGE

kdoman kdoman07 at gmail.com
Fri Jul 9 16:25:59 BST 2010


Hi Reuti -
This is so odd! I don't recall my queues ever going into the error state
until recently, and the only thing I changed recently was the
MPICH2 integration, following your method.

The error is very random. One of my clusters has around 2000 serial
jobs right now, and last night almost 20% of the nodes ended up with
their queues in the error state. I ran "qmod -c" to clear the errors,
and this morning some of the nodes were in the error state again.
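
To rule out the name-resolution issue raised below, something like the following could be run on each node. It is only a minimal sketch: the function name unresolved_hosts is made up for illustration, and the host list is just the two names that appear in the quoted log excerpt, so it would need to be extended with the real qmaster and exec host names.

```python
import socket


def unresolved_hosts(hostnames):
    """Return the host names that the local resolver cannot look up."""
    failed = []
    for name in hostnames:
        try:
            # Uses whatever the node is configured with (hosts file, NIS, DNS).
            socket.getaddrinfo(name, None)
        except socket.gaierror:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Assumption: names copied from the log excerpt in this thread.
    hosts = ["cluster.local", "compute-0-0"]
    for name in unresolved_hosts(hosts):
        print("cannot resolve:", name)
```

Running it from cron on a few nodes and keeping the output might show whether the lookups fail only intermittently, which would match how random the queue errors are.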

K.

On Mon, Jul 5, 2010 at 4:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
>
> On 04.07.2010 at 17:08, gqc606 wrote:
>
>> I installed SGE and MPICH2 on my computers and integrated them following this page: <http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html>
>> At first it worked well and everything was all right. But thirty hours later, I found some error messages in the spool directory on one of my compute nodes:
>>
>> [root at compute-0-0 ~]# cat /opt/gridengine/default/spool/compute-0-0/messages
>> 06/25/2010 22:26:28| main|compute-0-0|E|can't send asynchronous message to commproc (qmaster:1) on host "cluster.local": can't resolve host name
>> 06/25/2010 22:26:52| main|compute-0-0|E|commlib error: got select error (Connection reset by peer)
>
> This doesn't look related to the MPICH2 setup; it looks more like a NIS problem. Can all hostnames be resolved on all machines? Is the spool directory on a shared filesystem, or is it local on each machine?
>
> Are only MPICH2 jobs affected?
>
> -- Reuti
>
>
>>
>> I am confused and don't know how to solve this problem. Who can give me some advice? Thanks!
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266025
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
>
>




------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266909

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


