[GE users] Problem of SGE

kdoman kdoman07 at gmail.com
Fri Jul 9 16:59:22 BST 2010



I take it back.

Looking at the node, I found that its messages file at
$SGE_ROOT/default/spool/node28/messages had filled the disk to 100%
with these lines:

07/03/2010 09:35:29|  main|node28|W|get exit ack for pe task 1.compute-2-8 but task is not in state exiting

Once I zeroed out the messages file, things worked again. I think some of
these nodes were running parallel jobs before and the file system filled
up. So when someone submitted the serial jobs, it threw the queue into
an error state?
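
For anyone who wants to check other nodes for the same thing, here is a rough Python sketch. It assumes the spool layout seen above (one <host>/messages file per node under the spool directory) and the pipe-separated line format shown in the log excerpt. The function names, the 100 MB threshold, and the /opt/gridengine fallback are only illustrative choices, not anything SGE provides.

```python
import os
from collections import Counter
from pathlib import Path


def large_messages_files(spool_root, max_bytes):
    """Return (path, size) for each <spool_root>/<host>/messages larger than max_bytes."""
    hits = []
    for path in sorted(Path(spool_root).glob("*/messages")):
        size = path.stat().st_size
        if size > max_bytes:
            hits.append((path, size))
    return hits


def most_common_message(path, sample_lines=100000):
    """Return (text, count) for the most frequent message text in the first sample_lines lines."""
    counts = Counter()
    with open(path, errors="replace") as fh:
        for i, line in enumerate(fh):
            if i >= sample_lines:
                break
            # Assumed format: date time|component|host|level|text
            fields = line.rstrip("\n").split("|", 4)
            if len(fields) == 5:
                counts[fields[4]] += 1
    return counts.most_common(1)[0] if counts else None


def zero_out(path):
    """Truncate the file in place, leaving an empty file at the same path."""
    with open(path, "w"):
        pass


if __name__ == "__main__":
    # The fallback root and the threshold are assumptions; adjust for your site.
    sge_root = os.environ.get("SGE_ROOT", "/opt/gridengine")
    spool = os.path.join(sge_root, "default", "spool")
    for path, size in large_messages_files(spool, 100 * 1024 * 1024):
        print(path, size, most_common_message(path))
```

The script only reports by default. zero_out() is kept separate so that nothing gets truncated before you have looked at what is filling the file.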



On Fri, Jul 9, 2010 at 10:49 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> Am 09.07.2010 um 17:25 schrieb kdoman:
>
>> Hi Reuti -
>> This is so odd! I don't recall my queue ever running into the error state
>> until recently, and the only thing I implemented recently was the
>> MPICH2 integration following your method.
>>
>> The error is very random. One of my clusters has around 2000 serial
>> jobs right now and last night almost 20% of the nodes ended up with
>> the error in the queue. I ran "qmod -c" to clear out the error and
>> this morning, some of the nodes had the error again.
>
> You mean it's failing even when there are no MPICH2 jobs at all? All you did was stating entries, which won't affect the network communication at all.
>
> -- Reuti
>
>
>> K.
>>
>> On Mon, Jul 5, 2010 at 4:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>> Hi,
>>>
>>> Am 04.07.2010 um 17:08 schrieb gqc606:
>>>
>>>> I installed SGE and MPICH2 on my computers, and integrated them following this page: <http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html>
>>>> At first it worked well and everything was fine. But thirty hours later, I found some error messages in this file on one of my compute nodes:
>>>>
>>>> [root at compute-0-0 ~]# cat /opt/gridengine/default/spool/compute-0-0/messages
>>>> 06/25/2010 22:26:28| main|compute-0-0|E|can't send asynchronous message to commproc (qmaster:1) on host "cluster.local": can't resolve host name
>>>> 06/25/2010 22:26:52| main|compute-0-0|E|commlib error: got select error (Connection reset by peer)
>>>
>>> This doesn't look connected to the MPICH2 setup, but rather like a NIS problem. Can all hostnames be resolved on all machines? Is the spool directory on a shared filesystem, or is it local on each machine?
>>>
>>> Are only MPICH2 jobs affected?
>>>
>>> -- Reuti
>>>
>>>
>>>>
>>>> I am confused and don't know how to solve this problem. Can anyone give me some advice? Thanks!
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266025
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>






