[GE users] Problem of SGE

kdoman kdoman07 at gmail.com
Fri Jul 16 20:09:04 BST 2010


Hello -
This thing came up again today on the same cluster, which runs GE 6.2u4.
This time, no filesystem has filled up to 100% as before. I have been
searching for this "commlib error: got select error" elsewhere, but
haven't found much helpful info.
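
For anyone who wants to rule out the two causes raised earlier in this thread (a full spool filesystem and hostnames that do not resolve), here is a minimal Python sketch. The function names are made up for illustration and nothing here is part of Grid Engine; it only uses the Python standard library.

```python
import shutil
import socket


def disk_used_percent(path):
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def unresolvable_hosts(hostnames):
    """Return the subset of `hostnames` that this machine cannot resolve."""
    failed = []
    for name in hostnames:
        try:
            socket.gethostbyname(name)
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            failed.append(name)
    return failed
```

One would call disk_used_percent() with the node's spool directory and unresolvable_hosts() with the qmaster and execution host names, on each machine in turn, since resolution can differ from node to node.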

Stuck!
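
To see which errors dominate a node's messages file, something like the following could help. It is only a sketch, assuming the pipe-separated layout seen in the log lines quoted below (timestamp|component|host|level|text); parse_line and summarize are hypothetical helper names, not an SGE tool.

```python
from collections import Counter


def parse_line(line):
    """Split one messages line into (timestamp, component, host, level, text).

    Returns None for lines that do not have the five pipe-separated fields.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return tuple(p.strip() for p in parts)


def summarize(lines, levels=("E", "W")):
    """Count messages per (host, level, text), keeping only the given levels."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None and rec[3] in levels:
            _, _, host, level, text = rec
            counts[(host, level, text)] += 1
    return counts
```

Passing an open messages file to summarize() and printing counts.most_common(10) shows the most frequent error and warning texts per host.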


On Fri, Jul 9, 2010 at 11:13 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> Am 09.07.2010 um 17:59 schrieb kdoman:
>
>> I take it back.
>>
>> Looking into the node, I found the node's messages file at
>> $SGE_ROOT/default/spool/node28/messages filled up 100% of the disk
>> with these lines:
>
> When the spool directory is shared between all nodes, this can indeed take the complete cluster down. It is best to have the spool directories local on each node, e.g. at /var/spool/sge. Appropriate subdirectories will be created automatically.
>
> http://gridengine.sunsource.net/howto/nfsreduce.html
>
> Besides this, there is a script, /usr/sge/util/logchecker.sh. You could also use the system's standard logrotate, but the SGE-supplied script will discover the appropriate location wherever your messages file is placed.
>
> -- Reuti
>
>
>>
>> 07/03/2010 09:35:29|  main|node28|W|get exit ack for pe task
>> 1.compute-2-8 but task is not in state exiting
>>
>> Once I zeroed out the messages file, things worked again. I think some of
>> these nodes were taking parallel jobs before and the file system filled
>> up. So when someone submitted the serial jobs, it threw the queue into
>> error state?
>>
>>
>>
>> On Fri, Jul 9, 2010 at 10:49 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>> Am 09.07.2010 um 17:25 schrieb kdoman:
>>>
>>>> Hi Reuti -
>>>> This is so odd! I don't recall my queue ever running into the error state
>>>> until recently, and the only thing I implemented recently was the
>>>> MPICH2 integration following your method.
>>>>
>>>> The error is very random. One of my clusters has around 2000 serial
>>>> jobs right now, and last night almost 20% of the nodes ended up with
>>>> the error in the queue. I ran "qmod -c" to clear the error, and
>>>> this morning some of the nodes were in error again.
>>>
>>> You mean it's even failing when there are no MPICH2 jobs at all? All you did was add some entries, which won't affect the network communication at all.
>>>
>>> -- Reuti
>>>
>>>
>>>> K.
>>>>
>>>> On Mon, Jul 5, 2010 at 4:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>>>> Hi,
>>>>>
>>>>> Am 04.07.2010 um 17:08 schrieb gqc606:
>>>>>
>>>>>> I installed SGE and MPICH2 on my computers, and integrated them following this page: <http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html>
>>>>>> At first it worked well and everything was all right. But thirty hours later, I found some error messages in this file on one of my compute nodes:
>>>>>>
>>>>>> [root at compute-0-0 ~]# cat /opt/gridengine/default/spool/compute-0-0/messages
>>>>>> 06/25/2010 22:26:28| main|compute-0-0|E|can't send asynchronous message to commproc (qmaster:1) on host "cluster.local": can't resolve host name
>>>>>> 06/25/2010 22:26:52| main|compute-0-0|E|commlib error: got select error (Connection reset by peer)
>>>>>
>>>>> This doesn't look like it is connected to the MPICH2 setup, but like a NIS problem. Can all hostnames be resolved on all machines? Is the spool directory on a shared directory, or is it local on each machine?
>>>>>
>>>>> Are only MPICH2 jobs affected?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>>
>>>>>> I am confused and don't know how to solve this problem. Can anyone give me some advice? Thanks!
>>>>>>
>>>>>> ------------------------------------------------------
>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266025
>>>>>>
>>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>>>
>>>>>
>>>>> ------------------------------------------------------
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266127
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266909
>>>>
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266915
>>>
>>
>>
>>
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266921
>>
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266927
>




------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=268405

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list