[GE users] Problem of SGE

reuti reuti at staff.uni-marburg.de
Fri Jul 9 17:13:49 BST 2010


On 09.07.2010 at 17:59, kdoman wrote:

> I take it back.
> 
> Looking into the node, I found the node's messages file at
> $SGE_ROOT/default/spool/node28/messages filled up 100% of the disk
> with these lines:

When the spool directory is shared between all nodes, it can indeed take the complete cluster down. It's best to have the spool directories local on each node, e.g. at /var/spool/sge; the appropriate subdirectories will be created automatically.

http://gridengine.sunsource.net/howto/nfsreduce.html
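
As a rough sketch of where this is configured (node28 and /var/spool/sge are taken from this thread, not a recommendation), the execd spool location is the execd_spool_dir parameter of the cluster configuration; it can be inspected globally and overridden per host:

# show the global setting
qconf -sconf | grep execd_spool_dir

# add a host-local override, e.g. for node28, by setting in the editor:
#   execd_spool_dir  /var/spool/sge
qconf -mconf node28

# afterwards the execd on that node has to be restarted, and the new
# directory must be writable by the SGE admin user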

Besides this, there is a script /usr/sge/util/logchecker.sh. You could also use the system's standard logrotate, but the SGE-supplied script will discover the appropriate location wherever your messages file is placed.
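
If you go the logrotate route instead, a minimal stanza could look like the following (just a sketch, assuming local spool directories under /var/spool/sge; adjust the glob to wherever your messages files actually live):

/var/spool/sge/*/messages {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    # copytruncate lets sge_execd keep writing to its open file
    # handle, so the daemon doesn't need to be restarted
    copytruncate
}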

-- Reuti


> 
> 07/03/2010 09:35:29|  main|node28|W|get exit ack for pe task
> 1.compute-2-8 but task is not in state exiting
> 
> Once I zeroed out the messages file, things worked again. I think some
> of these nodes were taking parallel jobs before and the file system
> filled up. So when someone submitted the serial jobs, it threw the
> queue into an error state?
> 
> 
> 
> On Fri, Jul 9, 2010 at 10:49 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>> On 09.07.2010 at 17:25, kdoman wrote:
>> 
>>> Hi Reuti -
>>> This is so odd! I don't recall my queue ever running into the error
>>> state until recently, and the only thing I implemented recently was
>>> the MPICH2 integration following your method.
>>> 
>>> The error is very random. One of my clusters has around 2000 serial
>>> jobs right now, and last night almost 20% of the nodes ended up with
>>> the error in the queue. I ran "qmod -c" to clear out the error, and
>>> this morning some of the nodes had the error again.
>> 
>> You mean it's even failing when there are no MPICH2 jobs at all? All you did was add configuration entries, which won't affect the network communication at all.
>> 
>> -- Reuti
>> 
>> 
>>> K.
>>> 
>>> On Mon, Jul 5, 2010 at 4:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>>> Hi,
>>>> 
>>>> On 04.07.2010 at 17:08, gqc606 wrote:
>>>> 
>>>>> I installed SGE and MPICH2 on my computers and integrated them following this page: <http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html>
>>>>> At first it worked well and everything was all right. But thirty hours later, I got some error messages in the messages file on one of my compute nodes:
>>>>> 
>>>>> [root at compute-0-0 ~]# cat /opt/gridengine/default/spool/compute-0-0/messages
>>>>> 06/25/2010 22:26:28| main|compute-0-0|E|can't send asynchronous message to commproc (qmaster:1) on host "cluster.local": can't resolve host name
>>>>> 06/25/2010 22:26:52| main|compute-0-0|E|commlib error: got select error (Connection reset by peer)
>>>> 
>>>> this doesn't look like it's connected to the MPICH2 setup, but rather like a NIS problem. Can all hostnames be resolved on all machines? Is the spool directory shared, or is it local on each machine?
>>>> 
>>>> Are only MPICH2 jobs affected?
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> 
>>>>> I am confused and don't know how to solve this problem. Who can give me some advice? Thanks!
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
> 
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266927

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list