[GE users] Problem of SGE

kdoman kdoman07 at gmail.com
Sat Jul 17 16:16:15 BST 2010


Reuti -
I think I've found the cause now, but I still don't understand why:

I followed your method for the mpich2_mpd integration. After that, I
sent out a short submit script for users to submit their parallel
jobs with. We have one user who is so used to the old way of doing
things, where killing a job meant manually going to the individual
compute nodes and killing the mpd processes. This time, instead of
simply using qdel, he again went onto the compute nodes and killed
the processes by hand. Each time he did that, the sge_qmaster process
crashed.

Does this make any sense?
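
For reference, the clean way to stop such a job is to let Grid Engine do
the cleanup itself. A minimal sketch, assuming the tight mpich2_mpd
integration from the howto (the user name and job ID below are only
placeholders):

# list the user's running jobs to find the job ID
qstat -u someuser

# remove the job; with the tight integration sge_execd also tears down
# the mpd ring and the slave tasks on the remote nodes
qdel 4711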

On Fri, Jul 9, 2010 at 11:13 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> On 09.07.2010 at 17:59, kdoman wrote:
>
>> I take it back.
>>
>> Looking into the node, I found that the node's messages file at
>> $SGE_ROOT/default/spool/node28/messages had filled 100% of the disk
>> with these lines:
>
> When the spool directory is shared between all nodes, it can indeed take the complete cluster down. It's best to have the spool directories local on each node, e.g. at /var/spool/sge; the appropriate subdirectories will be created automatically.
>
> http://gridengine.sunsource.net/howto/nfsreduce.html
>
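(A rough sketch of how the local spool location can be set, assuming classic
spooling; "node28" and /var/spool/sge are only examples. The relevant
parameter is execd_spool_dir in the cluster configuration:)

# check the current, global setting
qconf -sconf | grep execd_spool_dir

# override it per execution host with a local directory
# (opens an editor; add the line: execd_spool_dir /var/spool/sge)
qconf -aconf node28

# the base directory must exist on the node (owned by the SGE admin user)
# and sge_execd must be restarted before the change takes effect
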
> Besides this, there is a script /usr/sge/util/logchecker.sh. You could also use the system's standard logrotate, but the SGE-supplied script will discover the appropriate location wherever your messages file is placed.
>
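(If one prefers the system logrotate that Reuti mentions, a minimal sketch,
assuming local spool directories under /var/spool/sge; the paths and the
rotation policy are only examples:)

cat > /etc/logrotate.d/sge <<'EOF'
/var/spool/sge/*/messages {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    # truncate the live file in place so sge_execd keeps its open handle
    copytruncate
}
EOF
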
> -- Reuti
>
>
>>
>> 07/03/2010 09:35:29|  main|node28|W|get exit ack for pe task
>> 1.compute-2-8 but task is not in state exiting
>>
>> Once I zeroed out the messages file, things worked again. I think some of
>> these nodes were taking parallel jobs before and the file system filled
>> up. So when someone submitted the serial jobs, it threw the queue into
>> an error state?
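
(To see which node's spool is eating the disk and to clear it without
restarting anything, a quick sketch, using the shared spool path from above:)

# list the per-node messages files by size
du -sh $SGE_ROOT/default/spool/*/messages | sort -h

# truncate a runaway file in place rather than deleting it, so the
# open file handle of sge_execd stays valid
: > $SGE_ROOT/default/spool/node28/messages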
>>
>>
>>
>> On Fri, Jul 9, 2010 at 10:49 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>> On 09.07.2010 at 17:25, kdoman wrote:
>>>
>>>> Hi Reuti -
>>>> This is so odd! I don't recall my queue ever running into the error
>>>> state until recently, and the only thing I implemented recently was
>>>> the MPICH2 integration following your method.
>>>>
>>>> The error is very random. One of my clusters has around 2000 serial
>>>> jobs right now, and last night almost 20% of the nodes ended up with
>>>> the queue in an error state. I ran "qmod -c" to clear the errors, and
>>>> this morning some of the nodes were in error again.
>>>
>>> You mean it's even failing when there are no MPICH2 jobs at all? All you did was add static configuration entries, which won't affect the network communication at all.
>>>
>>> -- Reuti
>>>
>>>
>>>> K.
>>>>
>>>> On Mon, Jul 5, 2010 at 4:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>>>> Hi,
>>>>>
>>>>> On 04.07.2010 at 17:08, gqc606 wrote:
>>>>>
>>>>>> I installed SGE and MPICH2 on my computers and integrated them following this page: <http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html>
>>>>>> At first everything worked well. But thirty hours later, I found some error messages in this file on one of my compute nodes:
>>>>>>
>>>>>> [root@compute-0-0 ~]# cat /opt/gridengine/default/spool/compute-0-0/messages
>>>>>> 06/25/2010 22:26:28| main|compute-0-0|E|can't send asynchronous message to commproc (qmaster:1) on host "cluster.local": can't resolve host name
>>>>>> 06/25/2010 22:26:52| main|compute-0-0|E|commlib error: got select error (Connection reset by peer)
>>>>>
>>>>> This doesn't look like it's connected to the MPICH2 setup, but rather like a NIS problem. Can all hostnames be resolved on all machines? Is the spool directory on a shared filesystem, or is it local on each machine?
>>>>>
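(To check the resolution Reuti asks about, a quick sketch to run on each
node; "cluster.local" is the qmaster host from the log above, and the
utilbin path assumes a standard $SGE_ROOT layout:)

# resolution as the OS sees it (files/NIS/DNS per nsswitch.conf)
getent hosts cluster.local

# resolution as Grid Engine itself sees it
$SGE_ROOT/utilbin/$($SGE_ROOT/util/arch)/gethostbyname cluster.local
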
>>>>> Are only MPICH2 jobs affected?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>>
>>>>>> I am confused and don't know how to solve this problem. Can anyone give me some advice? Thanks!
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>


