[GE users] Problem of SGE

kdoman kdoman07 at gmail.com
Mon Jul 19 17:57:39 BST 2010



I think I fixed it.

The owner/permissions on the mpich2_mpd/ directory on the submit hosts
were root:root 755; I changed them to sge:sge 755 and now things are
fine! All the compute nodes already had sge:sge 755.
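
For reference, this is roughly what I checked and changed on the submit
hosts (in my setup the mpich2_mpd/ directory sits directly under
$SGE_ROOT; adjust the path if yours differs):

    ls -ld $SGE_ROOT/mpich2_mpd              # check owner and mode
    chown -R sge:sge $SGE_ROOT/mpich2_mpd    # give it to the SGE admin user
    chmod 755 $SGE_ROOT/mpich2_mpd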

I hope it still makes sense to you.

Thanks!
K.


On Mon, Jul 19, 2010 at 4:15 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
>
> Am 17.07.2010 um 17:16 schrieb kdoman:
>
>> Reuti -
>> I think I got to the cause now, but still don't understand why:
>>
>> I followed your method for the mpich2_mpd integration. After that, I
>> sent out the short submit script for users to submit their parallel
>> jobs. We have one user who was so used to the old way of doing things
>> that, whenever he wanted to kill his job, he would log in to the
>> individual compute nodes and kill the mpd processes by hand. This time,
>> instead of simply using qdel, he again went onto the compute nodes and
>> killed the processes. Each time he did that, the sge_qmaster process
>> crashed.
>>
>> Does this make any sense?
>
> no, this is not supposed to happen. On the one hand, a `qdel` should be sufficient for a tightly integrated job to be removed in a clean way. On the other hand, killing any job on a node (i.e. its process) shouldn't crash the qmaster.
>
> Does the user kill just the process on one of the parallel slave nodes, and as a result the qmaster dies? Are the qmaster and execd running under the root account? Is there anything in the qmaster messages file (or in the execd's)?
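>
> For example, assuming classic spooling and the default cell name, the relevant files would usually be (the node name is just taken from your earlier output):
>
>   # on the qmaster host
>   tail -n 100 $SGE_ROOT/default/spool/qmaster/messages
>   # on the execution host where the slave task ran
>   tail -n 100 $SGE_ROOT/default/spool/node28/messages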
>
> -- Reuti
>
>
>> On Fri, Jul 9, 2010 at 11:13 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>> Am 09.07.2010 um 17:59 schrieb kdoman:
>>>
>>>> I take it back.
>>>>
>>>> Looking into the node, I found the node's messages file at
>>>> $SGE_ROOT/default/spool/node28/messages filled up 100% of the disk
>>>> with these lines:
>>>
>>> When the spool directory is shared between all nodes, it can indeed take the complete cluster down. Best is to have the spool directories local on each node, e.g. at /var/spool/sge; the appropriate subdirectories will be created automatically.
>>>
>>> http://gridengine.sunsource.net/howto/nfsreduce.html
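>>>
>>> The execd spool directory is part of the cluster configuration, and a host-local value can be set there as well; e.g. (the path /var/spool/sge below is only an example):
>>>
>>>   # show the current global setting
>>>   qconf -sconf | grep execd_spool_dir
>>>   # edit the local configuration of one node and add/change:
>>>   #   execd_spool_dir /var/spool/sge
>>>   qconf -mconf node28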
>>>
>>> Besides this, there is a script /usr/sge/util/logchecker.sh. You could also use the system's standard logrotate, but the SGE-supplied script will discover the appropriate location wherever your messages file is placed.
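>>>
>>> If you go the logrotate route, a minimal snippet along these lines would keep the files bounded (assuming $SGE_ROOT is /usr/sge and the default cell, as in your paths; copytruncate because sge_execd keeps the file open):
>>>
>>>   /usr/sge/default/spool/*/messages {
>>>       weekly
>>>       rotate 4
>>>       compress
>>>       missingok
>>>       notifempty
>>>       copytruncate
>>>   }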
>>>
>>> -- Reuti
>>>
>>>
>>>>
>>>> 07/03/2010 09:35:29|  main|node28|W|get exit ack for pe task
>>>> 1.compute-2-8 but task is not in state exiting
>>>>
>>>> Once I zeroed out the messages file, things worked again. I think some
>>>> of these nodes were taking parallel jobs before and the file system
>>>> filled up, so when someone submitted serial jobs, it threw the queue
>>>> into an error state?
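>>>>
>>>> For anyone hitting the same thing, this is roughly how I checked and cleared it (node28 is just the node from my case):
>>>>
>>>>   df -h $SGE_ROOT/default/spool                     # file system full?
>>>>   du -sh $SGE_ROOT/default/spool/node28/messages    # how big is the log?
>>>>   : > $SGE_ROOT/default/spool/node28/messages       # truncate in place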
>>>>
>>>>
>>>>
>>>> On Fri, Jul 9, 2010 at 10:49 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>>>> Am 09.07.2010 um 17:25 schrieb kdoman:
>>>>>
>>>>>> Hi Reuti -
>>>>>> This is so odd! I don't recall my queues ever running into the error
>>>>>> state until recently, and the only thing I implemented recently was
>>>>>> the MPICH2 integration following your method.
>>>>>>
>>>>>> The error is very random. One of my clusters has around 2000 serial
>>>>>> jobs right now, and last night almost 20% of the nodes ended up with
>>>>>> their queue in the error state. I ran "qmod -c" to clear the error,
>>>>>> and this morning some of the nodes were in error again.
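>>>>>>
>>>>>> What I ran was roughly this (the queue name is just an example from my setup):
>>>>>>
>>>>>>   qstat -f -qs E          # list queue instances in error state
>>>>>>   qmod -c all.q@node28    # clear the error on one instance
>>>>>>   qmod -c '*'             # or clear it everywhere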
>>>>>
>>>>> You mean it's even failing when there are no MPICH2 jobs at all? All you added were static configuration entries, which won't affect the network communication at all.
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> K.
>>>>>>
>>>>>> On Mon, Jul 5, 2010 at 4:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Am 04.07.2010 um 17:08 schrieb gqc606:
>>>>>>>
>>>>>>>> I installed SGE and MPICH2 on my computers and integrated them following this page: <http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html>
>>>>>>>> At first everything worked well, but about thirty hours later I found these error messages in the spool directory on one of my compute nodes:
>>>>>>>>
>>>>>>>> [root at compute-0-0 ~]# cat /opt/gridengine/default/spool/compute-0-0/messages
>>>>>>>> 06/25/2010 22:26:28| main|compute-0-0|E|can't send asynchronous message to commproc (qmaster:1) on host "cluster.local": can't resolve host name
>>>>>>>> 06/25/2010 22:26:52| main|compute-0-0|E|commlib error: got select error (Connection reset by peer)
>>>>>>>
>>>>>>> this doesn't look like it is connected to the MPICH2 setup, but rather like a NIS problem. Can all hostnames be resolved on all machines? Is the spool directory on a shared file system, or is it local on each machine?
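>>>>>>>
>>>>>>> E.g. on the node itself you could check whether "cluster.local" resolves, both via the system resolver and via the SGE utility (the arch subdirectory depends on what $SGE_ROOT/util/arch reports):
>>>>>>>
>>>>>>>   getent hosts cluster.local
>>>>>>>   $SGE_ROOT/utilbin/`$SGE_ROOT/util/arch`/gethostbyname cluster.local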
>>>>>>>
>>>>>>> Only MPICH2 jobs are affected?
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I am confused and don't know how to solve this problem. Can anyone give me some advice? Thanks!
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>


