[GE users] Problem of SGE

reuti reuti at staff.uni-marburg.de
Mon Jul 19 18:33:02 BST 2010


On 19.07.2010 at 18:57, kdoman wrote:

> I think I fixed it.
> 
> The owner/permissions of mpich2_mpd/ on the submit hosts were root:root
> 755; I changed that to sge:sge 755 and things are fine now! All the
> compute nodes already had sge:sge 755.
> 
> I hope it still makes sense to you.

No. Unless a setuid bit is set, the ownership doesn't matter. And it's executable for everyone anyway, AFAICS.
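
A quick way to double-check that (assuming the scripts really live in a directory called mpich2_mpd under $SGE_ROOT, as your paths suggest) would be something like:

  ls -ld $SGE_ROOT/mpich2_mpd $SGE_ROOT/mpich2_mpd/*
  # any setuid/setgid bit would show up here; plain 755 entries won't match
  find $SGE_ROOT/mpich2_mpd -perm /6000 -ls    # GNU find; older versions need -perm +6000

If the find reports nothing, the ownership change was most likely not the real fix.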

-- Reuti


> Thanks!
> K.
> 
> 
> On Mon, Jul 19, 2010 at 4:15 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>> Hi,
>> 
>> On 17.07.2010 at 17:16, kdoman wrote:
>> 
>>> Reuti -
>>> I think I've found the cause now, but I still don't understand why:
>>> 
>>> I followed your method for the mpich2_mpd integration. After that, I
>>> sent out a short submit script for users to submit their parallel jobs
>>> with. We have one user who was so used to the old way of doing things,
>>> where killing a job meant manually going to the individual compute
>>> nodes and killing the mpd processes. This time, instead of simply
>>> using qdel, he again went onto the compute nodes and killed the
>>> processes. Each time he did that, the sge_qmaster process crashed.
>>> 
>>> Does this make any sense?
>> 
>> no, this is not supposed to happen. On the one hand, a `qdel` should be sufficient for a tightly integrated job to be removed in a clean way. On the other hand, killing any job on a node (i.e. its process) shouldn't crash the qmaster.
>> 
>> Does the user kill just the process on one of the parallel slave nodes, and as a result the qmaster dies? Are the qmaster and execd running under the root account? Is there anything in the qmaster messages file (or in the execd's)?
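>> 
>> The usual places to look, assuming the default cell "default" and spooling under $SGE_ROOT as in your earlier mails, are roughly:
>> 
>>   tail -n 50 $SGE_ROOT/default/spool/qmaster/messages      # qmaster log on the master host
>>   tail -n 50 $SGE_ROOT/default/spool/<nodename>/messages   # execd log on the node in question
>> 
>> A crash of the qmaster should leave a trace there.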
>> 
>> -- Reuti
>> 
>> 
>>> On Fri, Jul 9, 2010 at 11:13 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>>> On 09.07.2010 at 17:59, kdoman wrote:
>>>> 
>>>>> I take it back.
>>>>> 
>>>>> Looking into the node, I found that its messages file at
>>>>> $SGE_ROOT/default/spool/node28/messages had filled up 100% of the
>>>>> disk with these lines:
>>>> 
>>>> When the spool directory is shared between all nodes, it can indeed take the complete cluster down. It's best to have the spool directories local on each node, e.g. at /var/spool/sge; the appropriate subdirectories will be created automatically.
>>>> 
>>>> http://gridengine.sunsource.net/howto/nfsreduce.html
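>>>> 
>>>> To check or change this, the execd spool directory is part of the cluster configuration; roughly (untested here, parameter name as in sge_conf):
>>>> 
>>>>   qconf -sconf | grep execd_spool_dir   # show the current setting
>>>>   qconf -mconf                          # change execd_spool_dir to e.g. /var/spool/sge
>>>> 
>>>> The new directory has to exist on every node, be owned by the admin user, and the execds have to be restarted afterwards.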
>>>> 
>>>> Besides this, there is a script /usr/sge/util/logchecker.sh. You could also use the system's standard logrotate, but the SGE-supplied script will discover the appropriate location automatically, wherever your messages file is placed.
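>>>> 
>>>> If you go the logrotate route, a minimal sketch might look like this (assuming node-local spooling under /var/spool/sge; copytruncate matters because the execd keeps the file open):
>>>> 
>>>> # sketch: rotate the execd messages files via the system logrotate
>>>> # (adjust the glob if your spool directories live elsewhere)
>>>> cat > /etc/logrotate.d/sge <<'EOF'
>>>> /var/spool/sge/*/messages {
>>>>     weekly
>>>>     rotate 4
>>>>     compress
>>>>     missingok
>>>>     copytruncate
>>>> }
>>>> EOF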
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> 
>>>>> 07/03/2010 09:35:29|  main|node28|W|get exit ack for pe task
>>>>> 1.compute-2-8 but task is not in state exiting
>>>>> 
>>>>> Once I zeroed out the messages file, things worked again. I think some
>>>>> of these nodes were running parallel jobs before and the file system
>>>>> filled up. So when someone submitted serial jobs, it threw the queue
>>>>> into an error state?
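>>>>> 
>>>>> By the way, truncating the file in place, e.g.
>>>>> 
>>>>>   : > $SGE_ROOT/default/spool/node28/messages
>>>>> 
>>>>> should be safe even while the execd is running, since it keeps its
>>>>> open file handle; removing the file instead would leave the daemon
>>>>> writing to an unlinked inode and the space wouldn't be freed.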
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Jul 9, 2010 at 10:49 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>> On 09.07.2010 at 17:25, kdoman wrote:
>>>>>> 
>>>>>>> Hi Reuti -
>>>>>>> This is so odd! I don't recall my queue ever running into the error
>>>>>>> state until recently, and the only thing I implemented recently was
>>>>>>> the MPICH2 integration following your method.
>>>>>>> 
>>>>>>> The error is very random. One of my clusters has around 2000 serial
>>>>>>> jobs right now, and last night almost 20% of the nodes ended up with
>>>>>>> their queue in the error state. I ran "qmod -c" to clear the error,
>>>>>>> and this morning some of the nodes had the error again.
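>>>>>>> 
>>>>>>> For reference, the relevant qmod syntax is roughly:
>>>>>>> 
>>>>>>>   qmod -c '*'              # clear the error state on all queue instances
>>>>>>>   qmod -c <queue>@<node>   # or on a single queue instance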
>>>>>> 
>>>>>> You mean it's even failing when there are no MPICH2 jobs at all? All you did was add configuration entries, which won't affect the network communication at all.
>>>>>> 
>>>>>> -- Reuti
>>>>>> 
>>>>>> 
>>>>>>> K.
>>>>>>> 
>>>>>>> On Mon, Jul 5, 2010 at 4:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> On 04.07.2010 at 17:08, gqc606 wrote:
>>>>>>>> 
>>>>>>>>> I installed SGE and MPICH2 on my computers and integrated them following this page: <http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html>
>>>>>>>>> At first it worked well and everything was all right. But thirty hours later, I got some error messages in this file on one of my compute nodes:
>>>>>>>>> 
>>>>>>>>> [root at compute-0-0 ~]# cat /opt/gridengine/default/spool/compute-0-0/messages
>>>>>>>>> 06/25/2010 22:26:28| main|compute-0-0|E|can't send asynchronous message to commproc (qmaster:1) on host "cluster.local": can't resolve host name
>>>>>>>>> 06/25/2010 22:26:52| main|compute-0-0|E|commlib error: got select error (Connection reset by peer)
>>>>>>>> 
>>>>>>>> this doesn't look like it's connected to the MPICH2 setup, but rather like a NIS problem. Can all hostnames be resolved on all machines? Is the spool directory shared, or is it local on each machine?
>>>>>>>> 
>>>>>>>> Only MPICH2 jobs are affected?
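>>>>>>>> 
>>>>>>>> As a quick sanity check, run on the qmaster host and on one of the nodes (hostnames taken from your log; the SGE resolver helper may sit elsewhere depending on the installation):
>>>>>>>> 
>>>>>>>>   getent hosts cluster.local
>>>>>>>>   getent hosts compute-0-0
>>>>>>>>   # SGE's own view of name resolution, if the helper binary is present:
>>>>>>>>   $SGE_ROOT/utilbin/$($SGE_ROOT/util/arch)/gethostbyname cluster.local
>>>>>>>> 
>>>>>>>> All machines should give the same, consistent answers.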
>>>>>>>> 
>>>>>>>> -- Reuti
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I am confused and don't know how to solve this problem. Can anyone give me some advice? Thanks!
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 
>



