[GE users] Problem of SGE

reuti reuti at staff.uni-marburg.de
Mon Jul 19 10:15:04 BST 2010


Hi,

Am 17.07.2010 um 17:16 schrieb kdoman:

> Reuti -
> I think I've found the cause now, but I still don't understand why:
> 
> I followed your method for the mpich2_mpd integration. After that, I
> sent out a short submit script for users to submit their parallel
> jobs. We have one user who was so used to the old workflow that,
> whenever he wanted to kill a job, he would manually log in to the
> individual compute nodes and kill the mpd processes. This time, instead
> of simply using qdel, he again went onto the compute nodes and killed
> the processes by hand. Each time he did that, the sge_qmaster process crashed.
> 
> Does this make any sense?

No, this is not supposed to happen. On the one hand, a `qdel` should be sufficient to remove a tightly integrated job in a clean way; on the other hand, killing any job's processes on a node shouldn't crash the qmaster.
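
For example, with 4711 being just a placeholder job id, from any submit host:

    qdel 4711        # removes the master task and all tightly integrated slave tasks
    qstat -u '*'     # afterwards no remains of the job should be listed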

Does the user kill just a process on one of the parallel slave nodes, and as a result the qmaster dies? Are the qmaster and execd running under the root account? Is there anything in the qmaster's messages file (or in the one of the execd)?
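
Roughly, the places to check (assuming the default cell name "default" and the shared spool directory you are using; <nodename> is a placeholder):

    ps -eo user,args | grep sge_                       # account under which sge_qmaster and sge_execd run
    less $SGE_ROOT/default/spool/qmaster/messages      # qmaster messages on the master host
    less $SGE_ROOT/default/spool/<nodename>/messages   # execd messages of a particular node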

-- Reuti


> On Fri, Jul 9, 2010 at 11:13 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>> Am 09.07.2010 um 17:59 schrieb kdoman:
>> 
>>> I take it back.
>>> 
>>> Looking into the node, I found that the node's messages file at
>>> $SGE_ROOT/default/spool/node28/messages had filled up 100% of the disk
>>> with these lines:
>> 
>> When the spool directory is shared among all nodes, it can indeed take the complete cluster down. Best is to have the spool directories local on each node, e.g. at /var/spool/sge. Appropriate subdirectories will be created automatically.
>> 
>> http://gridengine.sunsource.net/howto/nfsreduce.html
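>> 
>> The location is taken from execd_spool_dir in the cluster configuration. A rough sketch of changing it (the execds need to be restarted afterwards, and /var/spool/sge must exist and be writable by the SGE admin user):
>> 
>>     qconf -mconf
>>     # in the editor, set:
>>     # execd_spool_dir   /var/spool/sge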
>> 
>> Besides this, there is a script /usr/sge/util/logchecker.sh. You could also use the system's standard logrotate, but the SGE-supplied script will automatically discover the appropriate location wherever your messages file is placed.
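>> 
>> If you go for logrotate instead, an entry along these lines should work (the path is just an example and must match your actual spool location; copytruncate avoids having to restart the execd):
>> 
>>     /var/spool/sge/*/messages {
>>         weekly
>>         rotate 4
>>         compress
>>         missingok
>>         copytruncate
>>     }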
>> 
>> -- Reuti
>> 
>> 
>>> 
>>> 07/03/2010 09:35:29|  main|node28|W|get exit ack for pe task
>>> 1.compute-2-8 but task is not in state exiting
>>> 
>>> Once I zeroed out the messages file, things worked again. I think some of
>>> these nodes were running parallel jobs before and the file system filled
>>> up. So when someone submitted the serial jobs, it threw the queue into
>>> an error state?
>>> 
>>> 
>>> 
>>> On Fri, Jul 9, 2010 at 10:49 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>>> Am 09.07.2010 um 17:25 schrieb kdoman:
>>>> 
>>>>> Hi Reuti -
>>>>> This is so odd! I don't recall my queues ever running into the error
>>>>> state until recently, and the only thing I implemented recently was the
>>>>> MPICH2 integration following your method.
>>>>> 
>>>>> The error is very random. One of my clusters has around 2000 serial
>>>>> jobs right now, and last night almost 20% of the nodes ended up with
>>>>> their queue in the error state. I ran "qmod -c" to clear the error,
>>>>> and this morning some of the nodes were in error again.
>>>> 
>>>> You mean it's even failing when there are no MPICH2 jobs at all? All you changed were configuration entries, which won't affect the network communication at all.
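>>>> 
>>>> To see the stored reason why a queue instance went into the error state, something like this should help in addition to the execd messages files:
>>>> 
>>>>     qstat -f -explain E      # prints the reason for queues in E state
>>>>     qmod -c '*'              # clears the error state again (as you already did)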
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> K.
>>>>> 
>>>>> On Mon, Jul 5, 2010 at 4:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> Am 04.07.2010 um 17:08 schrieb gqc606:
>>>>>> 
>>>>>>> I installed SGE and MPICH2 on my computers and integrated them following this page: <http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html>
>>>>>>> At first it worked well and everything was all right. But thirty hours later, I found some error messages in this file on one of my compute nodes:
>>>>>>> 
>>>>>>> [root at compute-0-0 ~]# cat /opt/gridengine/default/spool/compute-0-0/messages
>>>>>>> 06/25/2010 22:26:28| main|compute-0-0|E|can't send asynchronous message to commproc (qmaster:1) on host "cluster.local": can't resolve host name
>>>>>>> 06/25/2010 22:26:52| main|compute-0-0|E|commlib error: got select error (Connection reset by peer)
>>>>>> 
>>>>>> this doesn't look like it is connected to the MPICH2 setup, but rather like a NIS problem. Can all hostnames be resolved on all machines? Is the spool directory shared, or is it local on each machine?
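>>>>>> 
>>>>>> A quick check you could run on each node (cluster.local being the master host name from your messages file):
>>>>>> 
>>>>>>     getent hosts cluster.local                                             # plain resolver lookup
>>>>>>     $SGE_ROOT/utilbin/$($SGE_ROOT/util/arch)/gethostbyname cluster.local   # the lookup SGE itself performs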
>>>>>> 
>>>>>> Are only MPICH2 jobs affected?
>>>>>> 
>>>>>> -- Reuti
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> I am confused and don't know how to solve this problem. Who can give me some advice? Thanks!
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 



