[GE users] cloning a cluster

isakrejda isakrejda at lbl.gov
Mon Feb 9 19:35:21 GMT 2009


So now I have a qmaster that is talking to me, and I have adjusted the
configuration.
I also ran "# ./inst_sge -upd-execd" and initialized the local spool
directories (and verified that they are there).

I skipped "# ./inst_sge -upd-rc" because it comes with a warning:
"Caution: This command removes old RC scripts. To keep the old
RC scripts, do not run this command." I'll run it when I am ready
to switch from the current cluster to the newly cloned one.

Now it's time to start the cluster, and I have two questions.

There is an sge_qmaster that was started during the cloning process. Should I
stop it before issuing "# ./inst_sge -start-all"?
Also, I presume this command will start the sge_execd daemons on the execution
hosts.
I understood from the writeup that they will not interfere with the
"production" daemons (I did define different ports for the clone).

This is the first time I am doing cloning, and I do not want to upset
my production setup at this point.
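
One way to double-check the port separation before starting anything is to
compare the ports exported for the clone with the entries the production
cluster has in /etc/services. The sketch below is only an illustration: it
assumes the clone's environment sets the usual SGE_QMASTER_PORT and
SGE_EXECD_PORT variables, and the helper names service_port and
check_port_clash are made up for this example.

```shell
#!/bin/sh
# service_port NAME [FILE]: print the TCP port registered for NAME
# in a services(5)-style file (default: /etc/services).
service_port() {
    awk -v name="$1" '
        $1 == name && $2 ~ /\/tcp$/ { sub(/\/tcp$/, "", $2); print $2; exit }
    ' "${2:-/etc/services}"
}

# check_port_clash NAME CLONE_PORT [FILE]: return 1 (and complain on stderr)
# if CLONE_PORT equals the port registered for NAME, 0 otherwise.
check_port_clash() {
    prod_port=$(service_port "$1" "${3:-/etc/services}")
    if [ -n "$prod_port" ] && [ "$prod_port" = "$2" ]; then
        echo "clash: $1 uses port $2 in both clusters" >&2
        return 1
    fi
    return 0
}

# Possible use, after sourcing the clone's settings file:
#   check_port_clash sge_qmaster "$SGE_QMASTER_PORT"
#   check_port_clash sge_execd   "$SGE_EXECD_PORT"
```

If either call reports a clash, the clone would try to use a port that the
production daemons already use, so it is worth fixing before running
"-start-all".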

Thanks a lot,

iwona



On 2/9/09 10:55 AM, isakrejda wrote:
> Hi,
>
> I had the problem described below and Reuti replied (I lost the e-mail
> but looked it up in the archives). I am copying his reply for continuity:
>
> >The spool directory is local on the qmaster node - in /var/spool/sge
> >or $SGE_ROOT/default?
>
> >The daemons were started as root?
>
> >$ ps -e f -o user,ruser,command | grep sge
> >reuti reuti \_ grep sge
> >sgeadmin root /usr/sge/bin/lx24-amd64/sge_qmaster
> >sgeadmin root /usr/sge/bin/lx24-amd64/sge_schedd
>
> That made me take a closer look at the sge_qmaster, and I realized what
> the problem was. My sge_qmaster process was older than my last attempt
> at cloning.
>
> Here is what happened. In my running SGE I had a few nodes that everybody
> forgot about. A few days earlier we cleaned up the /etc/hosts file and
> removed nodes no longer in service. The backup procedure did not complain
> about them, but cloning would not work. I had to go back, clean up and
> back up again. My second attempt at cloning worked.
>
> The only problem was that the first attempt started the qmaster and did
> not kill it when it failed, so when I tried cloning again, the sge_qmaster
> was already there. The cloning process asks about cleaning up the
> /etc/init.d/ startup file, and it would be nice if it could also at least
> warn about a running sge_qmaster, if not ask and kill it.
> But it did not warn me, and I ended up with a very confused qmaster.
>
> When I noticed the old daemon, I killed it, cleaned up everything and
> re-cloned, and then things got much better.
>
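
A check along the following lines, run before each cloning attempt, might
catch a leftover daemon like the one described above. It is a minimal sketch,
not part of the cloning procedure: the function name filter_sge_daemons is
made up, and it only looks at the daemon names that appear in this thread
(sge_qmaster, sge_schedd, sge_execd).

```shell
#!/bin/sh
# filter_sge_daemons: read "PID USER ELAPSED COMMAND..." lines on stdin and
# print only those whose command basename is a Grid Engine daemon.
filter_sge_daemons() {
    awk '{
        n = split($4, parts, "/")      # $4 is the first word of the command
        cmd = parts[n]                 # basename of the executable
        if (cmd == "sge_qmaster" || cmd == "sge_schedd" || cmd == "sge_execd")
            print
    }'
}

# Possible use on the qmaster host (prints nothing if no daemon is running):
#   ps -e -o pid= -o user= -o etime= -o args= | filter_sge_daemons
```

The elapsed-time column shows how long each daemon has been running, which
makes a process left over from an earlier attempt easy to spot.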
>
> I wrote this up to clarify what happened to me; it might save
> somebody else some time...
>
> Iwona
>
> On 1/29/09 6:04 PM, isakrejda wrote:
>> Hi,
>>
>> In spite of the question I had in my earlier e-mail, I got my clone up
>> and running.
>> I can query its configuration and I see that it is what I wanted,
>> but when I try to submit a job I get:
>> pdsfcore01 53% qsub -b y /bin/date
>> Unable to run job: error writing object "3000004" to spooling database
>> job 3000004 was rejected cause it couldn't be written.
>> Exiting.
>> pdsfcore01 54%
>>
>> My new cluster is ge6.2u1, the sge_qmaster is running as sgeadmin,
>> and sgeadmin has permission to write to the spool directory.
>>
>> Could you suggest what to look for?
>>
>> Thanks a lot,
>>
>> iwona
>>
>>
>> On 1/29/09 5:55 PM, isakrejda wrote:
>>>
>>>
>>> On 1/29/09 4:11 PM, reuti wrote:
>>>> Hi,
>>>>
>>>> Am 29.01.2009 um 22:17 schrieb isakrejda:
>>>>
>>>>   
>>>>> I am following the procedure in:
>>>>> http://wikis.sun.com/display/GridEngine/Example+Upgrade+for+Cloned+Cluster+Configuration
>>>>>
>>>>> In step 9 I have a choice for selecting ports (set in scripts or via
>>>>> /etc/services).
>>>>> My current ports are set in /etc/services and I do not want to disturb
>>>>> my currently running cluster, so it seems to me that for the clone my
>>>>> only option is to use the scripts method.
>>>>>     
>>>>
>>>> I don't get what you mean by scripts method. You can answer with Y to
>>>> the second question and enter any number you like there.
>>>>   
>>> Right now I have entries in /etc/services for the currently
>>> operating batch system:
>>> # cat /etc/services | grep sge
>>> sge_commd       536/tcp
>>> sge_commd       536/udp
>>> sge_qmaster     537/tcp
>>> sge_execd       539/tcp
>>>
>>> I cannot enter the same names with different port numbers into
>>> /etc/services for my cloned cluster. I was able to get it up by using the
>>> first method, where the ports are defined by environment variables in the
>>> startup scripts.
>>>
>>> I don't think entries for both the current and the cloned cluster can be
>>> in /etc/services, and if they can, could somebody explain how it can be
>>> done?
>>>
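
The environment-variable method described above can be sketched as a small
helper that points a shell at one cluster or the other. This is an
illustration under assumptions: SGE_ROOT, SGE_CELL, SGE_QMASTER_PORT and
SGE_EXECD_PORT are the usual Grid Engine variable names, but the function
name use_cluster, the path and the port numbers are hypothetical.

```shell
#!/bin/sh
# use_cluster ROOT CELL QMASTER_PORT EXECD_PORT: point the current shell at
# one Grid Engine installation by exporting the variables its clients read.
use_cluster() {
    SGE_ROOT=$1
    SGE_CELL=$2
    SGE_QMASTER_PORT=$3
    SGE_EXECD_PORT=$4
    export SGE_ROOT SGE_CELL SGE_QMASTER_PORT SGE_EXECD_PORT
}

# Hypothetical values for a clone living next to production:
#   use_cluster /opt/sge-clone default 6444 6445
#   qconf -sconf      # should now contact the clone's qmaster
```

Because the ports travel with the environment, the production entries in
/etc/services can stay untouched while the clone is tested.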
>>> Thank You,
>>>
>>> iwona
>>>
>>>
>>>
>>>> -- Reuti
>>>>
>>>>
>>>>   
>>>>> Is that correct or am I missing something?...
>>>>>
>>>>> Thank You,
>>>>>
>>>>> Iwona
>>>>>
>>>>> ------------------------------------------------------
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=100385
>>>>>
>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>>     
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=100423
>>>>
>>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=102989



