[GE users] cloning a cluster
isakrejda at lbl.gov
Mon Feb 9 18:55:10 GMT 2009
I had the problem described below, and Reuti replied (I lost the e-mail, but
it is up in the archives). I am copying his reply for continuity:
>The spool directory is local on the qmaster node - in /var/spool/sge
>The daemons were started as root?
>$ ps -e f -o user,ruser,command | grep sge
>reuti reuti \_ grep sge
>sgeadmin root /usr/sge/bin/lx24-amd64/sge_qmaster
>sgeadmin root /usr/sge/bin/lx24-amd64/sge_schedd
That made me take a closer look at the sge_qmaster, and I realized what the
problem was. My sge_qmaster was older than my last attempt at cloning.
Here is what happened. In my running SGE I had a few nodes that everybody
forgot about. A few days earlier we cleaned up the /etc/hosts file and
removed nodes no longer in service. The backup procedure did not complain
about them, but cloning would not work. I had to go back, clean up, and back up
again. My second attempt at cloning worked.
The only problem was that the first attempt started the qmaster and did not
kill it when it failed, so when I tried cloning again, the old qmaster was
already there. The cloning process asks about cleaning /etc/init.d/
and it would be nice if it could also at least warn, if not ask and kill the
But it did not warn me and I ended up with a very confused qmaster.
When I noticed the old daemon, I killed it, cleaned up everything,
and then things got much better.
I wrote it up to clarify what happened to me and it might save some time
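
A leftover daemon like the one described above can be spotted before
re-running the clone by listing any running sge_qmaster together with its
start time. This is only a rough sketch, assuming procps-style pgrep and ps;
the function name report_running is made up for illustration and is not
part of SGE.

```shell
# Hypothetical helper (not part of SGE): list running daemons by exact name,
# with start times, so a qmaster older than the latest clone attempt stands out.
report_running() {
    name=${1:-sge_qmaster}
    # pgrep -x matches the exact process name; -d, joins the PIDs with commas.
    pids=$(pgrep -d, -x "$name" 2>/dev/null || true)
    if [ -z "$pids" ]; then
        echo "no $name running"
        return 0
    fi
    # lstart prints the full start date and time of each process.
    ps -o pid,user,lstart,args -p "$pids"
    return 1
}
```

If a production qmaster runs on the same host, it will show up in this list
too, so the start time and the binary path are what tell the two apart.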
On 1/29/09 6:04 PM, isakrejda wrote:
> In spite of a question I had in my earlier e-mail, I got my clone up
> and running.
> I can query its configuration and I see that it is what I wanted, but
> when I try to submit a job I get:
> pdsfcore01 53% qsub -b y /bin/date
> Unable to run job: error writing object "3000004" to spooling database
> job 3000004 was rejected cause it couldn't be written.
> pdsfcore01 54%
> My new cluster is ge6.2u1 and the sge master is running as sgeadmin
> and sgeadmin has permissions to write to the spool directory.
> Could you suggest what to look for?
> Thanks a lot,
> On 1/29/09 5:55 PM, isakrejda wrote:
>> On 1/29/09 4:11 PM, reuti wrote:
>>> Am 29.01.2009 um 22:17 schrieb isakrejda:
>>>> I am following the procedure in:
>>>> In step 9 I have a choice for selecting ports (set in scripts or via
>>>> My current ports are set in /etc/services and I do not want to disturb
>>>> my currently running cluster, so it seems to me that for the clone my
>>>> only option is to use the scripts method.
>>> I don't get what you mean by scripts method. You can answer with Y in
>>> the second question and enter any number you like there.
>> Right now I have entries in /etc/services for the currently
>> operating batch system:
>> #cat /etc/services|grep sge
>> sge_commd 536/tcp
>> sge_commd 536/udp
>> sge_qmaster 537/tcp
>> sge_execd 539/tcp
>> I cannot enter the same names with different port numbers into
>> /etc/services for my cloned cluster. I was able to get it up by using the
>> first method, where ports are defined by environment variables in the startup
>> I don't think entries for both the current and the cloned cluster can be
>> in /etc/services, and if they can, could somebody explain how it can be
>> Thank You,
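
With the environment-variable method mentioned in the message above, the
clone's ports are exported in its own startup environment rather than looked
up in /etc/services. The following is a minimal sketch, assuming the SGE 6.x
variable names SGE_QMASTER_PORT and SGE_EXECD_PORT; the port numbers are
arbitrary example values and the helper port_listed is hypothetical, neither
comes from this thread.

```shell
# Example ports for the clone (assumed values, chosen to differ from 536-539).
SGE_QMASTER_PORT=16444
SGE_EXECD_PORT=16445
export SGE_QMASTER_PORT SGE_EXECD_PORT

# Hypothetical helper: succeed if a port number already appears in a
# services file (default /etc/services), so a clash can be caught early.
port_listed() {
    awk -v p="$1" '
        $1 !~ /^#/ && $2 ~ ("^" p "/") { found = 1 }
        END { exit !found }
    ' "${2:-/etc/services}"
}

# Report any chosen port that is already assigned in /etc/services.
for port in "$SGE_QMASTER_PORT" "$SGE_EXECD_PORT"; do
    if port_listed "$port"; then
        echo "port $port is already listed in /etc/services"
    fi
done
```

Keeping the clone's ports out of /etc/services leaves the existing
sge_commd, sge_qmaster and sge_execd entries of the production cluster
untouched.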
>>> -- Reuti
>>>> Is that correct or am I missing something?...
>>>> Thank You,
>>>> To unsubscribe from this discussion, e-mail: [users-
>>>> unsubscribe at gridengine.sunsource.net].