[GE users] Using cycles from a 2nd SGE cluster

rhierlmeier richard.hierlmeier at sun.com
Mon Jan 11 09:55:01 GMT 2010


Hi Joe,


On 01/11/10 08:16, Joe Izen wrote:
> Richard,
>    Don and I compared notes this morning and we have a bunch of
> questions, but first I'd appreciate it if you could confirm that I've
> correctly understood the big picture from your email.
>
> Physically, we have two independent clusters, each with its own master
> node: fester and cosmo, which serve two user communities at our
> university, High Energy Physics (HEP) and Cosmology (C). The fester
> cluster has 19 x dual quad core batch worker nodes. The cosmo cluster
> has 13 x dual quad core batch worker nodes. Naturally each community
> wants their own batch jobs to have priority on their own hardware,
> however C has consented to share cosmo's batch workers if they are
> idling.  Since HEP jobs potentially can run for hours, it is important
> that some of cosmo's worker nodes be permanently restricted to C jobs.
>
> If we run a Service Domain Manager, I think we give up thinking about
> the 19+13 nodes as members of independent clusters, but as resources
> that can be assigned either to a HEP or C cluster according to the
> SLO's.  If fester and cosmo are idling, then:
>
>
> https://www.sun.com/offers/docs/Sun_Grid_Engine_62_install_and_config.pdf
>
> suggests we define a PermanentRequestSLO with low urgency so that the
> worker resources are returned to their "home" clusters.
>
> At present fester's file systems are NFS-mounted by cosmo's workers, but
> the reverse isn't true. Consequently, HEP jobs can execute on the cosmo
> nodes but most C jobs would fail on fester nodes.
>
> - cosmo nodes (up to some maximum number, say 10) can be assigned to the
>   HEP cluster
> - several cosmo nodes (say 3) are always assigned to the C cluster
> - fester nodes are always assigned to the HEP cluster
>
> I anticipate that hard drives in the fester worker nodes will shortly be
> loaded with data that will occasionally be used by the fester master
> node, so the fester nodes should not be turned off. I'd have to check
> with the owners of cosmo, but it might be acceptable to power-down /
> sleep the cosmo worker nodes.
>
> Do I have the SGE jargon and the big SGE picture right?

Yes, you have it.

>
>
> Questions:
>
> Where does the SDM run? Can it run on one of the master nodes, fester or
> cosmo? Can it run on a dedicated 3rd machine?

SDM is a component-based system. The components live inside Java virtual
machines (JVMs). For your use case you will need the following components:

   o Configuration Service (provides the configuration of the system)
   o Resource Provider (the scheduler of the SDM system)
   o SGE service for the HEP cluster (needs file access to SGE_ROOT)
   o SGE service for the Cosmo cluster (needs file access to SGE_ROOT)
   o Executors (used to execute scripts on the hosts; one instance on every
     machine that participates in the SDM system)
   o Optional: spare_pool (collects idle resources, can be combined with the
     power-off feature)

Configuration Service, Resource Provider and spare_pool run in a central JVM on
a central host (= SDM master host). This central host can be an SGE master
host, or it can be a dedicated host.
The SGE service must run in a JVM on a host that has access to the SGE_ROOT
directory (it reads the act_qmaster file to support the SGE failover concept).
Normally this is the SGE master host.

Each worker host needs one JVM that hosts an Executor component.
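
For your setup the layout could look like this (just a sketch, using the host
names from your mail):

   SDM master host (fester, cosmo or a dedicated machine):
       JVM with Configuration Service, Resource Provider and the spare_pool
   fester (HEP SGE master):  JVM with the SGE service for the HEP cluster
   cosmo (C SGE master):     JVM with the SGE service for the Cosmo cluster
   every worker host:        JVM with an Executor component

In addition, every host that participates in the SDM system (including the
master hosts) runs an Executor.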

> Sometimes, the owners of
> fester or cosmo take their machine offline for maintenance, but we
> wouldn't want to hobble the other user community.
>

In this case please use a dedicated SDM master host. Also use a spare_pool that
collects the idle hosts. If the master host of an SGE service goes down, all
hosts which are assigned to that service are no longer available to the SDM
system. All resources of the stopped cluster that are in the spare_pool can
still be used by the other cluster.

> How are users handled in SGE?

SGE does not have its own user management. It uses operating system features.

> In general, the HEP physicists don't have
> accounts on cosmo (and C physicists don't have accounts on fester).
> Would this become moot, because when a cosmo node is assigned to the HEP
> cluster, it inherits the user list/password/approved certificate list of
> the HEP cluster?
>

In the configuration of an SGE service you can define scripts that will be
executed once a resource (=host) is assigned to or removed from a service
(=SGE cluster). Within such a script you can reconfigure your user database
(different NIS server, LDAP, whatever you want).

> Do fester and cosmo have to run identical versions of Linux in order to
> contribute workers to the same HEP cluster managed by a SDM?

No.

> Naturally
> the executing jobs would need to run in both OS versions, but does the
> SDM impose its own requirement?

SDM is written in Java. The only requirement is a Java 6 Runtime Environment.

>
> Our usage example is not so complicated - it must be fairly common. For
> starters, the urgency of a job need not take into account the wait time
> or a deadline. Is there a sample of SLOs for an SDM that we can study
> and use as a starting point for our own?  If not, it would be a big help
> if you could show us a skeleton of what we need.

You will need a combination of MaxPendingJobsSLO, PermanentRequestSLO and
MinResourceSLO.

Each SDM resource (a resource represents a host in SDM) will have a property
named owner. For HEP resources the value of this property will be "HEP"; for C
resources the value will be "Cosmo".

For each SGE service you need:

o A MinResourceSLO with the highest urgency (=99) that guarantees that at
   least a minimum number of resources stay in the cluster (alternatively you
   can mark a set of resources as static; static resources are never moved by
   SDM).

o A MaxPendingJobsSLO that requests resources that originally belong to the
   cluster, with urgency 90.

o A MaxPendingJobsSLO that requests resources that originally belong to the
   foreign cluster, with urgency 50.


For the spare_pool you need a PermanentRequestSLO with a low urgency (e.g. 10)
that stays below all of the service SLOs.

Sample SLO setup for the HEP SGE service:

<common:componentConfig>
  ...
  <common:slos>
     <common:slo xsi:type="MinResourceSLOConfig"
         name="min_res" urgency="99" min="3">
         <common:request> owner = "HEP" </common:request>
     </common:slo>
     <common:slo xsi:type="MaxPendingJobsSLOConfig"
          name="mpj_for_own_res" max="1" urgency="90">
         <common:request> owner = "HEP" </common:request>
     </common:slo>
     <common:slo xsi:type="MaxPendingJobsSLOConfig"
          name="mpj_for_foreign_res" max="1" urgency="50">
    </common:slo>
  </common:slos>
  ...
</common:componentConfig>
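
A sample SLO setup for the Cosmo SGE service would be the mirror image (only a
sketch: min="3" corresponds to the cosmo nodes that should always stay with C,
and I left out the SLO for foreign resources because C jobs would fail on the
fester nodes anyway; adjust the values to your needs):

<common:componentConfig>
  ...
  <common:slos>
     <common:slo xsi:type="MinResourceSLOConfig"
         name="min_res" urgency="99" min="3">
         <common:request> owner = "Cosmo" </common:request>
     </common:slo>
     <common:slo xsi:type="MaxPendingJobsSLOConfig"
          name="mpj_for_own_res" max="1" urgency="90">
         <common:request> owner = "Cosmo" </common:request>
     </common:slo>
  </common:slos>
  ...
</common:componentConfig>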

Sample SLO setup for the spare_pool:

<common:componentConfig>
    ...
    <common:slo xsi:type="PermanentRequestSLOConfig"
          name="perm_for_idle_res" urgency="10">
         <common:request> owner = "HEP" </common:request>
    </common:slo>
    ...
</common:componentConfig>
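
With these urgencies a cluster will first get back its own idle resources
(urgency 90) and only then take idle resources of the other cluster (urgency
50). A host for which no service produces a higher-urgency request ends up in
the spare_pool (urgency 10), while the MinResourceSLO (urgency 99) makes sure
that a cluster never drops below its minimum number of hosts.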

Unfortunately we do not have an out-of-the-box SLO that considers the deadline
of a job. If I understood you correctly, such an SLO should produce a
high-urgency resource request for a host once the deadline comes nearer. Sounds
like an interesting use case. Let us get together to formulate the requirements
for such an SLO.

>
> Thanks very much!

You are welcome

   Richard


> Joe
>
>
>
> At 4:14 PM +0100 1/8/10, Richard Hierlmeier wrote:
>> Don wrote:
>>> I haven't used subordinate queues before in SGE. In PBSpro,
>>> I used routing queues between two master nodes. I suppose
>>> you could think of all queues as subordinate to the route
>>> queue. Although - spare-cycle jobs should be subordinate to
>>> the owner's job mix - so this is useful to think about at
>>> a later time.
>>>
>>> Next - he's explaining exactly what we've already done and
>>> talked about. In this scenario - all slaves would only
>>> run SDM and sgeexecd would be removed at install time.
>>
>> sgeexecd is installed on a node if SDM moves the host into the
>> cluster. The execd is uninstalled if SDM removes the host from the
>> cluster. The installation of the execd is only a matter of seconds.
>>
>> However, the cluster has a problem if the host is part of an advance
>> reservation. The reservation will fail. SDM does not give any
>> guarantees that the host will come back in time.
>>
>>> Each master node would run independently.
>>
>> This is the big advantage of the SDM solution. The configurations of
>> the two clusters are completely independent.
>>
>>> I don't see
>>> any way to prioritize the queues
>>
>> You cannot prioritize queues with SDM; SDM does not know about queues. But
>> you can give certain kinds of jobs a higher urgency. For each job
>> category the corresponding SDM service will have a MaxPendingJobsSLO.
>> The SLO will only produce a resource request if pending jobs belonging
>> to this job category are available. The resource request with the
>> highest urgency will win. The cluster will get more resources (=hosts).
>>
>>> or tie the resources
>>> of one cluster to its own master - beyond just naming
>>> in SDM.
>>
>> In SDM you can have static resources. SDM will never move such a
>> static host away from the cluster.
>>
>>> SLO is a complex mess to deal with - very little
>>> guidance on how to do much with it beyond the basics.
>>
>> If you need any help with the SLO setup, you are welcome to ask.
>>
>>> I'm not clear about how the nodes are powered off and
>>> on by SDM - and what associated hw/firmware is involved
>>> in this - could be that's only a case for blade/chassis,
>>> or only the latest hardware, i.e. iDRAC.
>>
>> SDM 1.0u5 comes with a power saving solution that turns off hosts via
>> IPMI. However, this is not hard coded. SDM has a well defined scripting
>> interface for power saving. If you can power off your hosts remotely
>> from the command line, power saving with SDM will be possible.
>>
>>> The install/uninstall of sgeexecd is rather useless work
>>
>> I agree, SDM automates this task for you.
>>
>>> - I don't understand
>>> why a static pool cannot be designated spare.
>>>
>>> acluster sgemaster
>>> bcluster sgemaster
>>>
>>> sparepool aslave1..20 bslave1..13
>>
>> I do not really understand what a static pool is. Is it an SDM
>> spare_pool or a subordinate queue?
>>
>>>
>>> all OS and sw would need to be consistent enough for
>>> any job from each owner mix. And the user account needs
>>> to be the same for the shared job, usatlas1. I suppose if a user
>>> was defined in one pool but not the other,
>>> the job would fail with an error. Not sure if SGE can
>>> deal with two independent uid/gid setups.
>>>
>>
>> SDM supports different kinds of resources. Categorize your jobs and
>> set up an SLO for each job category. The SLO will request the needed hosts.
>>
>>
>> Richard
>>
>>> -/Don
>>>
>>> On Fri, 8 Jan 2010, Richard Hierlmeier wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> izen wrote:
>>>>> My sys admin has been trying to configure two independent, linux
>>>>> clusters with static SGE pools, such that when the first cluster
>>>>> batch queue fills, additional jobs will fall over to a low priority
>>>>> queue in the second cluster. Each cluster has its own master node,
>>>>> and it would be a political non-starter to change that. So far, my
>>>>> admin has not succeeded. Is his configuration with static pools
>>>>> workable?
>>>>
>>>> I think you are talking about subordinate queues. No, it is not
>>>> doable with this feature. Sounds more like a use case for the
>>>> Service Domain Manager (SDM) module of SGE.
>>>>
>>>>> If so, we would welcome some guidance in configuring our SGE
>>>>> deployment to do this.
>>>>
>>>> SDM implements resource sharing between two or more SGE clusters.
>>>> For each SGE cluster an SLO (Service Level Objective) can be defined.
>>>> This SLO will request new hosts whenever there are jobs in the
>>>> pending queue. SDM takes hosts out of the spare_pool and installs the
>>>> execd of the cluster on them. Once the workload goes down, the hosts are
>>>> removed from the SGE cluster and put back into the spare_pool.
>>>>
>>>> You can implement power saving (hosts in spare_pool can be powered
>>>> off) with SDM. In addition you can get hosts from a cloud service
>>>> like EC2.
>>>>
>>>> For a good introduction please have a look at
>>>>
>>>> http://www.youtube.com/watch?v=kFrwOdAVxJI
>>>>
>>>>
>>>> Richard
>>>>
>>>>>
>>>>> We are beginning to wonder whether this is undoable with static
>>>>> pools, and need to switch to a dynamic pool.
>>>>>
>>>>> Input would be most welcome.  Thanks!  -Joe
>>>>>


--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Richard Hierlmeier           Phone: ++49 (0)941 3075-223
Software Engineering         Fax:   ++49 (0)941 3075-222
Sun Microsystems GmbH
Dr.-Leo-Ritter-Str. 7        mailto: richard.hierlmeier at sun.com
D-93049 Regensburg           http://www.sun.com/grid

Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering
