[GE users] Newbie questions

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Wed Apr 6 16:52:31 BST 2005


Dale,

Here is a fairly short procedure that could get you going.  It has some
specifics for our installation, but should be helpful.  Try it on an SGE
master and a couple of nodes just to start with.  Rueti's questions are
great to help you decide how you want to do things.

Dan

---------------------------------------
1.1. Introduction
This document describes what a computing grid is, and how the setup and
configuration of the Sun Grid Engine (SGE) can be accomplished.  It is
intended as an example of how to set up SGE and is not a replacement for
the Installation, Administration, and User's Guides that Sun has available.
1.1.1. What Is A Computing Grid?
A computing grid is a collection of hosts, a cluster computer, a
multiprocessor host, or any combination of the above tied together with
software (e.g., SGE) to make available to the user the execution of jobs
on CPUs in the grid.
The power of grid computing comes from being able to harness idle time on
hosts in the grid as well as making easily available computing resources
that are not directly at a user's fingertips.  Additionally, the queuing
capability of the grid software makes background execution of a task
much easier, even on a single machine.
Usually a system administrator configures the grid, sets up the job
queues, and defines the rules governing which users can use which
resources.
1.2. Sun Grid Engine (SGE) Documentation
Sun's documentation can be found by searching the list found at the
location http://docs.sun.com/app/docs/titl.  Search by title for "N1
grid engine".  At a minimum the Installation, Administration, and User's
Guides should be downloaded.
1.3. Downloading SGE Software
SGE software can be downloaded from the Grid Engine website at
http://gridengine.sunsource.net/.  Under "Resources" on the left side of
the page, click on Download SGE 6.0 Binaries.  Follow the instructions
on the page to read and accept the license agreement and find the proper
files for the O/S and host types to download.  The files are in *.tar.gz
format.
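As a rough sketch of the unpacking step, assuming the downloaded
archives were all saved into one directory, every *.tar.gz file can be
extracted into the directory that will become SGE_ROOT.  The function
name and the example paths are illustrative only and do not come from
the Sun documentation.

```shell
# unpack_sge DOWNLOAD_DIR SGE_ROOT
# Extracts every *.tar.gz archive found in DOWNLOAD_DIR into SGE_ROOT.
unpack_sge() {
    download_dir=$1
    sge_root=$2
    mkdir -p "$sge_root" || return 1
    for archive in "$download_dir"/*.tar.gz
    do
        # Skip the unexpanded pattern when no archive is present
        [ -f "$archive" ] || continue
        gzip -dc "$archive" | (cd "$sge_root" && tar xf -) || return 1
    done
}

# Example call (hypothetical paths):
# unpack_sge /tmp/sge-download /home/sgeadmin/SunGridEngine
```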
1.4. Installing SGE
1.4.1. Planning
Several decisions must be made when planning an installation:
* Decide whether the system of networked computer hosts that run N1 Grid
  Engine 6 software (the grid engine system) is to be a single cluster
  or a collection of sub-clusters, called cells. Cells allow for
  separate instances of the grid engine software but share the binary
  files across those instances.
* Select the machines that are to be grid engine system hosts. Determine
  the host type of each machine: master host, shadow master host,
  administration host, submit host, execution host, or a combination.
* Ensure that all users of the grid engine system have the same user
  names on all submit and execution hosts.
* Decide how to organize grid engine software directories. For example,
  they could be organized as a complete tree on each workstation, or as
  cross-mounted directories, or as a partial directory tree on some
  workstations. Also decide where to locate each grid engine software
  installation directory, SGE_ROOT.
* Decide on the site's queue structure.
* Determine whether to define network services as an NIS file or as
  local to each workstation in /etc/services.
Chapter 1 of the Installation Guide is very helpful in this process.  It
discusses each of the areas mentioned above as well as disk space
requirements.
As an example for this document, here is how the above decisions were
made:
* A single cluster under the default name "default".
* It has a single Master Host and no Shadow Master Hosts.  All remaining
  machines are Administration, Submit, and Execution hosts.
* All users of the grid engine have the same user name on all machines
  through the use of an LDAP server.
* The grid engine software directories are mountable via NFS (the
  network file system) in the same location by any machine in the
  cluster as /direct/sgeadmin/SunGridEngine.  This directory physically
  resides on the Master Host.
* The site's queue structure divides the available machines into 3
  classes:
  o high.q for the highest speed, dual processor, hyper-threaded Xeon
    3.2 GHz, 1.5 GByte machines.
  o mid.q for the mid-range speed, single processor, hyper-threaded
    Pentium 4 2.8 GHz, 1 GByte machines.
  o low.q for the lowest speed, single processor, non-hyper-threaded
    Pentium 4 1.8 GHz, 512 MByte machines.
  For special purposes, functionally based queues are created from the
  same pool of machines.  For example, some machines may be used in the
  mysql.q job queue to do database loading while others are used in
  xeq.q for running the model.  Care must be taken when hosts are in
  multiple queues because SGE could overload a host if overlapping
  queues are used simultaneously.
* Network services are defined on each machine via the /etc/services
  file as follows.  Different port numbers may be used, but they MUST
  be the same on all machines.  (Note that the UDP values are not used
  by SGE.)
# Local services
sge_command     460/tcp     # Sun Grid Computing command port
sge_command     460/udp     # Sun Grid Computing command port
sge_qmaster     461/tcp     # Sun Grid Computing queue master port
sge_qmaster     461/udp     # Sun Grid Computing queue master port
sge_execd       462/tcp     # Sun Grid Computing xeq port
sge_execd       462/udp     # Sun Grid Computing xeq port
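
Since the port numbers have to match on every machine, a small check
can help when comparing hosts.  The following is only a sketch: the
function name is made up for this example, and it assumes the services
are defined in a local file in /etc/services format rather than in NIS.

```shell
# sge_port SERVICE [FILE]
# Prints the TCP port registered for SERVICE in FILE (default /etc/services).
sge_port() {
    awk -v svc="$1" '
        $1 == svc && $2 ~ /\/tcp$/ { sub(/\/tcp$/, "", $2); print $2; exit }
    ' "${2:-/etc/services}"
}

# Example: print the three ports on this host to compare with other hosts.
# for s in sge_command sge_qmaster sge_execd; do echo "$s `sge_port $s`"; done
```

An empty result for a service means the entry is missing on that host.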

1.4.2. Needed Packages
For the GUI of SGE to work, the openmotif21-2.1.30-8 or greater package
must be installed.
1.4.3. Master Host
To install a Master Host, follow the procedure "How to Install the
Master Host" in Chapter 2 of the Installation Guide.  
The following will help while stepping through the install_qmaster
command interaction:
* Choose a name for the SGE administrator; sgeadmin is a good choice.
  This pseudo user must be created before beginning the installation.
* The SGE_ROOT environment variable should be set to
  /home/sgeadmin/SunGridEngine or wherever the software was untarred.
* Make sure the /etc/services file is updated with the correct local
  services before beginning.
* Create a file containing the names of the hosts that are to be the
  execution hosts.  This file will be used later in this procedure.
* Keep the default cell name of "default" unless planning for multiple
  cells.
* The default spool directory is fine for most setups.
* When asked whether file permissions have been set, enter "n" and then
  enter "y" to verify and set the file permissions.
* Answer questions about the DNS names of the grid engine system hosts.
* For most installations, choose to use Berkeley DB spooling, but
  without a separate spooling server.
* For a group ID range, 2000-2100 is a reasonable value as long as:
  1. Established groups do not already extend into this range; and
  2. It is not likely that more than 100 grid engine jobs will run on
     the same host at the same time.
  If more than 100 concurrent grid engine jobs are needed, increase the
  end of the range.
* The default spool directory for the execution hosts is usually fine.
  Give a local directory for speed of execution during Execution Host
  configuration.
* Choose a person who can receive email in cases of problems.  The email
  address of the person doing this installation is a good choice.
* Verify that the configuration parameters look right when asked to do
  so by the installation script.
* Request that the qmaster/scheduler startup script be run at startup
  time.
* When asked to specify the execution hosts, use the previously created
  file listing the host names.
* For a scheduler profile, Normal is acceptable unless the installation
  is creating a very high performance system needing to service many
  users and many different types of jobs.
Once past this question, the installation process is complete. Several
screens of information will be displayed before the script exits. The
commands that are noted in those screens are also documented in Chapter
2 of the Installation Guide.
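Before answering the group ID range question, it may help to confirm
that no established group already uses an ID in the intended range.
This is a minimal sketch with a made-up function name; it only reads a
file in /etc/group format, so at a site where groups come from LDAP the
output of "getent group" would have to be saved to a file and checked
instead.

```shell
# gid_range_in_use FIRST LAST [GROUP_FILE]
# Prints "name:gid" for each group whose GID lies in FIRST..LAST and
# prints nothing if the range is free.  GROUP_FILE defaults to /etc/group.
gid_range_in_use() {
    awk -F: -v lo="$1" -v hi="$2" '
        $3 + 0 >= lo + 0 && $3 + 0 <= hi + 0 { print $1 ":" $3 }
    ' "${3:-/etc/group}"
}

# Example: check the range suggested in this document.
# gid_range_in_use 2000 2100
```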
The end of the Master Host installation procedure describes the
settings.csh and settings.sh files.  It will be very helpful to change
the .profile, .bash_profile, .login or equivalent startup script for the
root account on each machine that will be part of the computing grid to
include execution of the appropriate script.  Doing so will mean that
all of the proper environment variables will be set up when using the
root account.  Each user of the grid engine should also add the same
thing to his or her startup script.
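For Bourne-style shells, the startup script change could look something
like the sketch below.  It assumes the installer wrote settings.sh
under <SGE_ROOT>/<cellname>/common, and it uses the example install
path and the cell name "default" from this document as fallbacks; the
function name is invented for illustration.  csh users would source
settings.csh from .login instead.

```shell
# load_sge_settings [SGE_ROOT] [CELL]
# Sources the settings.sh file written by the installer, if it is readable.
load_sge_settings() {
    sge_settings="${1:-/home/sgeadmin/SunGridEngine}/${2:-default}/common/settings.sh"
    if [ -r "$sge_settings" ]
    then
        . "$sge_settings"
    else
        echo "SGE settings not found: $sge_settings" >&2
        return 1
    fi
}

# In .profile or .bash_profile:
# load_sge_settings
```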
1.4.4. Execution Hosts
As with installing a Master Host, use the procedure "How to Install
Execution Hosts" in Chapter 2 of the Installation Guide. The following
notes assist in making decisions:
* The master server MUST be installed before beginning this procedure.
* The SGE_ROOT environment variable should be set to
  /home/sgeadmin/SunGridEngine or wherever the software was untarred.
* Make sure the /etc/services file on each Execution Host is updated
  with the SGE local services before beginning the installation.
* Run the install_execd command as root.
* Make sure that the SGE_ROOT directory is correct, that it is the same
  as where the SGE tarball was installed, and that it was used for the
  Administration Host installation.  Alternatively, put a local copy of
  the SGE_ROOT directory on each local host to cut down on NFS traffic.
  Note that separate directories on each execution host necessitate
  additional manual configuration to keep them synchronized as things
  change.
* The default cell name of "default" should be sufficient unless there
  are hosts that cannot directly communicate with each other.
* Use a local spool directory to keep NFS traffic to a minimum.
  /var/spool/sge is a reasonable choice.  This directory needs to be
  created outside of the procedure and needs to be owned by sgeadmin.
* Have execd start up automatically at boot time.
* Add the default queue instance for the host.  It will help for testing
  before creating queues specific to the needs of the users.
Once past the last question, the installation process is complete.
Several screens of information will be displayed before the script
exits. The commands that are noted in those screens are also documented
in Chapter 2 of the Installation Guide.
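Because the local spool directory has to exist before install_execd is
run, it can be prepared with something like the following.  This is a
sketch only; the function name is made up, and on a real execution host
it would be run as root.

```shell
# make_spool_dir DIR OWNER
# Creates DIR if needed and makes OWNER its owner.
make_spool_dir() {
    mkdir -p "$1" && chown "$2" "$1"
}

# Example, using the values suggested above:
# make_spool_dir /var/spool/sge sgeadmin
```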
The end of the Execution Host installation procedure describes the
settings.csh and settings.sh files.  It will be very helpful to change
the .profile, .bash_profile, .login or equivalent startup script for
each user that will use the computing grid to include execution of the
appropriate script.  Doing so will mean that all of the proper
environment variables will be set up when using the grid.
1.4.5. Administration Hosts
For each execution host that should also allow administration of the
computing grid, follow the procedure "Registering Administration Hosts"
in Chapter 2 of the Installation Guide.  Basically, run the following
command as root:
			qconf -ah admin_host_name[,...]

1.4.6. Submit Hosts
For each execution host that should also allow submission of jobs to the
computing grid, follow the procedure "Registering Submit Hosts" in
Chapter 2 of the Installation Guide.  Basically, run the following
command as root:
			qconf -as submit_host_name[,...]
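Both commands take a comma-separated list, so the file of execution
host names created for the Master Host installation can be reused.  A
possible sketch follows; the function name and the file name
exec_hosts.txt are assumptions made for this example.

```shell
# host_csv FILE
# Joins the host names in FILE (one per line) with commas, skipping blank
# lines and lines starting with "#".
host_csv() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | paste -s -d, -
}

# Example (hypothetical file name; run as root on the master host):
# qconf -ah "`host_csv exec_hosts.txt`"
# qconf -as "`host_csv exec_hosts.txt`"
```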
1.4.7. Verify the Installation
Using Chapter 6 of the Installation Guide, verify that the installation
is up and running.  Follow the "How To Verify That Daemons Are Running
On The Master Host" and "How To Verify That The Daemons Are Running On
The Execution Hosts" procedures.
Once this all works, try submitting one of the sample scripts contained
in the $SGE_ROOT/examples/jobs directory.  For example:
qsub $SGE_ROOT/examples/jobs/simple.sh
Use the qstat command to monitor the job's behavior.
For more information about submitting and monitoring batch jobs, see
Submitting Batch Jobs in chapter 3 of the N1 Grid Engine 6 User's Guide.
After the job finishes executing, check the home directory for the
redirected stdout/stderr files script-name.ejob-id and
script-name.ojob-id.
job-id is a consecutive unique integer number assigned to each job.
In case of problems, see Chapter 8, Fine Tuning, Error Messages, and
Troubleshooting, in the N1 Grid Engine 6 Administration Guide.
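To make the output file naming concrete, the small helper below prints
the two file names to look for, given the script name and the job id
reported by qsub.  It is an illustrative sketch (the function name is
invented) and assumes the files land in the user's home directory as
described above.

```shell
# job_output_files SCRIPT JOB_ID [DIR]
# Prints the expected stdout and stderr file names, one per line.
# DIR defaults to the home directory.
job_output_files() {
    name=`basename "$1"`
    dir=${3:-$HOME}
    echo "$dir/$name.o$2"
    echo "$dir/$name.e$2"
}

# Example: show the output of job 42 of simple.sh
# cat `job_output_files simple.sh 42`
```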
1.5. Configuring Users
1.5.1. Users
The simplest way to configure users is to use user-based equal sharing
of resources, with automatic registration of users.  To do this,
configure the cluster global configuration (see sge_conf(5)) with the
following:
enforce_user	auto
auto_user_fshare	100

Using the qmon GUI configuration tool, do the following:
A. Click the "Cluster Configuration" button, select "global" in the left
   column and click the "Modify" button.
B. In the General Settings tab, look for the "Automatic User Defaults"
   area at the lower right and set "Functional Shares" to 100.
C. Just above that, set "Enforce User" to "Auto" ("Enforce Project"
   should be "False").
Next, configure the scheduler configuration (see sched_conf(5)) with the
following:
weight_tickets_functional	10000

Again using qmon, click on the "Policy Configuration" button.  In the
"Ticket Policy" section, set "Total Functional Tickets" to 10000.
This will result in each user being automatically registered in the
computing grid when they submit a job, and each user having equal access
to grid resources.  That is, if Bob and Wanda both submit jobs, barring
any other constraints, they will share the currently available computing
resources equally.  View the currently registered set of users by using
qmon, clicking on the "User Configuration" button, and selecting the
User tab.
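The same settings can also be edited from the command line with
"qconf -mconf global" and "qconf -msconf", which open the respective
configuration in an editor.  To confirm the values afterwards, a sketch
such as the following can be used; the helper name is made up, and it
assumes the "name value" line layout that qconf prints.

```shell
# conf_value KEY
# Reads "name value" lines on standard input, as printed by
# "qconf -sconf global" or "qconf -ssconf", and prints the value of KEY.
conf_value() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Example, on a host with SGE installed:
# qconf -sconf global | conf_value enforce_user
# qconf -sconf global | conf_value auto_user_fshare
# qconf -ssconf       | conf_value weight_tickets_functional
```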
1.5.2. Managers
Managers can perform any operation the Grid Engine is capable of
performing.  To configure users who have manager privileges for the
grid, use qmon and click on the "User Configuration" button. Under the
Manager tab, enter the names of users who will be managers and click the
"Add" button.  See chapter 4 of the N1 Grid Engine 6 Administration
Guide for details on what managers can do.
1.5.3. Operators
Operators have more privileges than simple users, but fewer than
managers.  Use the Operator tab in the "User Configuration" screen to
enter operators.  See chapter 4 of the N1 Grid Engine 6 Administration
Guide for details on what operators can do.
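Managers and operators can also be added without qmon, using
"qconf -am" and "qconf -ao" (and listed with "qconf -sm" and
"qconf -so").  The wrapper below is only a sketch with an invented
function name; the DRY_RUN variable is likewise an assumption of this
example and lets the commands be previewed before they are run.

```shell
# grant_role ROLE USER...
# ROLE is "manager" or "operator".  With DRY_RUN set, each qconf command
# is printed instead of being executed.
grant_role() {
    case $1 in
        manager)  flag=-am ;;
        operator) flag=-ao ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
    shift
    for user in "$@"
    do
        ${DRY_RUN:+echo} qconf "$flag" "$user"
    done
}

# Example:
# DRY_RUN=1 grant_role manager bob wanda
```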
1.6. Configuring Job Queues
Queues are containers for different categories of jobs. Queues provide
the corresponding resources for concurrent execution of multiple jobs
that belong to the same category.
In SGE, a queue can be associated with one host or with multiple hosts.
Because queues can extend across multiple hosts, they are called cluster
queues. Cluster queues enable managing a cluster of execution hosts by
means of a single cluster queue configuration and name.
Each host that is associated with a cluster queue receives an instance
of that cluster queue, which resides on that host. These instances are
known as queue instances. Within any cluster queue, each queue instance
can be configured separately. By configuring individual queue instances,
a heterogeneous cluster of execution hosts can be managed by means of a
single cluster queue configuration and name. 
When modifying a cluster queue, all of its queue instances are modified
simultaneously. Within a cluster queue, differences in the
configuration of queue instances can be specified by separately adding
the associated host and modifying its attributes. Consequently, a
typical setup might have only a few cluster queues, and the queue
instances controlled by those cluster queues remain largely ignored.
NOTE: The distinction between cluster queues and queue instances is
important. For example, jobs always run in queue instances, not in
cluster queues.
When configuring a cluster queue, any combination of the following host
objects can be associated with the cluster queue:
* One execution host
* A list of separate execution hosts
* One or more host groups
A host group is a group of hosts that can be treated collectively as
identical. Host groups enable management of multiple hosts by means of a
single host group configuration. For more information about host
groups, see "Configuring Host Groups With QMON" in chapter 1 of the
Administration Guide.
When associating individual hosts with a cluster queue, the name of the
resulting queue instance on each host combines the cluster queue name
with the host name. The cluster queue name and the host name are
separated by an @ sign. For example, if associating the host myexechost
with the cluster queue myqueue, the resulting queue instance is called
myqueue@myexechost.
When associating a host group with a cluster queue, a queue domain is
created. Queue domains enable management of groups of queue instances
that are part of the same cluster queue and whose assigned hosts are
part of the same host group. A queue domain name combines a cluster
queue name with a host group name, separated by an @ sign. For example,
if the host group @myhostgroup (host group names must start with an @)
is associated with the cluster queue myqueue, the resulting queue domain
is myqueue@@myhostgroup.
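The two naming rules can be summarized in a few lines of shell.  This
is just an illustration with made-up function names; the resulting
strings are the kind of names that can be given to the -q option of
qsub or qstat.

```shell
# queue_instance QUEUE HOST       -> QUEUE@HOST
queue_instance() {
    echo "$1@$2"
}

# queue_domain QUEUE HOSTGROUP    -> QUEUE@@HOSTGROUP
# The leading "@" of the host group name is added if it was left off.
queue_domain() {
    case $2 in
        @*) echo "$1@$2" ;;
        *)  echo "$1@@$2" ;;
    esac
}

# Examples:
# queue_instance myqueue myexechost
# queue_domain myqueue @myhostgroup
```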
1.6.1. Adding Queues
Using qmon, click the "Queue Control" button and then click the "Add"
button.  First, enter the "Queue Name" (by convention, queue names
always end in .q as in "fast.q").  Choose the name with care because it
cannot be changed later.
Next, enter a host or host group name in the "New Host/Hostgroup" box
and click the red left arrow.  Enter as many hosts or host groups as
needed; their names will appear in the "Hostlist" box at the top left
of the window.
The "@/" listing in the "Attributes for Host/Hostgroup" list on the
lower left of the window denotes attributes that are the default for
each host or host group in this queue.  Hosts or host groups from the
Hostlist box can be added to this listing and their attributes
specified differently from the defaults by entering their name in the
"New Host/Hostgroup" box and clicking the red up arrow.
NOTE: When changing an attribute, the padlock icon associated with the
attribute may need to be clicked to unlock the field for entry.
1.6.2. General Configuration
The following attributes will need to be set for queues in the tab with
the "General Configuration" label.
1.6.2.1. Processors
Set to the number of processors either as the default or for the
specific queue.
1.6.2.2. Slots
This is the number of jobs that can be active on a host simultaneously. 
This can be more than the number of processors if host over-scheduling
is desired.  Also, it could be twice the number of processors if they
are Intel processors with hyper-threading.  Experiment with what gives
the best overall performance for the host.
1.6.2.3. Notify Time
Scripts can catch signals sent by the Grid Engine to know when a job is
about to be killed.  Because of this, at least 1 minute of time should
be set in this field to allow for delays under heavy processor loads.
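On the script side, catching the warning might look like the sketch
below.  It assumes the job is submitted with "qsub -notify", in which
case SGE sends SIGUSR2 ahead of SIGKILL (and SIGUSR1 ahead of SIGSTOP),
with the delay given by the Notify Time.  The variable and function
names are made up for this example.

```shell
# Flag that the main loop of a job can test in order to stop early.
stop_requested=0

# Handler for the warning signal.  Assumption: the job was submitted
# with "qsub -notify", so SIGUSR2 arrives before SIGKILL.
on_kill_warning() {
    stop_requested=1
    echo "kill warning received, saving state" >&2
}
trap on_kill_warning USR2
```

A long-running loop in the job would then check stop_requested between
work items and write out any partial results before exiting.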
1.7. Cluster Configuration
1.7.1. Global Job Submission Parameters
The file $SGE_ROOT/<cellname>/common/sge_request contains default
parameters for the qsub command.  For ease of writing shell scripts
which will be submitted to SGE to run, the following parameters are
suggested.  These parameters can be ignored by using the -clear
parameter as the first parameter on the qsub command line.
-w e    Give errors and exit if a job being submitted can never be
        scheduled.
-V      Export all variables from the user's environment into the job's
        environment.  By default, SGE builds a very minimal environment.
-cwd    Run the job in the directory from which it was submitted.  By
        default SGE will run the job from the user's login directory.
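As a sketch of how the file could be created, the function below writes
the three suggested options, one per line.  The function name is
invented, and the example path assumes the default cell name.

```shell
# write_sge_request FILE
# Writes the three suggested default qsub options to FILE.
write_sge_request() {
    cat > "$1" <<'EOF'
# Default qsub options for all jobs in this cell
-w e
-V
-cwd
EOF
}

# Example (assumes the default cell name; back up any existing file first):
# write_sge_request "$SGE_ROOT/default/common/sge_request"
```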
1.7.2. Job/Shell Scripts
1.7.2.1. Shell Start Mode
For ease of use of custom written scripts, set the global cluster
shell start mode so that the shell for the script to use is given by the
first line of the script, just as if it was run directly from a command
line. To do this, configure the cluster global configuration (see
sge_conf(5)) with the following:
shell_start_mode	unix_behavior

Using qmon, click the "Cluster Configuration" button, select "global" in
the left column and click the "Modify" button.  In the General Settings
tab, find "Shell Start Mode" along the left side and set the selection
to unix_behavior.
1.7.2.2. Active Comments
SGE allows for what are called active comments.  These comments are a
way to embed command line arguments to qsub in scripts.  By default,
active comments are found on lines that begin with "#$" (the "$" can be
changed).  The following are generally useful:

#$ -o /dev/null -j y
        Send all job output to the bit bucket.  By default, SGE will
        send output to the file <jobname>.o<jobid>, with .<tasknumber>
        appended for array jobs.
#$ -m e
        Send a notification email to the user submitting the job when
        it completes.  Additional letters can be used in place of or
        appended to the "e" with the following meaning:
        `b'     Mail is sent at the beginning of the job.
        `e'     Mail is sent at the end of the job.
        `a'     Mail is sent when the job is aborted or rescheduled.
        `s'     Mail is sent when the job is suspended.
        `n'     No mail is sent.
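
A small job script using active comments could look like the sketch
below.  The particular options chosen are only an example; when the
script is run by hand instead of through qsub, the "#$" lines are
ordinary comments and are ignored by the shell.

```shell
#!/bin/sh
# Active comments: merge stderr into stdout, send mail at the end of
# the job, and run in the directory the job was submitted from.
#$ -j y
#$ -m e
#$ -cwd

# JOB_ID is set by SGE; default it so the script also runs by hand.
: ${JOB_ID=none}
message="job $JOB_ID running on `uname -n`"
echo "$message"
```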
1.7.2.3. SGE Environmental Variables
SGE makes available a number of environment variables for use by job
scripts.  In order to make scripts run with or without SGE, the
following lines have been found to be useful.  Note that the syntax
": ${<variable>=<value>}"
tells the shell to set the given variable to the given value IF the
variable is not already set.

# Set up restart status
: ${RESTARTED=0}

# Get our host name without any domain name.
# (HOSTNAME is not set by every shell, so fall back to uname.)
: ${HOSTNAME=`uname -n`}
xeqHost=`echo "$HOSTNAME" | sed 's/\..*//'`

# Get the name of the host that originally submitted the job
: ${SGE_O_HOST=`uname -n`}
submitHost=`echo "$SGE_O_HOST" | sed 's/\..*//'`

# Set a default task number if not using SGE
: ${SGE_TASK_ID=1}
# If SGE was not given a task number
if [ "$SGE_TASK_ID" = "undefined" ]
then
	SGE_TASK_ID=1
fi

# Get our command name if not being run by SGE
: ${REQUEST=$0}
myName=$REQUEST
cmdRoot=`basename "$myName"`
myPath=`dirname "$myName"`


1.8. Job Submission
1.8.1. Submitting Jobs
Chapter 3 of the User's Guide is an excellent resource for understanding
how to submit job scripts or binary executables to SGE.  One additional
thing should be noted.  When using the qsub command, SGE does not search
the PATH variable to find the command being submitted. In other words,
if the following command is entered:

qsub -q fast.q -t 1-10 myScript

it will not work unless myScript is in the current directory.  Qsub will
issue the error message "Unable to read script file because of error:
error opening myScript: No such file or directory".
To prevent this, use the which command to find the script and then give
it to qsub as follows:

qsub -q fast.q -t 1-10 `which myScript`

It is also acceptable to just type in the absolute pathname to the
script: 
qsub -q fast.q -t 1-10 /home/john/.bin/myScript
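
The PATH lookup can also be wrapped so that qsub is only called when
the script was really found.  The following is a sketch with a made-up
function name; it uses "command -v" rather than which, and rejects
anything that does not resolve to an absolute path.

```shell
# resolve_script NAME
# Prints the absolute path of NAME as found on PATH.  Fails if NAME is
# not found or does not resolve to a file path (for example, if it is a
# shell function or alias).
resolve_script() {
    script_path=`command -v "$1"` || return 1
    case $script_path in
        /*) echo "$script_path" ;;
        *)  return 1 ;;
    esac
}

# Example:
# script=`resolve_script myScript` && qsub -q fast.q -t 1-10 "$script"
```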

On Wed, 2005-04-06 at 11:39, Schmitz Dale M Contr 20 IS/INPTG wrote:

> The job is a script...is there something else I must do for the engine?
> 
> -----Original Message-----
> From: raysonho at eseenet.com [mailto:raysonho at eseenet.com] 
> Sent: Wednesday, April 06, 2005 11:35 AM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Newbie questions
> 
> >Initial attempts at
> >running an application on the engine have all failed
> 
> How did you submit your jobs, and did you create a job script??
> 
> 
> > Does my software require recompiling for the grid engine
> > environment?  
> 
> No.
> 
> Rayson
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net


