[GE users] New, single machine setup, no submitted jobs being processed

jcholewa jcholewa at nshs.edu
Wed Nov 4 21:38:43 GMT 2009

Hi!  I'm a first-time admin for Grid Engine, and I'm struggling a bit.  This is a single 16-core Opteron machine, and I was asked to install both the master and the execution host on it.  No other machines are involved.  The sge_qmaster and sge_execd services are running (both accept connections via `telnet localhost 6444` and `telnet localhost 6445`, though neither gives any output).  I wrote a very simple script (essentially `touch /tmp/somefilename`) and tried to run it with qsub.  Here's the output:

# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
      2 0.55500 qwe        root         qw    11/04/2009 15:41:18
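
A minimal sketch of the kind of test job described above, for anyone trying to reproduce this. The script path /tmp/qwe matches the script_file shown by `qstat -j`; the marker file name is only a placeholder (an assumption), and the qsub call is skipped when qsub is not on the PATH.

```shell
#!/bin/sh
# Sketch of the test job. /tmp/qwe is the script_file reported by qstat -j;
# /tmp/somefilename is a placeholder marker path (assumption).
job_script=/tmp/qwe
marker=/tmp/somefilename

# Write a one-line job script that only touches the marker file.
cat > "$job_script" <<EOF
#!/bin/sh
touch $marker
EOF
chmod +x "$job_script"

# Submit only if qsub is available (i.e. settings.sh has been sourced).
if command -v qsub >/dev/null 2>&1; then
    qsub "$job_script"
else
    echo "qsub not found; source /opt/sge/default/common/settings.sh first"
fi
```

If the job ever runs, the marker file should appear on the execution host.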

# qstat -j 2
job_number:                 2
exec_file:                  job_scripts/2
submission_time:            Wed Nov  4 15:41:18 2009
owner:                      root
uid:                        0
group:                      root
gid:                        0
sge_o_home:                 /root
sge_o_log_name:             root
sge_o_path:                 /opt/sge/bin/lx24-amd64:/usr/sbin:/bin:/usr/bin:/sbin
sge_o_shell:                /bin/bash
sge_o_workdir:              /root
sge_o_host:                 sun
account:                    sge
mail_list:                  root at sun
notify:                     FALSE
job_name:                   qwe
jobshare:                   0
script_file:                /tmp/qwe
scheduling info:            (Collecting of scheduler job information is turned off)
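
The last line of that output suggests the scheduler is not recording why the job stays pending. A hedged sketch of how that could be checked, assuming the `schedd_job_info` parameter in the scheduler configuration is what controls it: the helper name `schedd_job_info_value` is made up for this example, and it only parses text in the `qconf -ssconf` format.

```shell
#!/bin/sh
# Hypothetical helper: print the schedd_job_info value found in scheduler
# configuration text (the "name value" format of `qconf -ssconf`) on stdin.
schedd_job_info_value() {
    awk '$1 == "schedd_job_info" { print $2 }'
}

# Query the live configuration only if qconf is on the PATH.
if command -v qconf >/dev/null 2>&1; then
    qconf -ssconf | schedd_job_info_value
fi
# To change the value, `qconf -msconf` opens the scheduler configuration
# in an editor; setting schedd_job_info to true should make `qstat -j`
# list reasons for pending jobs.
```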

And here are the settings I'm showing with qconf:

# qconf -sel

# qconf -secl
      ID NAME            HOST
       1 scheduler       sun
# qconf -sh

# qconf -shgrpl

# qconf -sm

# qconf -so
no operator defined

# qconf -sql

# qconf -ss

# qconf -sss

# qconf -sul

# qconf -suserl

# qconf -mhgrp @allhosts
group_name @allhosts
hostlist sun
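
The listings above can be condensed with a small loop that flags every qconf listing that prints nothing (for instance `-sel` and `-sql` in the output above). This is only a sketch; `report_listing` is a made-up helper name, and the loop is skipped when qconf is not on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: run a command and flag it when it prints nothing.
report_listing() {
    out=$("$@" 2>&1)
    if [ -z "$out" ]; then
        echo "EMPTY: $*"
    else
        echo "ok: $*"
    fi
}

# Check the qconf listings shown above, if qconf is available.
if command -v qconf >/dev/null 2>&1; then
    for opt in -sel -sh -ss -sql -shgrpl -sul -suserl; do
        report_listing qconf "$opt"
    done
fi
```

`qconf -so` is left out because it prints "no operator defined" rather than an empty list.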

On previous install attempts I have also tried adding "sun" as an exec host (`qconf -ae`), using "localhost" in every place where "sun" is listed above, adding the manager as an operator (`qconf -ao`), and adding regular users as operators.

I've been using the regular Grid Engine documentation as well as "http://biowiki.org/HowToAdministerSunGridEngine" for assistance, but sge_qmaster just isn't picking up the jobs (I've tried different test scripts, btw).  I'm using openSUSE 11.1, which may itself be a problem (Sun doesn't seem to know, across the board, that SUSE has gone past 10.3 these days), but I'm hoping to troubleshoot the problem before I have to reinstall with a much older version of the operating system.  Does anybody have any potentially fruitful suggestions?

Below is a compressed summary of my install process (lines starting with '#' are commands I typed in as root, lines starting with '>' are output from the installer, and anything surrounded by '{{' and '}}' describes the settings I chose and the steps I took).

# tar -zxf ge62_lx24-amd64.tar.gz
# cd ge62
# tar -zxf ge-6.2-bin-lx24-amd64.tar.gz
# tar -zxf ge-6.2-common.tar.gz
{{removed entry from /etc/hosts that YaST added but inst_sge generally balks at}}
# ./inst_sge -m -x
{{installed as root}}
{{set Grid Engine root directory to /opt/sge}}
{{using network service to run sge_qmaster and sge_execd}}
{{cell name = default}}
{{cluster name = p6444 (default)}}
{{spool = /opt/sge/default/spool/qmaster (default)}}
{{no Windows Execution Hosts}}
{{had the installer check and set file permissions}}
{{all (one) hosts of cluster in one DNS domain}}
> creating directory: /opt/sge/default/spool/qmaster
> creating directory: /opt/sge/default/spool/qmaster/job_scripts
{{berkeleydb used for spooling}}
{{no bdb spooling server needed}}
{{bdb dir = /opt/sge/default/spool/spooldb}}
{{group id range: 20000-21000 (default)}}
{{execd_spool_dir = /opt/sge/default/spool}}
{{administrator_mail = none}}
> cp /opt/sge/default/common/sgemaster /etc/init.d/sgemaster.p6444
> /usr/lib/lsb/install_initd /etc/init.d/sgemaster.p6444
{{sge_qmaster started and verified with pgrep}}
{{Host(s): sun (convenient name of this machine, as it's the only sun box around) ... should "localhost" be added?}}
> adminhost "sun" already exists
> sun added to submit host list
{{declined adding shadow hosts}}
> root at sun added "@allhosts" to host group list
> root at sun added "all.q" to cluster queue list
{{used Configuration: Normal (default)}}
> You should now enter the command:
> source /opt/sge/default/common/settings.csh
> if you are a csh/tcsh user or
> . /opt/sge/default/common/settings.sh
> if you are a sh/ksh user.
> Grid Engine messages can be found at:
>    /tmp/qmaster_messages (during qmaster startup)
>    /tmp/execd_messages   (during execution daemon startup)
> After startup the daemons log their messages in their spool directories.
>    Qmaster:     /opt/sge/default/spool/qmaster/messages
>    Exec daemon: <execd_spool_dir>/<hostname>/messages
> Grid Engine startup scripts can be found at:
>    /opt/sge/default/common/sgemaster (qmaster)
>    /opt/sge/default/common/sgeexecd (execd)
> You may verify your administrative hosts with the command
>    # qconf -sh
> and you may add new administrative hosts with the command
>    # qconf -ah <hostname>
> The port for sge_execd is currently set as service.
>    sge_execd service set to port 6445
> This hostname is known at qmaster as an administrative host.
> The spool directory is currently set to:
> <</opt/sge/default/spool/sun>>
> root at sun modified "sun" in configuration list
> Local configuration for host >sun< created.
> cp /opt/sge/default/common/sgeexecd /etc/init.d/sgeexecd.p6444
> /usr/lib/lsb/install_initd /etc/init.d/sgeexecd.p6444
{{sge_execd started and verified with pgrep}}
{{added "a default queue instance for this host"}}
> root at sun modified "@allhosts" in host group list
> root at sun modified "all.q" in cluster queue list
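
The log locations the installer printed above can be inspected in one pass. A minimal sketch, assuming the execd spool path expands to /opt/sge/default/spool/sun as the installer reported; `show_log_tail` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the last 20 lines of each readable log file
# and note the ones that are missing or unreadable.
show_log_tail() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "== $f"
            tail -n 20 "$f"
        else
            echo "== $f (not readable)"
        fi
    done
}

# Startup logs and spool-directory logs named in the installer output.
show_log_tail \
    /tmp/qmaster_messages \
    /tmp/execd_messages \
    /opt/sge/default/spool/qmaster/messages \
    /opt/sge/default/spool/sun/messages
```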

