[GE users] I need help badly.

Chris Dagdigian dag at sonsorol.org
Wed Feb 28 22:56:05 GMT 2007


You are putting a ton of effort into a really ancient version of Grid  
Engine that is not being actively developed any more (except for bug  
and security fixes I believe ..).

If this is a new install I'd *strongly* encourage you (again) to drop  
this version, surf on over to http://gridengine.sunsource.net and  
download the binaries for Grid Engine 6.0u10.  Is there a reason why  
you are using this old package?

If you insist on keeping going with this version, it looks like you  
still have hostname issues (or at least an old queue or bootstrap  
file with the bad localhost.localdomain name still in it).
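
If it helps, here is a rough sketch for hunting down the stale name. It assumes the cell directory is /gridware/sge/default (taken from the qmaster_spool_dir in the configuration quoted below) and that the files under it are plain text. The function name find_stale_hostname is made up for illustration and is not part of Grid Engine.

```shell
#!/bin/sh
# find_stale_hostname DIR NAME
# Print every file under DIR that still contains the literal string NAME.
# Hypothetical helper, not part of Grid Engine.
find_stale_hostname() {
    # -r recurse, -l list file names only, -F treat NAME as a fixed string
    grep -rlF -- "$2" "$1" 2>/dev/null
}

# Example (assumed path, adjust to your install):
#   find_stale_hostname /gridware/sge/default localhost.localdomain
```

Any file it lists is a candidate for the old queue or bootstrap file mentioned above.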

But your biggest problem as shown below comes from this telling error  
message:

>    starting sge_schedd
> critical error: scheduler already running


... that means that Grid Engine was not shut down completely in the  
past.

When you run "rcsge" on the master node, Grid Engine tries to start  
the "sge_qmaster" and "sge_schedd" programs at the same time. The  
error message above is telling you that there is already a running  
sge_schedd on the system when there shouldn't be.

The best thing to do is to cleanly shut down Grid Engine again.

And before you start it up again do this ...

   $ ps ax | grep sge

... you should *NOT* see any of the following:

   sge_commd
   sge_qmaster
   sge_schedd
   sge_execd

If you see any of those running you need to kill them off before  
retrying the "./rcsge" script.  The 'rcsge' startup script does  
not have a hope of working properly if there are old processes  
running that are already bound to the SGE ports or otherwise  
messing things up.
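
A small sketch of that check, in case you want something repeatable. It reads "ps" output on stdin and prints the name of every SGE daemon it finds. The function name list_leftover_daemons is hypothetical, and the daemon list includes sge_schedd because that is the daemon the error message complains about.

```shell
#!/bin/sh
# Daemons that must not be running before a restart.
SGE_DAEMONS="sge_commd sge_qmaster sge_schedd sge_execd"

# list_leftover_daemons: read `ps` output on stdin and print the name of
# each SGE daemon that appears in it. Hypothetical helper for illustration.
list_leftover_daemons() {
    # Capture the whole process listing first so each daemon name can be
    # searched for in the same snapshot.
    ps_output=$(cat)
    for name in $SGE_DAEMONS; do
        if printf '%s\n' "$ps_output" | grep -q "$name"; then
            printf '%s\n' "$name"
        fi
    done
}

# Example:
#   ps ax | list_leftover_daemons
```

Empty output means none of the listed daemons were found in that snapshot; anything it prints should be killed off before starting Grid Engine again.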

Basically you should clear this up first; it may be that most of your  
problems are coming from old SGE daemons still hanging around from  
prior install or startup attempts.

That said though, your best bet is to delete all of /gridware/sge*  
and start fresh with Grid Engine 6.0u10.  The hostname and DNS work  
you have done already will help make the 6.x install proceed.


Regards,
Chris




On Feb 28, 2007, at 4:18 PM, Trey wrote:

> OK, this worked. However, whenever I try to install the execution  
> hosts they can't contact the qmaster.  I tracked it down to the  
> service not running.  When I start up rcsge it loads everything  
> except that it fails at localhost.q. I cannot locate localhost.q to  
> fix it.
>
>
> [root at centos1 common]# ./rcsge
>    starting sge_qmaster
> Reading in complexes:
>         Complex "host".
>         Complex "queue".
> Reading in execution hosts.
> Reading in administrative hosts.
> Reading in submit hosts.
> Reading in queues:
>         Queue "localhost.q".
> error: can't resolve hostname "localhost.localdomain"
> critical error: setup failed
>    starting sge_schedd
> critical error: scheduler already running
>
>
> I also get another error that I can't seem to track down:
>
>
> [root at centos1 common]# ./rcsge -qmaster
>    starting sge_qmaster
> starting program: /gridware/sge/bin/lx24-amd64/sge_commd
> using service "sge_commd"
> bound to port 536
> Reading in complexes:
>         Complex "host".
>         Complex "queue".
> Reading in execution hosts.
> Reading in administrative hosts.
> Reading in submit hosts.
> Reading in parallel environments:
>         PE "make".
> Reading in scheduler configuration
>    starting sge_schedd
> error: getting configuration: unable to contact qmaster via "" commd - qmaster not enrolled at commd
> error: can't get configuration from qmaster -- backgrounding
>
>
> How can I get this working?
>
>
> Chris Dagdigian wrote:
>> First things first, you are using a really old version of Grid  
>> Engine (the 5.3 series ...)
>> It would be a very unusual case for any *new* installation to  
>> require SGE 5.3.x
>> So the first thing you should do is head on over to  
>> http://gridengine.sunsource.net and grab the latest version of Grid  
>> Engine 6.0 binaries. The latest is 6.0u10.
>> Next you may want to take a look at some stuff I wrote a long time  
>> ago; it covers some of the pre-install things that can be  
>> significant:
>> http://gridengine.info/articles/2005/09/29/things-to-think-about-before-installing
>> -Chris
>> On Feb 27, 2007, at 11:00 AM, Trey wrote:
>>> I need major help.  I am desperately trying to install Grid Engine  
>>> as a cluster software package on 3 servers that run CentOS.  I  
>>> have unpacked it and added a script, /etc/profile.d/sge.sh.  It  
>>> sets a path to the executables and sets a variable SGE_ROOT to  
>>> /gridware/sge.  When I try to run a script to add a host I get:
>>>
>>> [root at centos1 exec_hosts]# qconf -ah centos2.hyper.com
>>> critical error: Please set the environment variable SGE_ROOT.
>>>
>>>
>>> Also, when I try to reinstall the qmaster I get all sorts of  
>>> permission denied errors and "unable to resolve  
>>> localhost.localdomain".
>>>
>>>
>>> How can I fix this?
>>>
>>>
>>>
>>> settings.sh
>>>
>>> [root at centos1 common]# less settings.sh
>>> SGE_ROOT=/gridware/sge; export SGE_ROOT
>>>
>>> ARCH=`$SGE_ROOT/util/arch`
>>> DEFAULTMANPATH=`$SGE_ROOT/util/arch -m`
>>> MANTYPE=`$SGE_ROOT/util/arch -mt`
>>>
>>> unset SGE_CELL
>>> unset COMMD_PORT
>>>
>>> if [ "$MANPATH" = "" ]; then
>>>    MANPATH=$DEFAULTMANPATH
>>> fi
>>> MANPATH=$SGE_ROOT/$MANTYPE:$MANPATH; export MANPATH
>>>
>>> PATH=$SGE_ROOT/bin/$ARCH:$PATH; export PATH
>>> shlib_path_name=`$SGE_ROOT/util/arch -lib`
>>> old_value=`eval echo '$'$shlib_path_name`
>>> if [ x$old_value = x ]; then
>>>    eval $shlib_path_name=$SGE_ROOT/lib/$ARCH
>>> else
>>>    eval $shlib_path_name=$SGE_ROOT/lib/$ARCH:$old_value
>>> fi
>>> export $shlib_path_name
>>> unset ARCH DEFAULTMANPATH MANTYPE shlib_path_name
>>>
>>>
>>>
>>> SGE.sh ( A file to make sure it is running when started)
>>>
>>> [root at centos1 common]# less /etc/profile.d/sge.sh
>>> SGE_ROOT=/gridware/sge
>>> PATH=$PATH:$SGE_ROOT/bin/lx24-amd64
>>> if [ -r $SGE_ROOT/default/common/settings.sh ]; then
>>> . $SGE_ROOT/default/common/settings.sh
>>> fi
>>>
>>> [root at centos1 common]# less act_qmaster
>>> centos1.hyper.com
>>>
>>>
>>>
>>> Configuration file
>>>
>>> # Version: 5.3
>>> #
>>> # DO NOT MODIFY THIS FILE MANUALLY!
>>> #
>>> conf_version           0
>>> qmaster_spool_dir      /gridware/sge/default
>>> execd_spool_dir        /gridware/sge/default
>>> binary_path            /gridware/sge/bin
>>> mailer                 /bin/mail
>>> xterm                  /usr/bin/X11/xterm
>>> load_sensor            none
>>> prolog                 none
>>> epilog                 none
>>> shell_start_mode       posix_compliant
>>> login_shells           sh,ksh,csh,tcsh
>>> min_uid                0
>>> min_gid                0
>>> user_lists             none
>>> xuser_lists            none
>>> load_report_time       00:00:40
>>> stat_log_time          48:00:00
>>> max_unheard            00:05:00
>>> reschedule_unknown     00:00:00
>>> loglevel               log_warning
>>> administrator_mail     none
>>> set_token_cmd          none
>>> pag_cmd                none
>>> token_extend_time      none
>>> shepherd_cmd           none
>>> qmaster_params         none
>>> schedd_params          none
>>> execd_params           none
>>> finished_jobs          100
>>> gid_range              20000-20100
>>> admin_user             none
>>> qlogin_command         telnet
>>> qlogin_daemon          /usr/sbin/in.telnetd
>>> rlogin_daemon          /usr/sbin/in.rlogind
>>> default_domain         root
>>> ignore_fqdn            true
>>> max_aj_instances       2000
>>> max_aj_tasks           75000
>>> max_u_jobs             0
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net