[GE users] Trying to use starter_method with sge 6.0

Robert Olson olson at mcs.anl.gov
Mon Nov 8 13:59:25 GMT 2004


Hmm. How do you configure it? I was assuming this is one of the 
settings set with 'qconf -mq qname'.
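For reference, starter_method is an attribute of the queue configuration described in queue_conf(5). That makes it one of the entries edited with 'qconf -mq <queue>'. Its value is either NONE (the default) or the path of an executable. That executable presumably has to exist on every execution host, because it runs there in place of the job. A sketch of the edited line, using a hypothetical wrapper path:

```
starter_method        /home/olson/SGE/starter.sh
```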

--bob

On Nov 8, 2004, at 8:01 AM, Andreas Haas wrote:

> I'm not aware of any problems with starter_method in 6.0.
> It also works nicely in the current maintrunk: a starter
> procedure such as
>
>    #!/bin/sh
>    echo "--- start"
>    "$@"
>    echo "--- stop"
>
> gets me the expected
>
>    --- start
>    Here I am. Sleeping now at: Mon Nov 8 13:56:55 MET 2004
>    Now it is: Mon Nov 8 13:57:55 MET 2004
>    --- stop
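The following is a slightly more defensive variant of the starter above. It is offered as a sketch, not as anything the SGE documentation prescribes. It passes the job's arguments through as "$@", so arguments containing spaces survive intact. It also returns the job's own exit status rather than that of the final echo, which is always 0. This relies on an assumption: because the starter runs in place of the job command, the shepherd is taken to report the starter's exit status as the job's. The function name run_job is illustrative only.

```shell
#!/bin/sh
# Sketch of a starter_method wrapper. The shepherd is assumed to pass the
# job command line as this script's arguments.

run_job() {
    echo "--- start"
    "$@"                  # run the job command; "$@" keeps each argument intact
    job_status=$?         # capture the job's exit status before echo overwrites $?
    echo "--- stop"
    return "$job_status"  # hand the job's status back instead of echo's 0
}

# The script exits with the status of its last command, i.e. the job's.
run_job "$@"
```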
>
> Regards,
> Andreas
>
> On Mon, 8 Nov 2004, Robert Olson wrote:
>
>> Hi --
>>
>> I'm trying to use the starter_method queue property with an SGE 6.0
>> installation. As far as I can tell, this property is being ignored,
>> even though the documentation indicates that it should work.
>>
>> I ended up running an sge_execd with debugging on, and I see that the
>> parameter never shows up as being read. I know the queue configs
>> are being seen, because if I change another value in them I see the
>> change show up in this output.
>>
>> What I'm trying to do is run some portable scripts across different
>> clusters that have different filesystem layouts; the starter_method
>> would be a great way to do this. I've tried running a script without
>> the starter, assuming the execvp() in the shepherd would find it in the
>> PATH, but apparently the PATH that the execd is running with is
>> different from the one I gave it (I started it from my shell).
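For the portability problem described above, one possible approach is a per-cluster starter that adjusts PATH before running the job. The job script could then be found by name, wherever each cluster installs it. This is a sketch under stated assumptions. It assumes the job command reaches the starter as a bare name. SITE_TOOLS_DIR and its default /opt/site/bin are hypothetical placeholders that each cluster's administrator would set to match the local layout, and the function name run_in_site_env is illustrative.

```shell
#!/bin/sh
# Hypothetical per-cluster starter_method: prepend a site-specific tool
# directory to PATH, then run the job command given as arguments.
# SITE_TOOLS_DIR and the default /opt/site/bin are placeholders.

SITE_TOOLS_DIR=${SITE_TOOLS_DIR:-/opt/site/bin}

run_in_site_env() {
    PATH="$SITE_TOOLS_DIR:$PATH"   # site tools are searched first
    export PATH
    "$@"                           # a bare command name is looked up in the updated PATH
}

# The script's exit status is that of the job command.
run_in_site_env "$@"
```

Each cluster would install its own copy of this starter with its own SITE_TOOLS_DIR. The job scripts themselves could then stay identical across clusters.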
>>
>> Anyway, below is a copy of the debugging output showing the
>> configuration the execd reads. If anyone has any insights, I'd love to
>> hear them. (So far I've found that SGE has had mechanisms to solve the
>> problems I've run into, so I'm hoping this one is no different...)
>>
>> thanks,
>> --bob
>>
>> olson at tg-c052:~/SGE/default/common> env SGE_DEBUG_LEVEL="10 0 0 0 0 0 0 0" ../../bin/lx24-ia64/sge_execd
>>       0  18953 1024     Getting host by name - Linux
>>       1  18953 1024     1 names in h_addr_list
>>       2  18953 1024     0 names in h_aliases
>>       3  18953 1024     me.who                      >19<
>>       4  18953 1024     me.sge_formal_prog_name     >execd<
>>       5  18953 1024     me.qualified_hostname       >tg-c052.uc.teragrid.org<
>>       6  18953 1024     me.unqualified_hostname     >tg-c052<
>>       7  18953 1024     me.uid                      >762<
>>       8  18953 1024     me.gid                      >100<
>>       9  18953 1024     me.daemonized               >0<
>>      10  18953 1024     me.user_name                >olson<
>>      11  18953 1024     me.default_cell             >default<
>>      12  18953 1024     sge_root            >/home/olson/SGE<
>>      13  18953 1024     cell_root           >/home/olson/SGE/default<
>>      14  18953 1024     conf_file           >/home/olson/SGE/default/common/bootstrap<
>>      15  18953 1024     bootstrap_file      >/home/olson/SGE/default/common/configuration<
>>      16  18953 1024     act_qmaster_file    >/home/olson/SGE/default/common/act_qmaster<
>>      17  18953 1024     acct_file           >/home/olson/SGE/default/common/accounting<
>>      18  18953 1024     reporting_file      >/home/olson/SGE/default/common/reporting<
>>      19  18953 1024     local_conf_dir      >/home/olson/SGE/default/common/local_conf<
>>      20  18953 1024     shadow_masters_file >/home/olson/SGE/default/common/shadow_masters<
>>      21  18953 1024     admin_user          >olson<
>>      22  18953 1024     default_domain      >uc.teragrid.org<
>>      23  18953 1024     ignore_fqdn         >false<
>>      24  18953 1024     spooling_method     >classic<
>>      25  18953 1024     spooling_lib        >libspoolc<
>>      26  18953 1024     spooling_params     >/home/olson/SGE/default/common;/home/olson/SGE/default/spool/qmaster<
>>      27  18953 1024     binary_path         >/home/olson/SGE/bin<
>>      28  18953 1024     qmaster_spool_dir   >/home/olson/SGE/default/spool/qmaster<
>>      29  18953 1024     security_mode        >none<
>>      30  18953 1024     ../libs/gdi/sge_any_request.c 228 starting up communication without threads
>>      31  18953 1024     Getting host by name - Linux
>>      32  18953 1024     1 names in h_addr_list
>>      33  18953 1024     0 names in h_aliases
>>      34  18953 1024     me.qualified_hostname: tg-c052.uc.teragrid.org
>>      35  18953 1024     returning port value: 10111
>>      36  18953 1024     Getting host by name - Linux
>>      37  18953 1024     1 names in h_addr_list
>>      38  18953 1024     0 names in h_aliases
>>      39  18953 1024     returning port value: 10112
>>      40  18953 1024     Getting host by name - Linux
>>      41  18953 1024     1 names in h_addr_list
>>      42  18953 1024     0 names in h_aliases
>>      43  18953 1024     ../libs/gdi/sge_any_request.c 362 created communication handel for component name "execd"
>>      44  18953 1024     qualified hostname: tg-c052.uc.teragrid.org
>>      45  18953 1024     get_configuration: unique for tg-c052.uc.teragrid.org: tg-c052.uc.teragrid.org
>>      46  18953 1024     requesting global and tg-c052.uc.teragrid.org
>>      47  18953 1024     reresolve port timeout in 600
>>      48  18953 1024     returning cached port value: 10111
>>      49  18953 1024     ../libs/gdi/sge_any_request.c 571 received from: a6.339.sc04.org,1
>>      50  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "log_warning" for loglevel
>>      51  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "/home/olson/SGE/default/spool" for execd_spool_dir
>>      52  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "/usr/bin/mail" for mailer
>>      53  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "/usr/bin/X11/xterm" for xterm
>>      54  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for load_sensor
>>      55  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "/home/olson/SGE/transfer-prolog $fs_stdin_file_staging $fs_stdin_host $fs_stdin_path $fs_stdin_tmp_p" for prolog
>>      56  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "/home/olson/SGE/transfer-epilog $fs_stdout_file_staging $fs_stdout_host $fs_stdout_path $fs_stdout_t" for epilog
>>      57  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "posix_compliant" for shell_start_mode
>>      58  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "sh,ksh,csh,tcsh" for login_shells
>>      59  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for min_uid
>>      60  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for min_gid
>>      61  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "20000-20100" for gid_range
>>      62  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "00:00:40" for load_report_time
>>      63  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "false" for enforce_project
>>      64  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "auto" for enforce_user
>>      65  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "00:05:00" for max_unheard
>>      66  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "log_warning" for loglevel
>>      67  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for administrator_mail
>>      68  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for set_token_cmd
>>      69  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for pag_cmd
>>      70  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for token_extend_time
>>      71  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for shepherd_cmd
>>      72  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for qmaster_params
>>      73  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for execd_params
>>      74  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "accounting=true reporting=false flush_time=00:00:15 joblog=false sharelog=00:00:00" for reporting_params
>>      75  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "100" for finished_jobs
>>      76  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "/usr/libexec/telnetd" for qlogin_daemon
>>      77  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "telnet" for qlogin_command
>>      78  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for rsh_daemon
>>      79  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for rsh_command
>>      80  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "/usr/libexec/rlogind" for rlogin_daemon
>>      81  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for rlogin_command
>>      82  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "00:00:00" for reschedule_unknown
>>      83  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "2000" for max_aj_instances
>>      84  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "75000" for max_aj_tasks
>>      85  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for max_u_jobs
>>      86  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for max_jobs
>>      87  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for reprioritize
>>      88  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for auto_user_oticket
>>      89  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for auto_user_fshare
>>      90  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for auto_user_default_project
>>      91  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "100" for auto_user_delete_time
>>      92  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "true" for delegated_file_staging
>>      93  18953 1024     conf.execd_spool_dir        >/home/olson/SGE/default/spool<
>>      94  18953 1024     conf.mailer                 >/usr/bin/mail<
>>      95  18953 1024     conf.prolog                 >/home/olson/SGE/transfer-prolog $fs_stdin_file_staging $fs_stdin_host $fs_stdin_path $fs_stdin_tmp_path<
>>      96  18953 1024     conf.epilog                 >/home/olson/SGE/transfer-epilog $fs_stdout_file_staging $fs_stdout_host $fs_stdout_path $fs_stdout_tmp_path $fs_stderr_file_staging $fs_stderr_host $fs_stderr_path $fs_stderr_tmp_path<
>>      97  18953 1024     conf.shell_start_mode       >posix_compliant<
>>      98  18953 1024     conf.login_shells           >sh,ksh,csh,tcsh<
>>      99  18953 1024     conf.administrator_mail     >none<
>>     100  18953 1024     conf.min_gid                >0<
>>     101  18953 1024     conf.min_uid                >0<
>>     102  18953 1024     conf.load_report_time       >40<
>>     103  18953 1024     conf.max_unheard            >300<
>>     104  18953 1024     conf.loglevel               >4<
>>     105  18953 1024     conf.xterm                  >/usr/bin/X11/xterm<
>>     106  18953 1024     conf.load_sensor            >none<
>>     107  18953 1024     conf.enforce_project        >false<
>>     108  18953 1024     conf.enforce_user           >auto<
>>     109  18953 1024     conf.set_token_cmd          >none<
>>     110  18953 1024     conf.pag_cmd                >none<
>>     111  18953 1024     conf.token_extend_time      >0<
>>     112  18953 1024     conf.shepherd_cmd           >none<
>>     113  18953 1024     conf.qmaster_params         >none<
>>     114  18953 1024     conf.execd_params           >none<
>>     115  18953 1024     conf.gid_range              >20000-20100<
>>     116  18953 1024     conf.zombie_jobs            >100<
>>     117  18953 1024     conf.qlogin_daemon          >/usr/libexec/telnetd<
>>     118  18953 1024     conf.qlogin_command         >telnet<
>>     119  18953 1024     conf.rsh_daemon             >none<
>>     120  18953 1024     conf.rsh_command            >none<
>>     121  18953 1024     conf.rlogin_daemon          >/usr/libexec/rlogind<
>>     122  18953 1024     conf.rlogin_command         >none<
>>     123  18953 1024     conf.reschedule_unknown     >0<
>>     124  18953 1024     conf.max_aj_instances       >2000<
>>     125  18953 1024     conf.max_aj_tasks           >75000<
>>     126  18953 1024     conf.max_u_jobs             >0<
>>     127  18953 1024     conf.max_jobs               >0<
>>     128  18953 1024     conf.reprioritize           >0<
>>     129  18953 1024     conf.auto_user_oticket      >0<
>>     130  18953 1024     conf.auto_user_fshare       >0<
>>     131  18953 1024     conf.auto_user_default_project >none<
>>     132  18953 1024     conf.auto_user_delete_time  >100<
>>     133  18953 1024     conf.delegated_file_staging >true<
>>     134  18953 1024     me.qualified_hostname: tg-c052.uc.teragrid.org
>>     135  18953 1024     chdir("/")----------------------------
>>     136  18953 1024     Making directories----------------------------
>>     137  18953 1024     chdir("/home/olson/SGE/default/spool")----------------------------
>>     138  18953 1024     chdir("tg-c052",me.unqualified_hostname)--------------------------
>>     139  18953 1024     Making directories----------------------------
>>     140  18953 1024     use_qidle: 0
>>     141  18953 1024     ---> 0.000000 0.000000 0.000000 - 0
>>     142  18953 1024     *****Checking In With qmaster*****
>>     143  18953 1024     reresolve port timeout in 600
>>     144  18953 1024     returning cached port value: 10111
>>     145  18953 1024     ../libs/gdi/sge_any_request.c 571 received from: a6.339.sc04.org,1
>>     146  18953 1024     ../daemons/common/shutdown.c 65 starting up 6.0u1
>>     147  18953 1024     ../daemons/execd/execd.c 223 User 'root' did not start the application
>>     148  18953 1024     ../daemons/execd/execd.c 231 successfully started PDC and PTF
>>     149  18953 1024     ../daemons/execd/reaper_execd.c 1224 checking for old jobs
>>     150  18953 1024     ../daemons/execd/reaper_execd.c 1245 no old jobs at startup
>>     151  18953 1024     ALIVE TEST OF MASTER
>>     152  18953 1024     ../libs/gdi/sge_any_request.c 688 qmaster is still running
>>     153  18953 1024     ../libs/gdi/sge_any_request.c 693 endpoint is up since 15 seconds and has status 0
>>     154  18953 1024     SENDING LOAD AND REPORTS
>>     155  18953 1024      REPORT_JOB
>>     156  18953 1024     reresolve port timeout in 600
>>     157  18953 1024     returning cached port value: 10111
>>     158  18953 1024     ../libs/gdi/sge_security.c 370 fromcommproc is empty string
>>     159  18953 1024     ../libs/gdi/sge_security.c 389 standard gdi request to qmaster
>>     160  18953 1024     receive_message_cach_n_ack() returns: got no message (//0)
>>     161  18953 1024     No jobs to start
>>     162  18953 1024     ALIVE TEST OF MASTER
>>     163  18953 1024     ../libs/gdi/sge_any_request.c 688 qmaster is still running
>>     164  18953 1024     ../libs/gdi/sge_any_request.c 693 endpoint is up since 16 seconds and has status 0
>>     165  18953 1024     SENDING LOAD AND REPORTS
>>     166  18953 1024      REPORT_LOAD
>>     167  18953 1024     ---> 0.000000 0.000000 0.000000 - 0
>>     168  18953 1024      REPORT_CONF
>>     169  18953 1024      REPORT_LICENSE
>>     170  18953 1024      REPORT_JOB
>>     171  18953 1024     reresolve port timeout in 598
>>     172  18953 1024     returning cached port value: 10111
>>     173  18953 1024     ----> was_communication_error: no error happened (1000)
>>     174  18953 1024     ====================[ DISPATCH EPOCH ]===========================
>>     175  18953 1024     ../libs/gdi/sge_security.c 370 fromcommproc is empty string
>>     176  18953 1024     ../libs/gdi/sge_security.c 389 standard gdi request to qmaster
>>     177  18953 1024     receive_message_cach_n_ack() returns: got no message (//0)
>>     178  18953 1024     ====================[ DISPATCH EPOCH ]===========================
>>     179  18953 1024     ../libs/gdi/sge_security.c 370 fromcommproc is empty string
>>     180  18953 1024     ../libs/gdi/sge_security.c 389 standard gdi request to qmaster
>>     181  18953 1024     receive_message_cach_n_ack() returns: got no message (//0)
>>     182  18953 1024     ====================[ DISPATCH EPOCH ]===========================
>>     183  18953 1024     ../libs/gdi/sge_security.c 370 fromcommproc is empty string
>>     184  18953 1024     ../libs/gdi/sge_security.c 389 standard gdi request to qmaster
>>     185  18953 1024     receive_message_cach_n_ack() returns: got no message (//0)
>>     186  18953 1024     ====================[ DISPATCH EPOCH ]===========================
>>     187  18953 1024     ../daemons/common/shutdown.c 96 controlled shutdown 6.0u1
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
>

