[GE users] Trying to use starter_method with sge 6.0

Andreas Haas Andreas.Haas at Sun.COM
Mon Nov 8 13:01:02 GMT 2004


I'm not aware of any problems with starter_method in 6.0.
It also works nicely in the current maintrunk system. A starter
procedure such as

   #!/bin/sh
   echo "--- start"
   "$@"   # run the job with its arguments intact
   echo "--- stop"

gives me the expected output:

   --- start
   Here I am. Sleeping now at: Mon Nov 8 13:56:55 MET 2004
   Now it is: Mon Nov 8 13:57:55 MET 2004
   --- stop
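
For the portability case described in the quoted message, the same idea
can be extended a little. The following is only a minimal sketch, not a
tested production starter: the function name run_job and the variable
EXTRA_PATH are hypothetical names chosen for illustration, not SGE
settings. It assumes the job command and its arguments are passed to the
starter as its own arguments, as in the procedure above.

```shell
#!/bin/sh
# Sketch of a starter method that adjusts PATH before running the job.
# run_job and EXTRA_PATH are illustrative names, not part of SGE.

run_job() {
    # Prepend a site-specific directory if one has been configured.
    if [ -n "${EXTRA_PATH:-}" ]; then
        PATH="$EXTRA_PATH:$PATH"
        export PATH
    fi
    echo "--- start"
    "$@"                # run the job command, keeping arguments intact
    status=$?
    echo "--- stop"
    return "$status"    # hand the job's exit status back to the caller
}

# When used as a starter, the job command arrives as the arguments.
if [ "$#" -gt 0 ]; then
    run_job "$@"
fi
```

Returning the job's exit status, rather than that of the final echo,
keeps a failing job visible to whatever invoked the starter.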

Regards,
Andreas

On Mon, 8 Nov 2004, Robert Olson wrote:

> Hi --
>
> I'm trying to use the starter_method queue property with an SGE 6.0
> installation. As far as I can tell, this property is being ignored,
> even though the documentation seems to show it should work.
>
> I ended up running an sge_execd with debugging on, and see that the
> parameter just doesn't show up as being read. I know the queue configs
> are being seen, since if I change another value in them I see the
> change show up here.
>
> What I'm trying to do is run some portable scripts across different
> clusters that have different filesystem layouts; the starter_method
> would be a great way to do this. I've tried running a script without
> the starter, assuming the execvp() in the shepherd would find it in the
> path, but apparently the path that the execd is running with is
> different than the one that I gave it (I started it from my shell).
>
> Anyway, following is a copy of the debugging output showing the
> queueing configuration. If anyone has any insights, I'd love to hear
> them. (I've found so far that SGE has had mechanisms to solve the
> problems I've run into, so am hoping this one is also that way ...)
>
> thanks,
> --bob
>
> olson@tg-c052:~/SGE/default/common> env SGE_DEBUG_LEVEL="10 0 0 0 0 0 0 0" ../../bin/lx24-ia64/sge_execd
>       0  18953 1024     Getting host by name - Linux
>       1  18953 1024     1 names in h_addr_list
>       2  18953 1024     0 names in h_aliases
>       3  18953 1024     me.who                      >19<
>       4  18953 1024     me.sge_formal_prog_name     >execd<
>       5  18953 1024     me.qualified_hostname
>  >tg-c052.uc.teragrid.org<
>       6  18953 1024     me.unqualified_hostname     >tg-c052<
>       7  18953 1024     me.uid                      >762<
>       8  18953 1024     me.gid                      >100<
>       9  18953 1024     me.daemonized               >0<
>      10  18953 1024     me.user_name                >olson<
>      11  18953 1024     me.default_cell             >default<
>      12  18953 1024     sge_root            >/home/olson/SGE<
>      13  18953 1024     cell_root           >/home/olson/SGE/default<
>      14  18953 1024     conf_file
>  >/home/olson/SGE/default/common/bootstrap<
>      15  18953 1024     bootstrap_file
>  >/home/olson/SGE/default/common/configuration<
>      16  18953 1024     act_qmaster_file
>  >/home/olson/SGE/default/common/act_qmaster<
>      17  18953 1024     acct_file
>  >/home/olson/SGE/default/common/accounting<
>      18  18953 1024     reporting_file
>  >/home/olson/SGE/default/common/reporting<
>      19  18953 1024     local_conf_dir
>  >/home/olson/SGE/default/common/local_conf<
>      20  18953 1024     shadow_masters_file
>  >/home/olson/SGE/default/common/shadow_masters<
>      21  18953 1024     admin_user          >olson<
>      22  18953 1024     default_domain      >uc.teragrid.org<
>      23  18953 1024     ignore_fqdn         >false<
>      24  18953 1024     spooling_method     >classic<
>      25  18953 1024     spooling_lib        >libspoolc<
>      26  18953 1024     spooling_params
>  >/home/olson/SGE/default/common;/home/olson/SGE/default/spool/qmaster<
>      27  18953 1024     binary_path         >/home/olson/SGE/bin<
>      28  18953 1024     qmaster_spool_dir
>  >/home/olson/SGE/default/spool/qmaster<
>      29  18953 1024     security_mode        >none<
>      30  18953 1024     ../libs/gdi/sge_any_request.c 228 starting up
> communication without threads
>      31  18953 1024     Getting host by name - Linux
>      32  18953 1024     1 names in h_addr_list
>      33  18953 1024     0 names in h_aliases
>      34  18953 1024     me.qualified_hostname: tg-c052.uc.teragrid.org
>      35  18953 1024     returning port value: 10111
>      36  18953 1024     Getting host by name - Linux
>      37  18953 1024     1 names in h_addr_list
>      38  18953 1024     0 names in h_aliases
>      39  18953 1024     returning port value: 10112
>      40  18953 1024     Getting host by name - Linux
>      41  18953 1024     1 names in h_addr_list
>      42  18953 1024     0 names in h_aliases
>      43  18953 1024     ../libs/gdi/sge_any_request.c 362 created
> communication handel for component name "execd"
>      44  18953 1024     qualified hostname: tg-c052.uc.teragrid.org
>      45  18953 1024     get_configuration: unique for
> tg-c052.uc.teragrid.org: tg-c052.uc.teragrid.org
>      46  18953 1024     requesting global and tg-c052.uc.teragrid.org
>      47  18953 1024     reresolve port timeout in 600
>      48  18953 1024     returning cached port value: 10111
>      49  18953 1024     ../libs/gdi/sge_any_request.c 571 received from:
> a6.339.sc04.org,1
>      50  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "log_warning" for loglevel
>      51  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "/home/olson/SGE/default/spool" for execd_spool_dir
>      52  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "/usr/bin/mail" for mailer
>      53  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "/usr/bin/X11/xterm" for xterm
>      54  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> load_sensor
>      55  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "/home/olson/SGE/transfer-prolog $fs_stdin_file_staging $fs_stdin_host
> $fs_stdin_path $fs_stdin_tmp_p" for prolog
>      56  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "/home/olson/SGE/transfer-epilog $fs_stdout_file_staging
> $fs_stdout_host $fs_stdout_path $fs_stdout_t" for epilog
>      57  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "posix_compliant" for shell_start_mode
>      58  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "sh,ksh,csh,tcsh" for login_shells
>      59  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for
> min_uid
>      60  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for
> min_gid
>      61  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "20000-20100" for gid_range
>      62  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "00:00:40"
> for load_report_time
>      63  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "false" for
> enforce_project
>      64  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "auto" for
> enforce_user
>      65  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "00:05:00"
> for max_unheard
>      66  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "log_warning" for loglevel
>      67  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> administrator_mail
>      68  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> set_token_cmd
>      69  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> pag_cmd
>      70  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> token_extend_time
>      71  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> shepherd_cmd
>      72  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> qmaster_params
>      73  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> execd_params
>      74  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "accounting=true reporting=false flush_time=00:00:15 joblog=false
> sharelog=00:00:00" for reporting_params
>      75  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "100" for
> finished_jobs
>      76  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "/usr/libexec/telnetd" for qlogin_daemon
>      77  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "telnet" for
> qlogin_command
>      78  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> rsh_daemon
>      79  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> rsh_command
>      80  18953 1024     ../libs/sgeobj/sge_conf.c 294 using
> "/usr/libexec/rlogind" for rlogin_daemon
>      81  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> rlogin_command
>      82  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "00:00:00"
> for reschedule_unknown
>      83  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "2000" for
> max_aj_instances
>      84  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "75000" for
> max_aj_tasks
>      85  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for
> max_u_jobs
>      86  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for
> max_jobs
>      87  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for
> reprioritize
>      88  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for
> auto_user_oticket
>      89  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "0" for
> auto_user_fshare
>      90  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "none" for
> auto_user_default_project
>      91  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "100" for
> auto_user_delete_time
>      92  18953 1024     ../libs/sgeobj/sge_conf.c 294 using "true" for
> delegated_file_staging
>      93  18953 1024     conf.execd_spool_dir
>  >/home/olson/SGE/default/spool<
>      94  18953 1024     conf.mailer                 >/usr/bin/mail<
>      95  18953 1024     conf.prolog
>  >/home/olson/SGE/transfer-prolog $fs_stdin_file_staging $fs_stdin_host
> $fs_stdin_path $fs_stdin_tmp_path<
>      96  18953 1024     conf.epilog
>  >/home/olson/SGE/transfer-epilog $fs_stdout_file_staging
> $fs_stdout_host $fs_stdout_path $fs_stdout_tmp_path
> $fs_stderr_file_staging $fs_stderr_host $fs_stderr_path
> $fs_stderr_tmp_path<
>      97  18953 1024     conf.shell_start_mode       >posix_compliant<
>      98  18953 1024     conf.login_shells           >sh,ksh,csh,tcsh<
>      99  18953 1024     conf.administrator_mail     >none<
>     100  18953 1024     conf.min_gid                >0<
>     101  18953 1024     conf.min_uid                >0<
>     102  18953 1024     conf.load_report_time       >40<
>     103  18953 1024     conf.max_unheard            >300<
>     104  18953 1024     conf.loglevel               >4<
>     105  18953 1024     conf.xterm                  >/usr/bin/X11/xterm<
>     106  18953 1024     conf.load_sensor            >none<
>     107  18953 1024     conf.enforce_project        >false<
>     108  18953 1024     conf.enforce_user           >auto<
>     109  18953 1024     conf.set_token_cmd          >none<
>     110  18953 1024     conf.pag_cmd                >none<
>     111  18953 1024     conf.token_extend_time      >0<
>     112  18953 1024     conf.shepherd_cmd           >none<
>     113  18953 1024     conf.qmaster_params         >none<
>     114  18953 1024     conf.execd_params           >none<
>     115  18953 1024     conf.gid_range              >20000-20100<
>     116  18953 1024     conf.zombie_jobs            >100<
>     117  18953 1024     conf.qlogin_daemon
>  >/usr/libexec/telnetd<
>     118  18953 1024     conf.qlogin_command         >telnet<
>     119  18953 1024     conf.rsh_daemon             >none<
>     120  18953 1024     conf.rsh_command            >none<
>     121  18953 1024     conf.rlogin_daemon
>  >/usr/libexec/rlogind<
>     122  18953 1024     conf.rlogin_command         >none<
>     123  18953 1024     conf.reschedule_unknown     >0<
>     124  18953 1024     conf.max_aj_instances       >2000<
>     125  18953 1024     conf.max_aj_tasks           >75000<
>     126  18953 1024     conf.max_u_jobs             >0<
>     127  18953 1024     conf.max_jobs               >0<
>     128  18953 1024     conf.reprioritize           >0<
>     129  18953 1024     conf.auto_user_oticket      >0<
>     130  18953 1024     conf.auto_user_fshare       >0<
>     131  18953 1024     conf.auto_user_default_project >none<
>     132  18953 1024     conf.auto_user_delete_time  >100<
>     133  18953 1024     conf.delegated_file_staging >true<
>     134  18953 1024     me.qualified_hostname: tg-c052.uc.teragrid.org
>     135  18953 1024     chdir("/")----------------------------
>     136  18953 1024     Making directories----------------------------
>     137  18953 1024
> chdir("/home/olson/SGE/default/spool")----------------------------
>     138  18953 1024
> chdir("tg-c052",me.unqualified_hostname)--------------------------
>     139  18953 1024     Making directories----------------------------
>     140  18953 1024     use_qidle: 0
>     141  18953 1024     ---> 0.000000 0.000000 0.000000 - 0
>     142  18953 1024     *****Checking In With qmaster*****
>     143  18953 1024     reresolve port timeout in 600
>     144  18953 1024     returning cached port value: 10111
>     145  18953 1024     ../libs/gdi/sge_any_request.c 571 received from:
> a6.339.sc04.org,1
>     146  18953 1024     ../daemons/common/shutdown.c 65 starting up 6.0u1
>     147  18953 1024     ../daemons/execd/execd.c 223 User 'root' did not
> start the application
>     148  18953 1024     ../daemons/execd/execd.c 231 successfully
> started PDC and PTF
>
>     149  18953 1024     ../daemons/execd/reaper_execd.c 1224 checking
> for old jobs
>     150  18953 1024     ../daemons/execd/reaper_execd.c 1245 no old jobs
> at startup
>     151  18953 1024     ALIVE TEST OF MASTER
>     152  18953 1024     ../libs/gdi/sge_any_request.c 688 qmaster is
> still running
>     153  18953 1024     ../libs/gdi/sge_any_request.c 693 endpoint is up
> since 15 seconds and has status 0
>     154  18953 1024     SENDING LOAD AND REPORTS
>     155  18953 1024      REPORT_JOB
>     156  18953 1024     reresolve port timeout in 600
>     157  18953 1024     returning cached port value: 10111
>     158  18953 1024     ../libs/gdi/sge_security.c 370 fromcommproc is
> empty string
>     159  18953 1024     ../libs/gdi/sge_security.c 389 standard gdi
> request to qmaster
>     160  18953 1024     receive_message_cach_n_ack() returns: got no
> message (//0)
>     161  18953 1024     No jobs to start
>     162  18953 1024     ALIVE TEST OF MASTER
>     163  18953 1024     ../libs/gdi/sge_any_request.c 688 qmaster is
> still running
>     164  18953 1024     ../libs/gdi/sge_any_request.c 693 endpoint is up
> since 16 seconds and has status 0
>     165  18953 1024     SENDING LOAD AND REPORTS
>     166  18953 1024      REPORT_LOAD
>     167  18953 1024     ---> 0.000000 0.000000 0.000000 - 0
>     168  18953 1024      REPORT_CONF
>     169  18953 1024      REPORT_LICENSE
>     170  18953 1024      REPORT_JOB
>     171  18953 1024     reresolve port timeout in 598
>     172  18953 1024     returning cached port value: 10111
>     173  18953 1024     ----> was_communication_error: no error happened
> (1000)
>     174  18953 1024     ====================[ DISPATCH EPOCH
> ]===========================
>     175  18953 1024     ../libs/gdi/sge_security.c 370 fromcommproc is
> empty string
>     176  18953 1024     ../libs/gdi/sge_security.c 389 standard gdi
> request to qmaster
>     177  18953 1024     receive_message_cach_n_ack() returns: got no
> message (//0)
>     178  18953 1024     ====================[ DISPATCH EPOCH
> ]===========================
>     179  18953 1024     ../libs/gdi/sge_security.c 370 fromcommproc is
> empty string
>     180  18953 1024     ../libs/gdi/sge_security.c 389 standard gdi
> request to qmaster
>     181  18953 1024     receive_message_cach_n_ack() returns: got no
> message (//0)
>     182  18953 1024     ====================[ DISPATCH EPOCH
> ]===========================
>     183  18953 1024     ../libs/gdi/sge_security.c 370 fromcommproc is
> empty string
>     184  18953 1024     ../libs/gdi/sge_security.c 389 standard gdi
> request to qmaster
>     185  18953 1024     receive_message_cach_n_ack() returns: got no
> message (//0)
>     186  18953 1024     ====================[ DISPATCH EPOCH
> ]===========================
>     187  18953 1024     ../daemons/common/shutdown.c 96 controlled
> shutdown 6.0u1
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
