[GE users] sge 6.0u7 qrsh taking a long time to dispatch

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Tue Jan 3 14:14:13 GMT 2006


Hi Kirk,

Could you do the same monitoring with qping -dump and start a qrsh job?

That would give us important timing information.

Thanks,
Stephan

Kirk Patton wrote on 12/30/05 19:45:

>Hello all,
>
>I have a test cluster set up that is running sge 6.0u7. I am noticing that qrsh jobs hang
>for an unusually long time before they get started on a host: somewhere between
>45 seconds and a minute.
>
>I ran the submission under debug level 3, and the client polls several times before the
>job starts.
>
>Has anyone seen anything like this?
>
>I am attaching the debug output.
>
>Thanks,
>Kirk
>
>
>     0   2703 182894037984     ****** starting localization procedure ... **********
>     1   2703 182894037984     could not get environment variable "GRIDPACKAGE"
>     2   2703 182894037984     could not get environment variable "GRIDLOCALEDIR"
>     3   2703 182894037984     setlocale() returns "C"
>     4   2703 182894037984     locale directory: >/transmeta/sge/n1ge6-u7/locale<
>     5   2703 182894037984     package file:     >lx24-amd64/gridengine.mo<
>     6   2703 182894037984     language (LANG):  >C<
>     7   2703 182894037984     loading message file: /transmeta/sge/n1ge6-u7/locale/C/LC_MESSAGES/lx24-amd64/gridengine.mo
>     8   2703 182894037984     could not open message file - error
>     9   2703 182894037984     setlocale() returns "C"
>    10   2703 182894037984     bindtextdomain() returns "/transmeta/sge/n1ge6-u7/locale"
>    11   2703 182894037984     textdomain() returns "lx24-amd64/gridengine"
>    12   2703 182894037984     error id output     : disabled
>    13   2703 182894037984     ****** starting localization procedure ... failed  **
>    14   2703 182894037984     Getting host by name - Linux
>    15   2703 182894037984     1 names in h_addr_list
>    16   2703 182894037984     0 names in h_aliases
>    17   2703 182894037984     me.who                      >14<
>    18   2703 182894037984     me.sge_formal_prog_name     >qrsh<
>    19   2703 182894037984     me.qualified_hostname       >captain.transmeta.com<
>    20   2703 182894037984     me.unqualified_hostname     >captain<
>    21   2703 182894037984     me.uid                      >1660<
>    22   2703 182894037984     me.gid                      >1660<
>    23   2703 182894037984     me.daemonized               >0<
>    24   2703 182894037984     me.user_name                >kpatton<
>    25   2703 182894037984     me.default_cell             >default<
>    26   2703 182894037984     sge_root            >/transmeta/sge/n1ge6-u7<
>    27   2703 182894037984     cell_root           >/transmeta/sge/n1ge6-u7/default<
>    28   2703 182894037984     conf_file           >/transmeta/sge/n1ge6-u7/default/common/bootstrap<
>    29   2703 182894037984     bootstrap_file      >/transmeta/sge/n1ge6-u7/default/common/configuration<
>    30   2703 182894037984     act_qmaster_file    >/transmeta/sge/n1ge6-u7/default/common/act_qmaster<
>    31   2703 182894037984     acct_file           >/transmeta/sge/n1ge6-u7/default/common/accounting<
>    32   2703 182894037984     reporting_file      >/transmeta/sge/n1ge6-u7/default/common/reporting<
>    33   2703 182894037984     local_conf_dir      >/transmeta/sge/n1ge6-u7/default/common/local_conf<
>    34   2703 182894037984     shadow_masters_file >/transmeta/sge/n1ge6-u7/default/common/shadow_masters<
>    35   2703 182894037984     admin_user          >none<
>    36   2703 182894037984     default_domain      >none<
>    37   2703 182894037984     ignore_fqdn         >true<
>    38   2703 182894037984     spooling_method     >classic<
>    39   2703 182894037984     spooling_lib        >libspoolc<
>    40   2703 182894037984     spooling_params     >/transmeta/sge/n1ge6-u7/default/common;/var/gridware/spool/transmeta<
>    41   2703 182894037984     binary_path         >/transmeta/sge/n1ge6-u7/bin<
>    42   2703 182894037984     qmaster_spool_dir   >/var/gridware/spool/transmeta<
>    43   2703 182894037984     security_mode        >afs<
>    44   2703 182894037984     (re-)reading act_qmaster file. Got master host "sge-master1.transmeta.com"
>    45   2703 182894037984     ../libs/gdi/sge_any_request.c 515 starting up communication without threads
>    46   2703 182894037984     Getting host by name - Linux
>    47   2703 182894037984     1 names in h_addr_list
>    48   2703 182894037984     0 names in h_aliases
>    49   2703 182894037984     me.qualified_hostname: captain.transmeta.com
>    50   2703 182894037984     secure dummy string: AIMK_SECURE_OPTION_ENABLED
>    51   2703 182894037984     creating GDI handle
>    52   2703 182894037984     returning port value: 536
>    53   2703 182894037984     -- defaults file: /transmeta/sge/n1ge6-u7/default/common/sge_request
>    54   2703 182894037984     directive prefix = ""
>    55   2703 182894037984     -- defaults file /home/kpatton/.sge_request does not exist
>    56   2703 182894037984     -- defaults file /var/gridware/spool/transmeta/captain/.sge_request does not exist
>    57   2703 182894037984     "-q all.q@captain"
>    58   2703 182894037984     ===hostname===
>    59   2703 182894037984     Path Alias: ># (c) 2004 Sun Microsystems, Inc. Use is subject to license terms.  <
>    60   2703 182894037984     Path Alias: >#<
>    61   2703 182894037984     Path Alias: ># Template Grid Engine path aliasing configuration file<
>    62   2703 182894037984     Path Alias: >#<
>    63   2703 182894037984     Path Alias: ># The following entry aliases physical address as generated by automounter<
>    64   2703 182894037984     Path Alias: ># (with a leading /tmp_mnt) to the logical path (w/o leading /tmp_mnt).<
>    65   2703 182894037984     Path Alias: >#<
>    66   2703 182894037984     Path Alias: ># subm_dir	subm_host	exec_host	path_replacement<
>    67   2703 182894037984     Path Alias: >/tmp_mnt/	*		      *		      /<
>    68   2703 182894037984     get_configuration: unique for captain.transmeta.com: captain.transmeta.com
>    69   2703 182894037984     requesting global and captain.transmeta.com
>    70   2703 182894037984     packing SGE_GDI_GET request
>    71   2703 182894037984     packing SGE_GDI_GET request
>    72   2703 182894037984     reresolve port timeout in 600
>    73   2703 182894037984     returning cached port value: 536
>    74   2703 182894037984     Getting host by name - Linux
>    75   2703 182894037984     1 names in h_addr_list
>    76   2703 182894037984     0 names in h_aliases
>    77   2703 182894037984     send request with id 1
>    78   2703 182894037984     unpacking SGE_GDI_GET request
>    79   2703 182894037984     in: request_id=1, sequence_id=1, target=10, op=1
>    80   2703 182894037984     out: request_id=1, sequence_id=1, target=10, op=1
>    81   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "log_warning" for loglevel
>    82   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "/transmeta/sge/n1ge6-u7/default/spool" for execd_spool_dir
>    83   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "/bin/mail" for mailer
>    84   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "/usr/bin/X11/xterm" for xterm
>    85   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "none" for load_sensor
>    86   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "none" for prolog
>    87   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "none" for epilog
>    88   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "posix_compliant" for shell_start_mode
>    89   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "sh,ksh,csh,tcsh" for login_shells
>    90   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "0" for min_uid
>    91   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "0" for min_gid
>    92   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "20000-20100" for gid_range
>    93   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "00:00:40" for load_report_time
>    94   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "false" for enforce_project
>    95   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "auto" for enforce_user
>    96   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "00:05:00" for max_unheard
>    97   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "log_warning" for loglevel
>    98   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "lsfadmin at transmeta.com" for administrator_mail
>    99   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "/transmeta/sge/n1ge6-u7/transmeta/scripts/token" for set_token_cmd
>   100   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "/usr/afsws/bin/pagsh" for pag_cmd
>   101   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "24:0:0" for token_extend_time
>   102   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "none" for shepherd_cmd
>   103   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "none" for qmaster_params
>   104   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "none" for execd_params
>   105   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "accounting=true reporting=false flush_time=00:00:15 joblog=false sharelog=00:00:00" for reporting_params
>   106   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "100" for finished_jobs
>   107   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "none" for qlogin_daemon
>   108   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "none" for qlogin_command
>   109   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "/usr/sbin/sshd -i" for rsh_daemon
>   110   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "/usr/bin/ssh -t" for rsh_command
>   111   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "/usr/sbin/sshd -i" for rlogin_daemon
>   112   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "/usr/bin/ssh -t" for rlogin_command
>   113   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "00:00:00" for reschedule_unknown
>   114   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "2000" for max_aj_instances
>   115   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "75000" for max_aj_tasks
>   116   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "0" for max_u_jobs
>   117   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "0" for max_jobs
>   118   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "0" for reprioritize
>   119   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "0" for auto_user_oticket
>   120   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "0" for auto_user_fshare
>   121   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "none" for auto_user_default_project
>   122   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "86400" for auto_user_delete_time
>   123   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using "false" for delegated_file_staging
>   124   2703 182894037984     Everything ok
>   125   2703 182894037984     qrsh will listen on port 38587
>   126   2703 182894037984     B E F O R E     S E N D I N G! ! ! ! ! ! ! ! ! ! ! ! ! !
>   127   2703 182894037984     =====================================================
>   128   2703 182894037984     packing SGE_GDI_ADD request
>   129   2703 182894037984     packing SGE_GDI_ADD request
>   130   2703 182894037984     reresolve port timeout in 599
>   131   2703 182894037984     returning cached port value: 536
>   132   2703 182894037984     send request with id 2
>   133   2703 182894037984     unpacking SGE_GDI_ADD request
>   134   2703 182894037984     in: request_id=2, sequence_id=1, target=5, op=258
>   135   2703 182894037984     out: request_id=2, sequence_id=1, target=5, op=258
>   136   2703 182894037984     ../clients/qsh/qsh.c 1705 your job 73 ("hostname") has been submitted
>   137   2703 182894037984     job id is: 73
>   138   2703 182894037984     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>   139   2703 182894037984     ============================================
>   140   2703 182894037984     random polling set to 3
>   141   2703 182894037984     packing SGE_GDI_GET request
>   142   2703 182894037984     packing SGE_GDI_GET request
>   143   2703 182894037984     reresolve port timeout in 596
>   144   2703 182894037984     returning cached port value: 536
>   145   2703 182894037984     send request with id 1
>   146   2703 182894037984     unpacking SGE_GDI_GET request
>   147   2703 182894037984     in: request_id=3, sequence_id=1, target=5, op=1
>   148   2703 182894037984     out: request_id=3, sequence_id=1, target=5, op=1
>   149   2703 182894037984     Job Status is: 0 (unenrolled)
>   150   2703 182894037984     polling_interval set to 6
>   151   2703 182894037984     random polling set to 8
>   152   2703 182894037984     packing SGE_GDI_GET request
>   153   2703 182894037984     packing SGE_GDI_GET request
>   154   2703 182894037984     reresolve port timeout in 588
>   155   2703 182894037984     returning cached port value: 536
>   156   2703 182894037984     send request with id 1
>   157   2703 182894037984     unpacking SGE_GDI_GET request
>   158   2703 182894037984     in: request_id=4, sequence_id=1, target=5, op=1
>   159   2703 182894037984     out: request_id=4, sequence_id=1, target=5, op=1
>   160   2703 182894037984     Job Status is: 0 (unenrolled)
>   161   2703 182894037984     polling_interval set to 12
>   162   2703 182894037984     random polling set to 17
>   163   2703 182894037984     packing SGE_GDI_GET request
>   164   2703 182894037984     packing SGE_GDI_GET request
>   165   2703 182894037984     reresolve port timeout in 571
>   166   2703 182894037984     returning cached port value: 536
>   167   2703 182894037984     send request with id 1
>   168   2703 182894037984     unpacking SGE_GDI_GET request
>   169   2703 182894037984     in: request_id=5, sequence_id=1, target=5, op=1
>   170   2703 182894037984     out: request_id=5, sequence_id=1, target=5, op=1
>   171   2703 182894037984     Job Status is: 0 (unenrolled)
>   172   2703 182894037984     polling_interval set to 24
>   173   2703 182894037984     random polling set to 27
>   174   2703 182894037984     packing SGE_GDI_GET request
>   175   2703 182894037984     packing SGE_GDI_GET request
>   176   2703 182894037984     reresolve port timeout in 543
>   177   2703 182894037984     returning cached port value: 536
>   178   2703 182894037984     send request with id 1
>   179   2703 182894037984     unpacking SGE_GDI_GET request
>   180   2703 182894037984     in: request_id=6, sequence_id=1, target=5, op=1
>   181   2703 182894037984     out: request_id=6, sequence_id=1, target=5, op=1
>   182   2703 182894037984     Job Status is: 0 (unenrolled)
>   183   2703 182894037984     polling_interval set to 48
>   184   2703 182894037984     random polling set to 87
>   185   2703 182894037984     accepted client connection, fd = 3
>   186   2703 182894037984     qlogin_starter sent: 0:38605:/transmeta/sge/n1ge6-u7/utilbin/lx24-amd64:/transmeta/sge/n1ge6-u7/default/spool/captain/active_jobs/73.1:captain.transmeta.com
>captain.transmeta.com
>Connection to captain.transmeta.com closed.
>   187   2703 182894037984     accepted client connection, fd = 3
>-------------------------------
>JB_job_number        (Ulong)     = 0
>JB_job_name          (String)  * = hostname
>JB_version           (Ulong)     = 0
>JB_jid_request_list  (List)      = empty
>JB_jid_predecessor_l (List)      = empty
>JB_jid_sucessor_list (List)      = empty
>JB_session           (String)    = (null)
>JB_project           (String)    = (null)
>JB_department        (String)    = (null)
>JB_directive_prefix  (String)    = (null)
>JB_exec_file         (String)    = (null)
>JB_script_file       (String)  * = hostname
>JB_script_size       (Ulong)     = 0
>JB_script_ptr        (String)    = (null)
>JB_submission_time   (Ulong)   * = 1135962760
>JB_execution_time    (Ulong)     = 0
>JB_deadline          (Ulong)     = 0
>JB_owner             (String)  * = kpatton
>JB_uid               (Ulong)   * = 1660
>JB_group             (String)    = (null)
>JB_gid               (Ulong)     = 0
>JB_account           (String)    = (null)
>JB_cwd               (String)    = (null)
>JB_notify            (Bool)      = false
>JB_type              (Ulong)   * = 73
>JB_reserve           (Bool)      = false
>JB_priority          (Ulong)   * = 1024
>JB_jobshare          (Ulong)     = 0
>JB_shell_list        (List)      = empty
>JB_verify            (Ulong)     = 0
>JB_env_list          (List)    * = full {
>
>   List: <job_sublist> * #Elements: 7
>   -------------------------------
>   VA_variable          (String)  * = __SGE_PREFIX__O_HOME
>   VA_value             (String)  * = /home/kpatton
>   -------------------------------
>   VA_variable          (String)  * = __SGE_PREFIX__O_LOGNAME
>   VA_value             (String)  * = kpatton
>   -------------------------------
>   VA_variable          (String)  * = __SGE_PREFIX__O_PATH
>   VA_value             (String)  * = /transmeta/sge/n1ge6-u7/bin/lx24-amd64:/transmeta/sge/n1ge6-u7/bin/lx24-amd64:/transmeta/sge/n1ge6-u7/bin/lx24-amd64:/transmeta/sge/n1ge6-u6/transmeta/scripts:/transmeta/sge/n1ge6-u6/bin/lx24-amd64:/opt/modules/3.1.6/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/sbin:/sbin:/usr/local/lsf/bin:/home/kpatton/scripts
>   -------------------------------
>   VA_variable          (String)  * = __SGE_PREFIX__O_SHELL
>   VA_value             (String)  * = /bin/tcsh
>   -------------------------------
>   VA_variable          (String)  * = __SGE_PREFIX__O_MAIL
>   VA_value             (String)  * = /var/mail/kpatton
>   -------------------------------
>   VA_variable          (String)  * = __SGE_PREFIX__O_HOST
>   VA_value             (String)  * = captain.transmeta.com
>   -------------------------------
>   VA_variable          (String)  * = __SGE_PREFIX__O_WORKDIR
>   VA_value             (String)  * = /var/gridware/spool/transmeta/captain
>}
>JB_context           (List)      = empty
>JB_job_args          (List)      = empty
>JB_checkpoint_attr   (Ulong)     = 0
>JB_checkpoint_name   (String)    = (null)
>JB_checkpoint_object (Object)    = none
>JB_checkpoint_interv (Ulong)     = 0
>JB_restart           (Ulong)   * = 2
>JB_stdout_path_list  (List)      = empty
>JB_stderr_path_list  (List)      = empty
>JB_stdin_path_list   (List)      = empty
>JB_merge_stderr      (Bool)      = false
>JB_hard_resource_lis (List)      = empty
>JB_soft_resource_lis (List)      = empty
>JB_hard_queue_list   (List)    * = full {
>
>   List: <destin_ident_list> * #Elements: 1
>   -------------------------------
>   QR_name              (String)  * = all.q@captain
>}
>JB_soft_queue_list   (List)      = empty
>JB_mail_options      (Ulong)     = 0
>JB_mail_list         (List)    * = full {
>
>   List: <> * #Elements: 1
>   -------------------------------
>   MR_user              (String)  * = kpatton
>   MR_host              (Host)    * = captain.transmeta.com
>}
>JB_pe                (String)    = (null)
>JB_pe_range          (List)      = empty
>JB_master_hard_queue (List)      = empty
>JB_tgt               (String)    = (null)
>JB_cred              (String)    = (null)
>JB_ja_structure      (List)    * = full {
>
>   List: <task_id_range> * #Elements: 1
>   -------------------------------
>   RN_min               (Ulong)   * = 1
>   RN_max               (Ulong)   * = 1
>   RN_step              (Ulong)   * = 1
>}
>JB_ja_n_h_ids        (List)    * = full {
>
>   List: <task_id_range> * #Elements: 1
>   -------------------------------
>   RN_min               (Ulong)   * = 1
>   RN_max               (Ulong)   * = 1
>   RN_step              (Ulong)   * = 1
>}
>JB_ja_u_h_ids        (List)      = empty
>JB_ja_s_h_ids        (List)      = empty
>JB_ja_o_h_ids        (List)      = empty
>JB_ja_z_ids          (List)      = empty
>JB_ja_template       (List)      = empty
>JB_ja_tasks          (List)      = empty
>JB_host              (Host)      = (null)
>JB_category          (Ref)       = (nil)
>JB_user_list         (List)      = empty
>JB_job_identifier_li (List)      = empty
>JB_job_source        (String)    = (null)
>JB_verify_suitable_q (Ulong)     = 0
>JB_nrunning          (Ulong)     = 0
>JB_soft_wallclock_gm (Ulong)     = 0
>JB_hard_wallclock_gm (Ulong)     = 0
>JB_override_tickets  (Ulong)     = 0
>JB_qs_args           (List)      = empty
>JB_path_aliases      (List)      = empty
>JB_urg               (Double)    = 0.000000
>JB_nurg              (Double)    = 0.000000
>JB_nppri             (Double)    = 0.000000
>JB_rrcontr           (Double)    = 0.000000
>JB_dlcontr           (Double)    = 0.000000
>JB_wtcontr           (Double)    = 0.000000
>
>  
>
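
The quoted debug output already carries rough timing information. The sketch below is a minimal illustration in Python, not part of the original exchange. It assumes that the "reresolve port timeout in N" value counts down in seconds and that "random polling set to N" is the number of seconds the client waits before its next status query; neither is confirmed in the thread. The LOG string is a hand-copied excerpt of the message text from the trace above, and the helper name numbers_after is made up for this example.

```python
import re

# Message text copied from the quoted debug trace, in log order
# (the sequence number, pid and thread id columns are dropped).
LOG = """\
reresolve port timeout in 599
random polling set to 3
reresolve port timeout in 596
Job Status is: 0 (unenrolled)
polling_interval set to 6
random polling set to 8
reresolve port timeout in 588
Job Status is: 0 (unenrolled)
polling_interval set to 12
random polling set to 17
reresolve port timeout in 571
Job Status is: 0 (unenrolled)
polling_interval set to 24
random polling set to 27
reresolve port timeout in 543
Job Status is: 0 (unenrolled)
polling_interval set to 48
random polling set to 87
accepted client connection, fd = 3
"""


def numbers_after(prefix, log):
    """Return every integer that directly follows `prefix` in the log text."""
    return [int(n) for n in re.findall(re.escape(prefix) + r" (\d+)", log)]


countdown = numbers_after("reresolve port timeout in", LOG)
waits = numbers_after("random polling set to", LOG)
intervals = numbers_after("polling_interval set to", LOG)

# Assumption: the countdown is in seconds, so its drop between job submission
# and the last status query approximates the time spent polling.
elapsed = countdown[0] - countdown[-1]

# The last wait (87) was cut short by the incoming connection, so only the
# waits before it are counted as completed.
completed_waits = sum(waits[:-1])

# True if each polling interval is twice the previous one.
doubling = all(b == 2 * a for a, b in zip(intervals, intervals[1:]))

print("elapsed according to countdown:", elapsed)      # 56
print("sum of completed waits:", completed_waits)      # 55
print("polling interval doubles each round:", doubling)
```

Under these assumptions, roughly 55 seconds pass between submission and the last status query, which is consistent with the reported delay of 45 seconds to a minute. The job is still reported as "unenrolled" at every poll, so the trace alone does not show what the qmaster or scheduler was doing during that time.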

