[GE users] sge 6.0u7 qrsh taking a long time to dispatch

Reuti reuti at staff.uni-marburg.de
Fri Dec 30 19:15:07 GMT 2005


Hi Kirk,

What are your settings for schedule_interval and flush_submit_sec in the
scheduler configuration?

-- Reuti
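
For readers following along: the two parameters asked about above are part of the scheduler configuration, which `qconf -ssconf` prints. The sketch below is only an illustration of how one might pull them out of that output and estimate how long a freshly submitted job could wait for a scheduling run. The sample text and its values are made up (they are not from Kirk's cluster), `parse_sconf`, `to_seconds` and `worst_case_wait` are hypothetical helper names, and the wait estimate is a rough model that assumes flush_submit_sec > 0 triggers a scheduling run that many seconds after submission, while 0 means the job waits for the next regular interval.

```python
# Illustrative text in the shape printed by `qconf -ssconf`.
# The values are assumptions for the example, not Kirk's real settings.
SAMPLE_SCONF = """\
algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
flush_submit_sec                  0
flush_finish_sec                  0
"""


def parse_sconf(text):
    """Return the scheduler configuration as a {parameter: value} dict."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # parameter name, then the rest of the line
        if len(parts) == 2 and not parts[0].startswith("#"):
            settings[parts[0]] = parts[1].strip()
    return settings


def to_seconds(hms):
    """Convert an h:m:s string (or a plain number of seconds) to seconds."""
    total = 0
    for field in hms.split(":"):
        total = total * 60 + int(field)
    return total


def worst_case_wait(interval, flush):
    """Rough upper bound, in seconds, until the next scheduling run.

    Assumed model: flush > 0 triggers a run `flush` seconds after
    submission; flush == 0 leaves the job waiting for the next interval.
    """
    return min(flush, interval) if flush > 0 else interval


conf = parse_sconf(SAMPLE_SCONF)
interval = to_seconds(conf["schedule_interval"])
flush = int(conf["flush_submit_sec"])
print("schedule_interval:", interval, "s")
print("flush_submit_sec: ", flush, "s")
print("rough worst-case wait for a scheduling run:",
      worst_case_wait(interval, flush), "s")
```

On a live cluster the text would come from running `qconf -ssconf` instead of the sample string. The estimate covers only the wait for the scheduler, not the time qrsh itself spends between status polls.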


On 30.12.2005 at 19:45, Kirk Patton wrote:

> Hello all,
>
> I have a test cluster set up that is running SGE 6.0u7. I am noticing
> that qrsh jobs hang for an unusually long time before they start on a
> host: somewhere between 45 seconds and a minute.
>
> I ran the submission under debug level 3, and the client polls several
> times before the job starts.
>
> Has anyone seen anything like this?
>
> I am attaching the debug output.
>
> Thanks,
> Kirk
>
>
>      0   2703 182894037984     ****** starting localization  
> procedure ... **********
>      1   2703 182894037984     could not get environment variable  
> "GRIDPACKAGE"
>      2   2703 182894037984     could not get environment variable  
> "GRIDLOCALEDIR"
>      3   2703 182894037984     setlocale() returns "C"
>      4   2703 182894037984     locale directory: >/transmeta/sge/ 
> n1ge6-u7/locale<
>      5   2703 182894037984     package file:     >lx24-amd64/ 
> gridengine.mo<
>      6   2703 182894037984     language (LANG):  >C<
>      7   2703 182894037984     loading message file: /transmeta/sge/ 
> n1ge6-u7/locale/C/LC_MESSAGES/lx24-amd64/gridengine.mo
>      8   2703 182894037984     could not open message file - error
>      9   2703 182894037984     setlocale() returns "C"
>     10   2703 182894037984     bindtextdomain() returns "/transmeta/ 
> sge/n1ge6-u7/locale"
>     11   2703 182894037984     textdomain() returns "lx24-amd64/ 
> gridengine"
>     12   2703 182894037984     error id output     : disabled
>     13   2703 182894037984     ****** starting localization  
> procedure ... failed  **
>     14   2703 182894037984     Getting host by name - Linux
>     15   2703 182894037984     1 names in h_addr_list
>     16   2703 182894037984     0 names in h_aliases
>     17   2703 182894037984     me.who                      >14<
>     18   2703 182894037984     me.sge_formal_prog_name     >qrsh<
>     19   2703 182894037984     me.qualified_hostname        
> >captain.transmeta.com<
>     20   2703 182894037984     me.unqualified_hostname     >captain<
>     21   2703 182894037984     me.uid                      >1660<
>     22   2703 182894037984     me.gid                      >1660<
>     23   2703 182894037984     me.daemonized               >0<
>     24   2703 182894037984     me.user_name                >kpatton<
>     25   2703 182894037984     me.default_cell             >default<
>     26   2703 182894037984     sge_root            >/transmeta/sge/ 
> n1ge6-u7<
>     27   2703 182894037984     cell_root           >/transmeta/sge/ 
> n1ge6-u7/default<
>     28   2703 182894037984     conf_file           >/transmeta/sge/ 
> n1ge6-u7/default/common/bootstrap<
>     29   2703 182894037984     bootstrap_file      >/transmeta/sge/ 
> n1ge6-u7/default/common/configuration<
>     30   2703 182894037984     act_qmaster_file    >/transmeta/sge/ 
> n1ge6-u7/default/common/act_qmaster<
>     31   2703 182894037984     acct_file           >/transmeta/sge/ 
> n1ge6-u7/default/common/accounting<
>     32   2703 182894037984     reporting_file      >/transmeta/sge/ 
> n1ge6-u7/default/common/reporting<
>     33   2703 182894037984     local_conf_dir      >/transmeta/sge/ 
> n1ge6-u7/default/common/local_conf<
>     34   2703 182894037984     shadow_masters_file >/transmeta/sge/ 
> n1ge6-u7/default/common/shadow_masters<
>     35   2703 182894037984     admin_user          >none<
>     36   2703 182894037984     default_domain      >none<
>     37   2703 182894037984     ignore_fqdn         >true<
>     38   2703 182894037984     spooling_method     >classic<
>     39   2703 182894037984     spooling_lib        >libspoolc<
>     40   2703 182894037984     spooling_params     >/transmeta/sge/ 
> n1ge6-u7/default/common;/var/gridware/spool/transmeta<
>     41   2703 182894037984     binary_path         >/transmeta/sge/ 
> n1ge6-u7/bin<
>     42   2703 182894037984     qmaster_spool_dir   >/var/gridware/ 
> spool/transmeta<
>     43   2703 182894037984     security_mode        >afs<
>     44   2703 182894037984     (re-)reading act_qmaster file. Got  
> master host "sge-master1.transmeta.com"
>     45   2703 182894037984     ../libs/gdi/sge_any_request.c 515  
> starting up communication without threads
>     46   2703 182894037984     Getting host by name - Linux
>     47   2703 182894037984     1 names in h_addr_list
>     48   2703 182894037984     0 names in h_aliases
>     49   2703 182894037984     me.qualified_hostname:  
> captain.transmeta.com
>     50   2703 182894037984     secure dummy string:  
> AIMK_SECURE_OPTION_ENABLED
>     51   2703 182894037984     creating GDI handle
>     52   2703 182894037984     returning port value: 536
>     53   2703 182894037984     -- defaults file: /transmeta/sge/ 
> n1ge6-u7/default/common/sge_request
>     54   2703 182894037984     directive prefix = ""
>     55   2703 182894037984     -- defaults file /home/ 
> kpatton/.sge_request does not exist
>     56   2703 182894037984     -- defaults file /var/gridware/spool/ 
> transmeta/captain/.sge_request does not exist
>     57   2703 182894037984     "-q all.q@captain"
>     58   2703 182894037984     ===hostname===
>     59   2703 182894037984     Path Alias: ># (c) 2004 Sun  
> Microsystems, Inc. Use is subject to license terms.  <
>     60   2703 182894037984     Path Alias: >#<
>     61   2703 182894037984     Path Alias: ># Template Grid Engine  
> path aliasing configuration file<
>     62   2703 182894037984     Path Alias: >#<
>     63   2703 182894037984     Path Alias: ># The following entry  
> aliases physical address as generated by automounter<
>     64   2703 182894037984     Path Alias: ># (with a leading / 
> tmp_mnt) to the logical path (w/o leading /tmp_mnt).<
>     65   2703 182894037984     Path Alias: >#<
>     66   2703 182894037984     Path Alias: ># subm_dir	subm_host	 
> exec_host	path_replacement<
>     67   2703 182894037984     Path Alias: >/tmp_mnt/	*		       
> *		      /<
>     68   2703 182894037984     get_configuration: unique for  
> captain.transmeta.com: captain.transmeta.com
>     69   2703 182894037984     requesting global and  
> captain.transmeta.com
>     70   2703 182894037984     packing SGE_GDI_GET request
>     71   2703 182894037984     packing SGE_GDI_GET request
>     72   2703 182894037984     reresolve port timeout in 600
>     73   2703 182894037984     returning cached port value: 536
>     74   2703 182894037984     Getting host by name - Linux
>     75   2703 182894037984     1 names in h_addr_list
>     76   2703 182894037984     0 names in h_aliases
>     77   2703 182894037984     send request with id 1
>     78   2703 182894037984     unpacking SGE_GDI_GET request
>     79   2703 182894037984     in: request_id=1, sequence_id=1,  
> target=10, op=1
>     80   2703 182894037984     out: request_id=1, sequence_id=1,  
> target=10, op=1
>     81   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "log_warning" for loglevel
>     82   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "/transmeta/sge/n1ge6-u7/default/spool" for execd_spool_dir
>     83   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "/bin/mail" for mailer
>     84   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "/usr/bin/X11/xterm" for xterm
>     85   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "none" for load_sensor
>     86   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "none" for prolog
>     87   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "none" for epilog
>     88   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "posix_compliant" for shell_start_mode
>     89   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "sh,ksh,csh,tcsh" for login_shells
>     90   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "0" for min_uid
>     91   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "0" for min_gid
>     92   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "20000-20100" for gid_range
>     93   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "00:00:40" for load_report_time
>     94   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "false" for enforce_project
>     95   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "auto" for enforce_user
>     96   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "00:05:00" for max_unheard
>     97   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "log_warning" for loglevel
>     98   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "lsfadmin at transmeta.com" for administrator_mail
>     99   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "/transmeta/sge/n1ge6-u7/transmeta/scripts/token" for set_token_cmd
>    100   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "/usr/afsws/bin/pagsh" for pag_cmd
>    101   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "24:0:0" for token_extend_time
>    102   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "none" for shepherd_cmd
>    103   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "none" for qmaster_params
>    104   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "none" for execd_params
>    105   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "accounting=true reporting=false flush_time=00:00:15 joblog=false  
> sharelog=00:00:00" for reporting_params
>    106   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "100" for finished_jobs
>    107   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "none" for qlogin_daemon
>    108   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "none" for qlogin_command
>    109   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "/usr/sbin/sshd -i" for rsh_daemon
>    110   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "/usr/bin/ssh -t" for rsh_command
>    111   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "/usr/sbin/sshd -i" for rlogin_daemon
>    112   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "/usr/bin/ssh -t" for rlogin_command
>    113   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "00:00:00" for reschedule_unknown
>    114   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "2000" for max_aj_instances
>    115   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "75000" for max_aj_tasks
>    116   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "0" for max_u_jobs
>    117   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "0" for max_jobs
>    118   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "0" for reprioritize
>    119   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "0" for auto_user_oticket
>    120   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "0" for auto_user_fshare
>    121   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "none" for auto_user_default_project
>    122   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "86400" for auto_user_delete_time
>    123   2703 182894037984     ../libs/sgeobj/sge_conf.c 367 using  
> "false" for delegated_file_staging
>    124   2703 182894037984     Everything ok
>    125   2703 182894037984     qrsh will listen on port 38587
>    126   2703 182894037984     B E F O R E     S E N D I N  
> G! ! ! ! ! ! ! ! ! ! ! ! ! !
>    127   2703 182894037984      
> =====================================================
>    128   2703 182894037984     packing SGE_GDI_ADD request
>    129   2703 182894037984     packing SGE_GDI_ADD request
>    130   2703 182894037984     reresolve port timeout in 599
>    131   2703 182894037984     returning cached port value: 536
>    132   2703 182894037984     send request with id 2
>    133   2703 182894037984     unpacking SGE_GDI_ADD request
>    134   2703 182894037984     in: request_id=2, sequence_id=1,  
> target=5, op=258
>    135   2703 182894037984     out: request_id=2, sequence_id=1,  
> target=5, op=258
>    136   2703 182894037984     ../clients/qsh/qsh.c 1705 your job  
> 73 ("hostname") has been submitted
>    137   2703 182894037984     job id is: 73
>    138   2703 182894037984     R E A D I N G    J O  
> B ! ! ! ! ! ! ! ! ! ! !
>    139   2703 182894037984      
> ============================================
>    140   2703 182894037984     random polling set to 3
>    141   2703 182894037984     packing SGE_GDI_GET request
>    142   2703 182894037984     packing SGE_GDI_GET request
>    143   2703 182894037984     reresolve port timeout in 596
>    144   2703 182894037984     returning cached port value: 536
>    145   2703 182894037984     send request with id 1
>    146   2703 182894037984     unpacking SGE_GDI_GET request
>    147   2703 182894037984     in: request_id=3, sequence_id=1,  
> target=5, op=1
>    148   2703 182894037984     out: request_id=3, sequence_id=1,  
> target=5, op=1
>    149   2703 182894037984     Job Status is: 0 (unenrolled)
>    150   2703 182894037984     polling_interval set to 6
>    151   2703 182894037984     random polling set to 8
>    152   2703 182894037984     packing SGE_GDI_GET request
>    153   2703 182894037984     packing SGE_GDI_GET request
>    154   2703 182894037984     reresolve port timeout in 588
>    155   2703 182894037984     returning cached port value: 536
>    156   2703 182894037984     send request with id 1
>    157   2703 182894037984     unpacking SGE_GDI_GET request
>    158   2703 182894037984     in: request_id=4, sequence_id=1,  
> target=5, op=1
>    159   2703 182894037984     out: request_id=4, sequence_id=1,  
> target=5, op=1
>    160   2703 182894037984     Job Status is: 0 (unenrolled)
>    161   2703 182894037984     polling_interval set to 12
>    162   2703 182894037984     random polling set to 17
>    163   2703 182894037984     packing SGE_GDI_GET request
>    164   2703 182894037984     packing SGE_GDI_GET request
>    165   2703 182894037984     reresolve port timeout in 571
>    166   2703 182894037984     returning cached port value: 536
>    167   2703 182894037984     send request with id 1
>    168   2703 182894037984     unpacking SGE_GDI_GET request
>    169   2703 182894037984     in: request_id=5, sequence_id=1,  
> target=5, op=1
>    170   2703 182894037984     out: request_id=5, sequence_id=1,  
> target=5, op=1
>    171   2703 182894037984     Job Status is: 0 (unenrolled)
>    172   2703 182894037984     polling_interval set to 24
>    173   2703 182894037984     random polling set to 27
>    174   2703 182894037984     packing SGE_GDI_GET request
>    175   2703 182894037984     packing SGE_GDI_GET request
>    176   2703 182894037984     reresolve port timeout in 543
>    177   2703 182894037984     returning cached port value: 536
>    178   2703 182894037984     send request with id 1
>    179   2703 182894037984     unpacking SGE_GDI_GET request
>    180   2703 182894037984     in: request_id=6, sequence_id=1,  
> target=5, op=1
>    181   2703 182894037984     out: request_id=6, sequence_id=1,  
> target=5, op=1
>    182   2703 182894037984     Job Status is: 0 (unenrolled)
>    183   2703 182894037984     polling_interval set to 48
>    184   2703 182894037984     random polling set to 87
>    185   2703 182894037984     accepted client connection, fd = 3
>    186   2703 182894037984     qlogin_starter sent: 0:38605:/ 
> transmeta/sge/n1ge6-u7/utilbin/lx24-amd64:/transmeta/sge/n1ge6-u7/ 
> default/spool/captain/active_jobs/73.1:captain.transmeta.com
> captain.transmeta.com
> Connection to captain.transmeta.com closed.
>    187   2703 182894037984     accepted client connection, fd = 3
> -------------------------------
> JB_job_number        (Ulong)     = 0
> JB_job_name          (String)  * = hostname
> JB_version           (Ulong)     = 0
> JB_jid_request_list  (List)      = empty
> JB_jid_predecessor_l (List)      = empty
> JB_jid_sucessor_list (List)      = empty
> JB_session           (String)    = (null)
> JB_project           (String)    = (null)
> JB_department        (String)    = (null)
> JB_directive_prefix  (String)    = (null)
> JB_exec_file         (String)    = (null)
> JB_script_file       (String)  * = hostname
> JB_script_size       (Ulong)     = 0
> JB_script_ptr        (String)    = (null)
> JB_submission_time   (Ulong)   * = 1135962760
> JB_execution_time    (Ulong)     = 0
> JB_deadline          (Ulong)     = 0
> JB_owner             (String)  * = kpatton
> JB_uid               (Ulong)   * = 1660
> JB_group             (String)    = (null)
> JB_gid               (Ulong)     = 0
> JB_account           (String)    = (null)
> JB_cwd               (String)    = (null)
> JB_notify            (Bool)      = false
> JB_type              (Ulong)   * = 73
> JB_reserve           (Bool)      = false
> JB_priority          (Ulong)   * = 1024
> JB_jobshare          (Ulong)     = 0
> JB_shell_list        (List)      = empty
> JB_verify            (Ulong)     = 0
> JB_env_list          (List)    * = full {
>
>    List: <job_sublist> * #Elements: 7
>    -------------------------------
>    VA_variable          (String)  * = __SGE_PREFIX__O_HOME
>    VA_value             (String)  * = /home/kpatton
>    -------------------------------
>    VA_variable          (String)  * = __SGE_PREFIX__O_LOGNAME
>    VA_value             (String)  * = kpatton
>    -------------------------------
>    VA_variable          (String)  * = __SGE_PREFIX__O_PATH
>    VA_value             (String)  * = /transmeta/sge/n1ge6-u7/bin/ 
> lx24-amd64:/transmeta/sge/n1ge6-u7/bin/lx24-amd64:/transmeta/sge/ 
> n1ge6-u7/bin/lx24-amd64:/transmeta/sge/n1ge6-u6/transmeta/scripts:/ 
> transmeta/sge/n1ge6-u6/bin/lx24-amd64:/opt/modules/3.1.6/bin:/usr/ 
> kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/sbin:/ 
> sbin:/usr/local/lsf/bin:/home/kpatton/scripts
>    -------------------------------
>    VA_variable          (String)  * = __SGE_PREFIX__O_SHELL
>    VA_value             (String)  * = /bin/tcsh
>    -------------------------------
>    VA_variable          (String)  * = __SGE_PREFIX__O_MAIL
>    VA_value             (String)  * = /var/mail/kpatton
>    -------------------------------
>    VA_variable          (String)  * = __SGE_PREFIX__O_HOST
>    VA_value             (String)  * = captain.transmeta.com
>    -------------------------------
>    VA_variable          (String)  * = __SGE_PREFIX__O_WORKDIR
>    VA_value             (String)  * = /var/gridware/spool/transmeta/ 
> captain
> }
> JB_context           (List)      = empty
> JB_job_args          (List)      = empty
> JB_checkpoint_attr   (Ulong)     = 0
> JB_checkpoint_name   (String)    = (null)
> JB_checkpoint_object (Object)    = none
> JB_checkpoint_interv (Ulong)     = 0
> JB_restart           (Ulong)   * = 2
> JB_stdout_path_list  (List)      = empty
> JB_stderr_path_list  (List)      = empty
> JB_stdin_path_list   (List)      = empty
> JB_merge_stderr      (Bool)      = false
> JB_hard_resource_lis (List)      = empty
> JB_soft_resource_lis (List)      = empty
> JB_hard_queue_list   (List)    * = full {
>
>    List: <destin_ident_list> * #Elements: 1
>    -------------------------------
>    QR_name              (String)  * = all.q@captain
> }
> JB_soft_queue_list   (List)      = empty
> JB_mail_options      (Ulong)     = 0
> JB_mail_list         (List)    * = full {
>
>    List: <> * #Elements: 1
>    -------------------------------
>    MR_user              (String)  * = kpatton
>    MR_host              (Host)    * = captain.transmeta.com
> }
> JB_pe                (String)    = (null)
> JB_pe_range          (List)      = empty
> JB_master_hard_queue (List)      = empty
> JB_tgt               (String)    = (null)
> JB_cred              (String)    = (null)
> JB_ja_structure      (List)    * = full {
>
>    List: <task_id_range> * #Elements: 1
>    -------------------------------
>    RN_min               (Ulong)   * = 1
>    RN_max               (Ulong)   * = 1
>    RN_step              (Ulong)   * = 1
> }
> JB_ja_n_h_ids        (List)    * = full {
>
>    List: <task_id_range> * #Elements: 1
>    -------------------------------
>    RN_min               (Ulong)   * = 1
>    RN_max               (Ulong)   * = 1
>    RN_step              (Ulong)   * = 1
> }
> JB_ja_u_h_ids        (List)      = empty
> JB_ja_s_h_ids        (List)      = empty
> JB_ja_o_h_ids        (List)      = empty
> JB_ja_z_ids          (List)      = empty
> JB_ja_template       (List)      = empty
> JB_ja_tasks          (List)      = empty
> JB_host              (Host)      = (null)
> JB_category          (Ref)       = (nil)
> JB_user_list         (List)      = empty
> JB_job_identifier_li (List)      = empty
> JB_job_source        (String)    = (null)
> JB_verify_suitable_q (Ulong)     = 0
> JB_nrunning          (Ulong)     = 0
> JB_soft_wallclock_gm (Ulong)     = 0
> JB_hard_wallclock_gm (Ulong)     = 0
> JB_override_tickets  (Ulong)     = 0
> JB_qs_args           (List)      = empty
> JB_path_aliases      (List)      = empty
> JB_urg               (Double)    = 0.000000
> JB_nurg              (Double)    = 0.000000
> JB_nppri             (Double)    = 0.000000
> JB_rrcontr           (Double)    = 0.000000
> JB_dlcontr           (Double)    = 0.000000
> JB_wtcontr           (Double)    = 0.000000
>
> -- 
> Kirk Patton
> Unix Administrator
> Transmeta Inc.
> Tel. 408 919-3055
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
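
The polling that Kirk mentions can be tallied directly from the quoted trace. The sketch below copies the numbers from the "random polling set to", "polling_interval set to" and "reresolve port timeout in" lines and adds them up. It rests on one assumption that the trace suggests but does not state: that each "random polling" value is the number of seconds qrsh sleeps before its next status query. The countdown of the "reresolve port timeout" lines is used as a cross-check of that reading.

```python
# Values copied from the "random polling set to N" lines of the trace.
random_polling = [3, 8, 17, 27, 87]

# Values copied from the "polling_interval set to N" lines.
polling_interval = [6, 12, 24, 48]

# Values copied from the "reresolve port timeout in N" lines, starting
# with the one logged while the job was being submitted.
reresolve_timeout = [599, 596, 588, 571, 543]

# The interval doubles after every poll that still finds the job unenrolled.
doubling = all(b == 2 * a
               for a, b in zip(polling_interval, polling_interval[1:]))

# The trace shows four status polls. Assumption: each one was preceded by
# a sleep of the matching "random polling" value, and the fifth sleep (87)
# was cut short when the job connected back.
sleeps_before_polls = random_polling[:4]
elapsed = sum(sleeps_before_polls)

# Cross-check: seconds that passed between consecutive timeout log lines.
countdown_deltas = [a - b
                    for a, b in zip(reresolve_timeout, reresolve_timeout[1:])]

print("interval doubles each round:", doubling)
print("assumed sleeps before polls:", sleeps_before_polls)
print("seconds between timeout log lines:", countdown_deltas)
print("seconds slept before the last 'unenrolled' poll:", elapsed)
```

Under that assumption, the job was still unenrolled roughly 55 seconds after submission, which is in line with the delay of 45 seconds to a minute described in the message. The backoff only determines how often qrsh asks. Why the job stayed unenrolled that long is a separate question for the scheduler settings.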




