[GE users] New, single machine setup, no submitted jobs being processed

jcholewa jcholewa at nshs.edu
Tue Nov 10 16:45:17 GMT 2009


> On 06.11.2009 at 19:49, jcholewa wrote:
> All looks fine. There is no load on the system and so also no other  
> (interactive) process is putting the queue into alarm state.
> 
> Can you change the setting of the scheduler (qconf -msconf) to
> "schedd_job_info true" and run `qstat -j 2` again?
> 
> What do:
> 
> qstat -f
> 
> qstat -g c
>
> show?

# qconf -ssconf | grep schedd
schedd_job_info                   true

# qstat -j 11
==============================================================
job_number:                 11
exec_file:                  job_scripts/11
submission_time:            Tue Nov 10 11:07:16 2009
owner:                      root
uid:                        0
group:                      root
gid:                        0
sge_o_home:                 /root
sge_o_log_name:             root
sge_o_path:                 /opt/sge/bin/lx24-amd64:/usr/sbin:/bin:/usr/bin:/sbin
sge_o_shell:                /bin/bash
sge_o_workdir:              /opt
sge_o_host:                 sun
account:                    sge
mail_list:                  root at sun
notify:                     FALSE
job_name:                   qwe
jobshare:                   0
env_list:
script_file:                /tmp/qwe
scheduling info:            queue instance "all.q at sun" dropped because it is temporarily not available
                            All queues dropped because of overload or full


I ran the above command just now, well after the steps described in the paragraphs below about checking the logs and so forth (also see below if you are wondering why it is currently at job 11).  Right now, I'm hunting through `man qmod` to see if I can clear the queue instance from being dropped.



# qstat -f
   queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q at sun                      BIP   0/0/16         -NA-     lx24-amd64    au

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     11 0.55500 qwe        root         qw    11/10/2009 11:07:16     1
  

# qstat -g c
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q                             -NA-      0      0      0     16      0     16
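
For reference, the `au` in the states column of `qstat -f` can be read letter by letter. Below is a minimal sketch of a decoder, with meanings taken from the qstat(1) man page. The function name `explain_queue_state` is made up for illustration, and only a subset of the possible letters is covered.

```shell
#!/bin/sh
# Decode each letter of a queue-instance state string as printed by `qstat -f`.
# Hypothetical helper; meanings follow qstat(1), and only some letters are listed.
explain_queue_state() {
    states=$1
    while [ -n "$states" ]; do
        rest=${states#?}          # everything after the first character
        letter=${states%"$rest"}  # the first character itself
        case $letter in
            a) echo "a: load threshold alarm" ;;
            A) echo "A: suspend threshold alarm" ;;
            u) echo "u: unknown (sge_execd on that host is not reporting to qmaster)" ;;
            d) echo "d: disabled" ;;
            s) echo "s: suspended" ;;
            E) echo "E: error" ;;
            *) echo "$letter: see qstat(1)" ;;
        esac
        states=$rest
    done
}
```

Calling `explain_queue_state au` prints one line for `a` and one for `u`. A `u` state together with a `-NA-` load average would be consistent with the execution daemon not delivering load reports, though that is only one possible reading. `qstat -f -explain a` should print the reason for an alarm state.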



The scheduling inexplicably started working yesterday.  I was able to submit a few test jobs, which were executed in seconds, and I told our resident DNA scientist to submit something big.  When I checked this morning, it was not working again.  I checked the qmaster log.  It usually isn't very informative, but this time it gave some possibly helpful hints:

$SGE_ROOT/default/spool/qmaster/messages :
11/09/2009 18:52:45| timer|sun|W|got timeout error while write data to heartbeat file "heartbeat"
11/09/2009 19:00:42|event_|sun|E|acknowledge timeout after 600 seconds for event client (schedd:0) on host "sun"
11/09/2009 19:12:57|event_|sun|E|no event client known with id 1 to process acknowledgements
11/09/2009 19:19:32|event_|sun|E|no event client known with id 1 to modify
11/09/2009 19:19:32|event_|sun|E|no event client known with id 1 to process acknowledgements
(repeats once a minute through to this morning)
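
The log lines above appear to be pipe-delimited (timestamp, thread, host, level, message text). Assuming that layout holds for the whole file, a small filter like the following sketch would pull out only the error-level (`E`) entries. The function name `qmaster_errors` is made up for illustration.

```shell
#!/bin/sh
# Print only error-level (E) entries from a qmaster messages file read on stdin.
# Assumes the layout seen above: timestamp|thread|host|level|message text
qmaster_errors() {
    awk -F'|' '$4 == "E" {
        msg = $5
        # Re-join the message in case the text itself contains a "|"
        for (i = 6; i <= NF; i++) msg = msg "|" $i
        print $1 ": " msg
    }'
}
```

It could be run as `qmaster_errors < "$SGE_ROOT/default/spool/qmaster/messages"`, which for the excerpt above would skip the `W` heartbeat warning and keep the four `event_` errors.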

The big job would take far longer than 600 seconds.  I did some hunting and found "http://gridengine.sunsource.net/issues/show_bug.cgi?id=2890", which suggests that a qmaster parameter "SCHEDULER_TIMEOUT" be given a "high value" (it also says the bug is fixed, so this might not be my issue at all).  I added the variable with `qconf -mconf` (it wasn't there before, so please let me know if it needs to be added elsewhere instead) and set it to a year.
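
One way to confirm that the setting landed where it is expected is to look for it in the global configuration. The sketch below assumes the parameter belongs on the `qmaster_params` line (the linked issue calls it a qmaster parameter) and takes the form `SCHEDULER_TIMEOUT=<seconds>`. The helper name `scheduler_timeout` is made up, and 31536000 (365 days in seconds) is only an illustrative value for "a year".

```shell
#!/bin/sh
# Read `qconf -sconf` output on stdin and print the SCHEDULER_TIMEOUT value, if set.
# Assumption: the parameter sits on the qmaster_params line as SCHEDULER_TIMEOUT=<seconds>.
scheduler_timeout() {
    sed -n 's/^qmaster_params.*SCHEDULER_TIMEOUT=\([0-9][0-9]*\).*/\1/p'
}
```

Running `qconf -sconf | scheduler_timeout` would then print the number of seconds, or nothing at all if the parameter is not on that line.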


# qconf -secl
      ID NAME            HOST
--------------------------------------------------
       1 scheduler       sun
       

I just checked to make sure that the scheduler isn't seen as dead.  It didn't work, so I tried restarting the qmaster process.  It is currently not processing submitted jobs, as was happening originally.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=226054
