[GE users] New, single machine setup, no submitted jobs being processed

reuti reuti at staff.uni-marburg.de
Tue Nov 10 21:46:00 GMT 2009


On 10.11.2009 at 17:45, jcholewa wrote:

>> On 06.11.2009 at 19:49, jcholewa wrote:
>> All looks fine. There is no load on the system and so also no other
>> (interactive) process is putting the queue into alarm state.
>>
>> Can you change the setting of the scheduler (qconf -msconf) to
>> "schedd_job_info true" and run `qstat -j 2` again?
>>
>> What do:
>>
>> qstat -f
>>
>> qstat -g c
>>
>> show?
>
> # qconf -ssconf | grep schedd
> schedd_job_info                   true
>
> # qstat -j 11
> ==============================================================
> job_number:                 11
> exec_file:                  job_scripts/11
> submission_time:            Tue Nov 10 11:07:16 2009
> owner:                      root
> uid:                        0
> group:                      root
> gid:                        0
> sge_o_home:                 /root
> sge_o_log_name:             root
> sge_o_path:                 /opt/sge/bin/lx24-amd64:/usr/sbin:/bin:/usr/bin:/sbin
> sge_o_shell:                /bin/bash
> sge_o_workdir:              /opt
> sge_o_host:                 sun
> account:                    sge
> mail_list:                  root at sun
> notify:                     FALSE
> job_name:                   qwe
> jobshare:                   0
> env_list:
> script_file:                /tmp/qwe
> scheduling info:            queue instance "all.q@sun" dropped because it is temporarily not available
>                             All queues dropped because of overload or full
>
>
> I ran the above command just now, well after the steps described in
> the paragraphs below about checking the logs and so forth (also see
> below if you are wondering why it is currently at job 11).  Right
> now, I'm hunting through `man qmod` to see if I can clear the queue
> from being dropped.
>
>
>
> # qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@sun                      BIP   0/0/16         -NA-     lx24-amd64    au

So the sge_execd has vanished: the queue state "au" (alarm, unknown)
and the load of -NA- mean that the qmaster gets no reports from it.
IIRC this is just one machine, so the question is why it aborted:

- Is any of the file systems full?
- Does the machine have a fixed IP address, rather than a dynamically
  assigned one?
- Is there any file in /tmp, created by the execd, containing an error
  message?
- Is the directory $SGE_ROOT/default/spool/qmaster writable, and is the
  heartbeat file from the last run located there?
- Is there any memory problem (`free` might tell)?
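
The checks above could be run in one go with something like the
following. It is only a sketch: SGE_ROOT=/opt/sge and the cell name
"default" are guesses based on the paths quoted in this thread,
check_writable is a helper made up for this example, and the search
for execd files in /tmp is a plain name match, not a documented file
name.

```shell
#!/bin/sh
# Sketch of the checks listed above. SGE_ROOT=/opt/sge and the cell name
# "default" are assumptions taken from the paths quoted in this thread.
SGE_ROOT="${SGE_ROOT:-/opt/sge}"
SPOOL="$SGE_ROOT/default/spool/qmaster"

# Hypothetical helper: report whether a directory exists and is writable
# by the current user.
check_writable() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "writable: $1"
    else
        echo "NOT writable: $1"
    fi
}

echo "== file systems =="
df -h

echo "== address this host name resolves to =="
getent hosts "$(hostname)" 2>/dev/null || echo "no entry for $(hostname)"

echo "== execd leftovers in /tmp (name match only) =="
ls -l /tmp | grep -i execd || echo "none found"

echo "== qmaster spool =="
check_writable "$SPOOL"
ls -l "$SPOOL/heartbeat" 2>/dev/null || echo "no heartbeat file in $SPOOL"

echo "== memory =="
if command -v free >/dev/null 2>&1; then
    free -m
fi

echo "== is sge_execd running? =="
ps -e | grep '[s]ge_execd' || echo "sge_execd is not running"
```

Run it as the user that starts the daemons, since the writability test
depends on who executes it (root will nearly always see "writable").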

-- Reuti


> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>      11 0.55500 qwe        root         qw    11/10/2009 11:07:16     1
>
>
> # qstat -g c
> CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
> --------------------------------------------------------------------------------
> all.q                             -NA-      0      0      0     16      0     16
>
>
>
> The scheduling inexplicably started working yesterday.  I was able  
> to submit a few test jobs, which were executed in seconds, and I  
> told our resident DNA scientist to submit something big.  When I  
> checked it this morning, it was not working again.  I checked the  
> qmaster log.  It usually isn't very informative, but this time  
> around it gave some possibly helpful hints...
>
> $SGE_ROOT/default/spool/qmaster/messages :
> 11/09/2009 18:52:45| timer|sun|W|got timeout error while write data to heartbeat file "heartbeat"
> 11/09/2009 19:00:42|event_|sun|E|acknowledge timeout after 600 seconds for event client (schedd:0) on host "sun"
> 11/09/2009 19:12:57|event_|sun|E|no event client known with id 1 to process acknowledgements
> 11/09/2009 19:19:32|event_|sun|E|no event client known with id 1 to modify
> 11/09/2009 19:19:32|event_|sun|E|no event client known with id 1 to process acknowledgements
> (repeats once a minute through to this morning)
>
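
With entries repeating once a minute, a summary is easier to read than
the raw file. A minimal sketch, assuming the messages file uses the
'|'-separated layout shown above (timestamp, component, host, level,
text); summarize_messages is a name made up for this example, and the
default path assumes SGE_ROOT=/opt/sge:

```shell
#!/bin/sh
# Count how often each warning (W) or error (E) text occurs in an SGE
# messages file and list the most frequent first.
# Assumed field layout, as in the log lines quoted above:
#   timestamp|component|host|level|text
summarize_messages() {
    awk -F'|' '
        $4 == "E" || $4 == "W" { count[$4 " " $5]++ }
        END { for (k in count) printf "%d %s\n", count[k], k }
    ' "$1" | sort -rn
}

# The path is an assumption based on the location quoted in this thread.
MESSAGES="${SGE_ROOT:-/opt/sge}/default/spool/qmaster/messages"
if [ -r "$MESSAGES" ]; then
    summarize_messages "$MESSAGES"
fi
```

Only the fifth field is kept as the text, so a message that itself
contains a '|' would be cut short in the summary.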
> The big job would take far longer than 600 seconds.  I did some
> hunting and found
> "http://gridengine.sunsource.net/issues/show_bug.cgi?id=2890", which
> suggests that a qmaster parameter "SCHEDULER_TIMEOUT" be given a
> "high value" (it also says the bug is fixed, so this might not be my
> issue at all).  I added the variable with `qconf -mconf` (it wasn't
> there before, so please let me know if it needs to be added elsewhere
> instead) and set it to a year.
>
>
> # qconf -secl
>       ID NAME            HOST
> --------------------------------------------------
>        1 scheduler       sun
>
>
> I just checked to make sure that the scheduler isn't seen as dead.
> It didn't work, and I tried restarting the qmaster process.  It is
> currently not processing submitted jobs, as was happening originally.
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=226079

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


