[GE users] cannot run on host until clean up of an previous run has finished

prentice prentice at ias.edu
Wed Feb 24 16:57:08 GMT 2010


I found it in the source. It's in source/libs/sched/sge_select_queue.c
This is from the latest CVS source, and I'm using 6.2u3.

I'm assuming the jids/taskids in the reschedule_unknown-list are stored in
a file somewhere, since reboots haven't fixed this problem. Where should
I look?

/* RU: */
   /*
   ** check if job can run on host based on the list of jids/taskids
   ** contained in the reschedule_unknown-list
   */
   if (a->ja_task) {
      lListElem *ruep;
      lList *rulp;
      u_long32 task_id;

      task_id = lGetUlong(a->ja_task, JAT_task_number);
      rulp = lGetList(host, EH_reschedule_unknown_list);

      for_each(ruep, rulp) {
         if (lGetUlong(ruep, RU_job_number) == a->job_id
             && lGetUlong(ruep, RU_task_number) == task_id) {
            DPRINTF(("RU: Job "sge_u32"."sge_u32" Host "SFN"\n", a->job_id,
               task_id, eh_name));
            schedd_mes_add(a->monitor_alpp, a->monitor_next_run, a->job_id,
                           SCHEDD_INFO_CLEANUPNECESSARY_S, eh_name);
            DRETURN(DISPATCH_NEVER_JOB);
         }
      }
   }
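
In plain terms, the excerpt above makes the scheduler refuse a host outright (DISPATCH_NEVER_JOB) whenever the job/task pair still appears on that host's reschedule_unknown list. A minimal Python sketch of that check, with my own illustrative names rather than actual SGE identifiers:

```python
# Sketch of the check in sge_select_queue.c: a job/task pair is refused
# on a host while that pair is still on the host's reschedule_unknown
# list (i.e. cleanup of the previous run is pending).
# All names here are illustrative, not SGE identifiers.

def can_dispatch(job_id, task_id, reschedule_unknown):
    """Return False if (job_id, task_id) is still awaiting cleanup on this host."""
    for entry in reschedule_unknown:
        if entry == (job_id, task_id):
            return False  # corresponds to DRETURN(DISPATCH_NEVER_JOB)
    return True

# Example: job 42 task 1 is still listed on the host's reschedule_unknown list
pending = [(42, 1)]
print(can_dispatch(42, 1, pending))  # prints False -- job blocked on this host
print(can_dispatch(42, 2, pending))  # prints True  -- other tasks unaffected
```

Note there is no timeout in this loop itself: the entry simply has to disappear from the host's reschedule_unknown list before the job will dispatch there again.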



templedf wrote:
> There is no explicit way to clear that state that I recall.  I'd have to 
> go look at the source again to remember where exactly that state lives, 
> but you could try bouncing that execd, and if that doesn't clear it, try 
> bouncing the master.
> 
> Daniel
> 
> On 02/24/10 06:08, prentice wrote:
>> This problem has been going on much longer than 5 minutes. Is there a
>> way to clear this "error"? No error is shown for the queue instance, but
>> jobs aren't running.
>>
>> templedf wrote:
>>    
>>> The "cleanup" is really just an excuse.  When a job fails on a host,
>>> there's a timeout (5 minutes, I think) before it's allowed to try
>>> running on that host again.
>>>
>>> Daniel
>>>
>>> On 02/24/10 05:54, prentice wrote:
>>>      
>>>> Dear GE Users,
>>>>
>>>> A couple of weeks ago, that big snowstorm that hit the mid-Atlantic took
>>>> out the power to my server room, causing the cluster to go down very
>>>> ungracefully.
>>>>
>>>> Now, a large job can't run because SGE says there's not enough slots for
>>>> the PE. When I do qstat -j <jobid>, I get a lot of messages like this:
>>>>
>>>> cannot run on host "node24.aurora" until clean up of an previous run has
>>>> finished
>>>>
>>>> I'm sure this is leftover from the ungraceful shutdown of SGE. What is
>>>> the best way to "clean up" these previous runs?
>>>>
>>>>
>>>>        
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245864
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>
>>>      
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245870
> 
> 

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245907



