[GE users] Scheduler Problems ... HELP!

murphygb brian.murphy at siemens.com
Fri Jul 16 18:37:03 BST 2010


Daniel,

What I can tell you is that with sched_info turned on and a user submitting more than some number of jobs, memory usage of the master process goes through the roof and OGE leaves all incoming jobs pending.  qstat -j for every job says:

Can not get job info messages, scheduler is not available

As soon as we turn it off and restart the master, everything seems fine.  We found info on the 'net about setting SCHEDULER_TIMEOUT in qmaster_params to something between 600 and 1200, but that did not help.  Turning off sched info is the only thing we have found that corrects the situation.  I hope it is not a "significant misconfiguration", since we paid a lot of money for a consultant to come in and configure the system.
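
For anyone who wants to check the setting before restarting anything, here is a minimal sketch.  It assumes the "sched info" switch discussed here is the schedd_job_info entry of the scheduler configuration, and that `qconf -ssconf` prints one "key value" pair per line.  The function names are made up for illustration.

```python
import subprocess


def read_sched_conf():
    """Return the raw text of the scheduler configuration.

    Assumption: `qconf -ssconf` is on PATH and the caller may read the config.
    """
    result = subprocess.run(
        ["qconf", "-ssconf"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_sched_conf(text):
    """Turn 'key   value' lines into a dict; lines without a value are skipped."""
    conf = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def job_info_mode(conf):
    """Return the first word of schedd_job_info in lower case, or None if absent.

    Assumption: this is the parameter the thread calls "sched info".
    """
    value = conf.get("schedd_job_info")
    return value.split()[0].lower() if value else None
```

On a host with admin access, job_info_mode(parse_sched_conf(read_sched_conf())) would report the current value.  Changing it is still done by hand, e.g. with qconf -msconf.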

> Brian,
> 
> Something's not right. :)  730 jobs is a drop in the bucket.  With a 
> master that size you should be able to handle 200,000 jobs easily. 
> Sounds like there's either a significant misconfiguration or a 
> misinterpretation of the problem.  (OGE doesn't always make it easy to 
> understand what's happening.)
> 
> If the pegged CPU is persistent, try turning on debug mode and see what 
> it's doing:
> 
> http://blogs.sun.com/templedf/entry/using_debugging_output
> 
> I'd try level 1 to start with.
> 
> Daniel
> 
> On 07/13/10 11:01 AM, murphygb wrote:
> >> Hi,
> >>
> >> Am 13.07.2010 um 16:08 schrieb murphygb:
> >>
> >>> I have a job that seems to be stuck in the scheduler, and all jobs that get submitted are pending.  From my 'messages' file I have a ton of these:
> >>>
> >>> 07/13/2010 10:03:10|schedu|usorl03p430|E|scheduler tries to schedule job 51112.1 twice
> >>> 07/13/2010 10:03:25|worker|usorl03p430|E|scheduler tries to schedule job 51112.1 twice
> >>> 07/13/2010 10:03:25|worker|usorl03p430|W|Skipping remaining 8 orders
> >>> 07/13/2010 10:03:25|schedu|usorl03p430|E|scheduler tries to schedule job 51112.1 twice
> >>> 07/13/2010 10:03:40|worker|usorl03p430|E|scheduler tries to schedule job 51112.1 twice
> >>> 07/13/2010 10:03:40|worker|usorl03p430|W|Skipping remaining 8 orders
> >>> 07/13/2010 10:03:40|schedu|usorl03p430|E|scheduler tries to schedule job 51112.1 twice
> >>> 07/13/2010 10:03:55|worker|usorl03p430|E|scheduler tries to schedule job 51112.1 twice
> >>> 07/13/2010 10:03:55|worker|usorl03p430|W|Skipping remaining 8 orders
> >>> 07/13/2010 10:03:55|schedu|usorl03p430|E|scheduler tries to schedule job 51112.1 twice
> >>>
> >>> If I try to qdel this job I get the message that the job is already in deletion.  What can I do?  We have rebooted the master but that did not help.  6.2u5
> >>
> >> is the job still somewhere hanging around on a node?
> >>
> >> -- Reuti
> >>
> >>
> > Well, I was able to get it cleared by deleting that job from the jobs directory on the master and then restarting the master (thanks Sinisa), but I think it was a byproduct of a bigger issue.  The user is submitting 730 jobs all at once and the scheduler is getting overwhelmed.  As soon as this happens, those jobs and all others submitted after them stay pending, and every qstat -j command shows the job info but no scheduler info, and says:
> >
> > Can not get job info messages, scheduler is not available
> >
> > Master has 32GB mem and 4 processors.  The master process is using 17 and has one cpu pegged at 100%.  Any tweaks I can do to resolve this?  Linux RHEL 5.4 SGE 6.2u5
> >>> ------------------------------------------------------
> >>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=267782
> >>>
> >>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=267811
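
The repeated "scheduler tries to schedule job 51112.1 twice" entries quoted above can be tallied per job to see which job the scheduler keeps retrying.  This is only a sketch: it assumes every line of the qmaster messages file has the five "|"-separated fields seen in the excerpt (timestamp, thread, host, level, message), and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches the error text seen in the thread and captures the job.task id.
TWICE = re.compile(r"scheduler tries to schedule job (\d+\.\d+) twice")


def parse_line(line):
    """Split one messages line into its five fields; return None if malformed."""
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        return None
    keys = ("time", "thread", "host", "level", "message")
    return dict(zip(keys, parts))


def count_double_schedules(lines):
    """Count level-E 'schedule job ... twice' entries per job id."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None or record["level"] != "E":
            continue
        match = TWICE.search(record["message"])
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file object for the messages file to count_double_schedules() gives a Counter keyed by job id.  A single job with a very high count would point at the kind of stuck job described above.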

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=268397
