Opened 12 years ago
Last modified 10 years ago
#669 new defect
IZ3021: problem with queue assignments and PE jobs
Reported by: | johnfol | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.2u2 |
Severity: | minor | Keywords: | scheduling |
Cc: |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3021]
Issue #: 3021
Platform: All
Reporter: johnfol (johnfol)
Component: gridengine
OS: All
Subcomponent: scheduling
Version: 6.2u2
CC: None defined
Status: NEW
Priority: P1
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: andreas (andreas)
QA Contact: andreas
URL:
* Summary: problem with queue assignments and PE jobs
Status whiteboard:
Attachments:
Issue 3021 blocks:
Votes for issue 3021: 1
Opened: Thu May 7 13:14:00 -0700 2009
------------------------

OK, I've answered my question myself, I think.... I've turned on debugging and can see what's happening with this problem. When a non-pe job is submitted, the queues are sorted correctly in the .../source/libs/sched/sge_select_queue.c file - it appears that this occurs in the "sequential_update_host_order" function of that file. However, I don't see this happening in the case of a pe job (which calls the "sge_select_parallel_environment" routine instead of the "sge_sequential_assignment" routine).

There's been a fair amount of work in this file in the past few months, mainly by "roland" - so I guess I'd ask if you could take a quick look and see if there's anything "obvious" that's missing here. (I could, of course, start hacking around in the code, but not being familiar with the whole structure I'm sure I'd do more damage than good.....).

Also, just as a reference, the debugging output below shows the problem directly. A pe job was submitted, and I captured the section of the debug output where the queue selection is being made. (The actual qsub command was "qsub -shell no -cwd -V -b y -p -512 -N xfdtd -A test \
-pe standard_pe 4 -q primary@@*,,secondary@@* \
/appl/sun/grid_engine/site_PCSRL/scripts/start_xfdtd_701 4").

You can see on line 26976 that the queue "primary@lxdel10.srl.css.mot.com" has been examined, but it's not in the hard queue list that I specified on the command line, so it's rejected (that's correct).
However, right below that, on line 26985, the queue "secondary@lxdel10.srl.css.mot.com" is examined, and it eventually is deemed OK and assigned. *It should not have been examined at this point, however!* All the primary queues should have been examined before any of the secondary queues, because the primary queue has a lower seq_number than the secondary queue.

I will look into "signing in" and submitting an official bug report later this afternoon. However, (and unfortunately), the entire design of my SGE system relies on this to work as expected - so I really need to get this fixed (or some easy workaround) as soon as possible. I'm more than willing to compile in a fix until an official release can be made, but I would greatly appreciate any pointers as to what to add/modify to make this work as desired in the meantime. (It's also possible, of course, that I'm still doing something wrong, but no one has mentioned anything yet about this working as designed, so I'm assuming at this point that it's a bug in the code.)

Thanks to the authors for any help you can send my way!
John

> 26976 14997 scheduler000 Queue "primary@lxdel10.srl.css.mot.com" is not contained in the hard queue list (-q) that was requested by job 214
> 26977 14997 scheduler000 --> schedd_mes_add() {
> 26978 14997 scheduler000 --> sge_schedd_text() {
> 26979 14997 scheduler000 <-- sge_schedd_text() ../libs/sched/sge_schedd_text.c 421 }
> 26980 14997 scheduler000 <-- schedd_mes_add() ../libs/sched/schedd_message.c 412 }
> 26981 14997 scheduler000 <-- sge_queue_match_static() ../libs/sched/sge_select_queue.c 1622 }
> 26982 14997 scheduler000 parallel_queue_slots(primary@lxdel10.srl.css.mot.com) returns <error>
> 26983 14997 scheduler000 <-- parallel_queue_slots() ../libs/sched/sge_select_queue.c 5092 }
> 26984 14997 scheduler000 HOST(1.5) lxdel10.srl.css.mot.com will get us nothing
> 26985 14997 scheduler000 checking queue secondary@lxdel10.srl.css.mot.com because cqueue secondary is not rejected
> 26986 14997 scheduler000 --> parallel_queue_slots() {
> 26987 14997 scheduler000 --> sge_queue_match_static() {

John Foley wrote:
> Well, I turned on the PROFILE and MONITOR params for the scheduler, so
> now my qconf -msconf looks like this:
>
>> algorithm default
>> schedule_interval 0:0:15
>> maxujobs 0
>> queue_sort_method seqno
>> job_load_adjustments np_load_avg=0.50
>> load_adjustment_decay_time 0:7:30
>> load_formula np_load_avg
>> schedd_job_info true
>> flush_submit_sec 0
>> flush_finish_sec 0
>> params PROFILE=1,MONITOR=1
>> reprioritize_interval 0:0:0
>> halftime 168
>> usage_weight_list cpu=1.000000,mem=0.000000,io=0.000000
>> compensation_factor 5.000000
>> weight_user 0.250000
>> weight_project 0.250000
>> weight_department 0.250000
>> weight_job 0.250000
>> weight_tickets_functional 0
>> weight_tickets_share 0
>> share_override_tickets TRUE
>> share_functional_shares TRUE
>> max_functional_jobs_to_schedule 200
>> report_pjob_tickets TRUE
>> max_pending_tasks_per_job 50
>> halflife_decay_list none
>> policy_hierarchy OFS
>> weight_ticket 0.000000
>> weight_waiting_time 0.000000
>> weight_deadline 3600000.000000
>> weight_urgency 0.000000
>> weight_priority 400.000000
>> max_reservation 0
>> default_duration INFINITY
>
> however, the common/schedule file just gets a bunch of ":"s in it - nothing else.
> The spool/qmaster/messages file gets a bunch of stuff, mostly timing details,
> it looks like. I couldn't decipher anything in there that looks like the "why"
> behind the scheduler's decisions. Is there any other way to get debug output
> from the scheduler ?
>
> John Foley wrote:
>
>> Well, it certainly *looks* like my issue, but either it's not the
>> issue I'm seeing, or it really wasn't fixed in 6.2u2 (as Richard
>> mentioned).
>>
>> I tried the workaround of raising the sequence numbers of the
>> queues to very high numbers (1500 and 3000) but still see the
>> same thing.
>>
>> To answer Daniel's question, I checked to make sure (!) and
>> yes, the standard_pe is referenced in both the primary and
>> secondary queues.
>>
>> So, next question is, I guess, is there any debugging or other
>> option that can be turned on to figure out why the scheduler is
>> making this decision ? From looking at the sge_conf man page,
>> it looks like this is possible using the qconf -msconf command
>> and modifying the "params" field, but I thought I'd better
>> check here first to see if that's the best way to do it (or if
>> that actually will accomplish what I'm looking for). If this
>> is the best way, could someone show an example of using that
>> command ?
>>
>> Thanks,
>>
>> John
>>
>> rems0 wrote:
>>
>>> olesen wrote:
>>>
>>>> It looks to me like you are hitting this issue:
>>>>
>>>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2864
>>>
>>> I'm also hitting this issue, but I'm using GE 6.2u1.
>>> John is using GE 6.2u2, and this issue status is marked as RESOLVED and
>>> FIXED for 6.2u2 !
>>> Apparently it's not fixed at all! Or did I misunderstand the "Target
>>> milestone" tag?
>>> Andreas?
>>>
>>> Thanks, Richard

--
########################################################################
# John Foley                          # Location: IL93-E1-21S          #
# IT & Systems Administration         # Maildrop: IL93-E1-35O          #
# Antenna & Mechanical Simulation Grp # Email: john.foley@motorola.com #
# Motorola, Inc. - Mobile Devices     # Phone: (847) 523-8719          #
# 600 North US Highway 45             # Fax: (847) 523-5767            #
# Libertyville, IL. 60048 (USA)       # Cell: (847) 460-8719           #
########################################################################
(this email sent using Mozilla on Windows)
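For anyone trying to reproduce this, the expectation described in the report (every queue instance with a lower seq_no is examined before any with a higher one) can be checked against a scheduler debug trace with a small script. The following is only a sketch, not part of Grid Engine: the two regular expressions are taken from the two message shapes visible in the trace quoted in this ticket, while the seq_no values (10 and 20) and the second host name `otherhost.example` are hypothetical placeholders chosen for illustration.

```python
import re

# Message shapes copied from trace lines 26976 and 26985 in this ticket.
# Assumption: other scheduler messages are irrelevant and are ignored.
CHECKING = re.compile(r'checking queue (\S+) because')
NOT_IN_HARD_LIST = re.compile(
    r'Queue "([^"]+)" is not contained in the hard queue list')


def cqueue_of(queue_instance):
    """Return the cluster queue part of 'cqueue@host'."""
    return queue_instance.split("@", 1)[0]


def examined_queues(trace_lines):
    """Queue instances in the order the trace first mentions them."""
    seen = []
    for line in trace_lines:
        match = CHECKING.search(line) or NOT_IN_HARD_LIST.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def first_out_of_order(observed, all_instances, seq_no):
    """First instance examined while a lower-seq_no instance is still
    unexamined, or None if the observed order respects seq_no."""
    pending = set(all_instances)
    for qi in observed:
        pending.discard(qi)
        mine = seq_no[cqueue_of(qi)]
        if any(seq_no[cqueue_of(other)] < mine for other in pending):
            return qi
    return None


# Two lines from the trace in this ticket.
TRACE = [
    '26976 14997 scheduler000 Queue "primary@lxdel10.srl.css.mot.com" is '
    'not contained in the hard queue list (-q) that was requested by job 214',
    '26985 14997 scheduler000 checking queue '
    'secondary@lxdel10.srl.css.mot.com because cqueue secondary is not '
    'rejected',
]

# Hypothetical values: the ticket only says primary has the lower seq_no.
SEQ_NO = {"primary": 10, "secondary": 20}

# Hypothetical cluster: the real host plus one invented second host.
ALL_INSTANCES = [
    "primary@lxdel10.srl.css.mot.com",
    "secondary@lxdel10.srl.css.mot.com",
    "primary@otherhost.example",
    "secondary@otherhost.example",
]

if __name__ == "__main__":
    observed = examined_queues(TRACE)
    print("observed order:", observed)
    print("first out of order:",
          first_out_of_order(observed, ALL_INSTANCES, SEQ_NO))
```

With a real trace, `ALL_INSTANCES` and `SEQ_NO` would have to be filled in from the actual cluster configuration; the script only reports where the observed order departs from a strict seq_no ordering and says nothing about why the scheduler chose that order.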
Change History (1)
comment:1 Changed 10 years ago by dlove
- Priority changed from highest to normal
- Severity set to minor