Opened 11 years ago
Closed 8 years ago
#788 closed defect (fixed)
IZ3250: Unable to submit job using advance reservation if h_rt is longer than 32999 seconds
Reported by: | mhanby | Owned by: | Dave Love <d.love@…> |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.2u5 |
Severity: | minor | Keywords: | Linux scheduling |
Cc: |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3250]
Issue #: 3250
Platform: All
Reporter: mhanby (mhanby)
Component: gridengine
OS: Linux
Subcomponent: scheduling
Version: 6.2u5
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: andreas (andreas)
QA Contact: andreas
URL: http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248081
* Summary: Unable to submit job using advance reservation if h_rt is longer than 32999 seconds
Status whiteboard:
Attachments:
Issue 3250 blocks:
Votes for issue 3250:
Opened: Fri Mar 12 14:37:00 -0700 2010
------------------------
See the discussion here: http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248081

GE 6.2u5
Linux x86_64 CentOS 5.4

I have created an advance reservation that has a duration of 225 hours and 59 minutes and 64 slots. The users in the ACL for the reservation can submit jobs using the reservation ID so long as their h_rt is 32999 seconds or less (9 hours 9 minutes and 59 seconds). Any longer than that and the qsub fails with "Unable to run job: error: no suitable queues".

$ qrstat -ar 15
--------------------------------------------------------------------------------
id                             15
name                           testAR
owner                          mikeh
state                          r
start_time                     03/12/2010 13:00:00
end_time                       03/21/2010 23:59:00
duration                       225:59:00
submission_time                03/12/2010 11:36:36
group                          sge
account                        sge
granted_slots_list             all.q@compute-1-4.local=8,all.q@compute-0-8.local=8,all.q@compute-0-7.local=8,all.q@compute-0-3.local=8,all.q@compute-0-12.local=8,all.q@compute-0-10.local=8,all.q@compute-0-5.local=3,all.q@compute-0-6.local=8,all.q@compute-0-14.local=5
granted_parallel_environment   lam_loose_rsh slots 64
mail_options                   abe
acl_list                       mikeh,jdoe

Next we can see the 64 slots in the reservation:

$ qstat -g c
CLUSTER QUEUE   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q             0.66    128     64     64    192      0      0

Now, try to submit two jobs using the reservation:

$ echo `/bin/hostname` | qsub -ar 15 -pe lam_loose_rsh 32 -l h_rt=09:09:59
Your job 111005 ("STDIN") has been submitted

$ echo `/bin/hostname` | qsub -ar 15 -pe lam_loose_rsh 32 -l h_rt=09:10:59
Unable to run job: error: no suitable queues.
Exiting.

It seems that I can submit jobs using this AR so long as the max runtime is no more than 32999 seconds. Any job submission of 33000 seconds or longer fails.

If I submit the same jobs without specifying a reservation, they will both submit and run properly.

Reuti also confirmed this behavior, although he found a slightly different max:

=========================================
> I also tried using "h_rt=32999" and "h_rt=33000" with the same
> results.

Yep, I must confirm this. But for me the limit is 9:09:00, i.e. 32940.

-- Reuti
=========================================
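For reference, the h_rt strings quoted in the report convert to seconds as follows. This is only a small Python sketch of the arithmetic; hms_to_seconds is a hypothetical helper written for this note and is not part of Grid Engine.

```python
def hms_to_seconds(text):
    """Convert an HRS:MIN:SECS string to a number of seconds."""
    hours, minutes, seconds = (int(field) for field in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# The accepted request, the rejected request, and the limit Reuti observed.
for h_rt in ("09:09:59", "09:10:59", "09:09:00"):
    print(h_rt, hms_to_seconds(h_rt))
```

The first value is the 32999-second limit from the summary, and the last is the 32940 that Reuti quotes.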
Change History (5)
comment:1 Changed 8 years ago by markdixon
comment:2 Changed 8 years ago by dlove
SGE <sge-bugs@…> writes:
I guess this can be fixed by one of:
1. Teach sge_parse_num_val to interpret 4-value time strings, update the documentation and test for any other craziness
2. Stop double_print_time_to_dstring from printing 4-value time strings
Looks like far, far greater scope for breakage by doing (1).
Yes, will do. Thanks and well spotted! The days field in the
human-readable output had never registered; changing that will be a
minor incompatibility, but will make it agree with the doc.
comment:4 Changed 8 years ago by markdixon
On Thu, 24 Jan 2013, Dave Love wrote:
...
Yes, will do. Thanks and well spotted! The days field in the
human-readable output had never registered; changing that will be a
minor incompatibility, but will make it agree with the doc.
...
I'm happy to prepare the patch for (2) and check that it fixes this bug, if that's helpful?
Mark
--
Mark Dixon Email : m.c.dixon@…
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
comment:5 Changed 8 years ago by Dave Love <d.love@…>
- Owner set to Dave Love <d.love@…>
- Resolution set to fixed
- Status changed from new to closed
In 4437/sge:
(Attempt to add a note to issue #788. Already tried to get it in via the website - let's see if I can convince the email gateway to do this...)
This bug is triggered when an advance reservation is 24 hours or longer in duration and a job is submitted with an h_rt request.
The cause is that function double_print_time_to_dstring produces a string of the format DAYS:HRS:MIN:SECS for times greater than a day, whereas function sge_parse_num_val reads only the first three fields and always interprets them as HRS:MIN:SECS. I note that the man page for sge_types only refers to 3-value time strings.
The original bug report stated that an advance reservation of 225 hours 59 minutes will not accept jobs longer than 9 hours 9 minutes 59 seconds. This is because the advance reservation duration can be represented as 9:9:59:0 (9 days, 9 hours, 59 minutes, 0 seconds), which is later interpreted as 9:9:59 (9 hours, 9 minutes, 59 seconds, i.e. 32999 seconds). Gridengine then thinks that any h_rt over this value exceeds the length of the reservation.
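The mismatch can be reproduced with a small model. This is a Python sketch under stated assumptions, not the gridengine C code: print_time and parse_first_three are hypothetical stand-ins for the behaviour described for double_print_time_to_dstring and sge_parse_num_val, and the exact field padding of the real output is not reproduced.

```python
SECONDS_PER_DAY = 86400


def print_time(seconds):
    """Model of the printer: DAYS:HRS:MIN:SECS once a days field is needed."""
    days, rest = divmod(int(seconds), SECONDS_PER_DAY)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    if days > 0:
        return f"{days}:{hours}:{minutes}:{secs}"
    return f"{hours}:{minutes}:{secs}"


def parse_first_three(text):
    """Model of the parser: read three fields, always as HRS:MIN:SECS."""
    hours, minutes, secs = (int(field) for field in text.split(":")[:3])
    return hours * 3600 + minutes * 60 + secs


duration = 225 * 3600 + 59 * 60           # the 225:59:00 reservation
printed = print_time(duration)            # four fields, days first
reparsed = parse_first_three(printed)     # days field misread as hours
print(duration, printed, reparsed)
```

In this model the 813540-second reservation is printed as 9:9:59:0 and read back as 32999 seconds, which matches the limit seen in the report.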
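I guess this can be fixed by one of:
1. Teach sge_parse_num_val to interpret 4-value time strings, update the documentation and test for any other craziness
2. Stop double_print_time_to_dstring from printing 4-value time strings
Looks like far, far greater scope for breakage by doing (1). (2) looks easiest. I'm completely ignoring the question of why gridengine does so many conversions internally.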
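As a rough illustration of what (2) amounts to, here is a Python sketch (not the actual patch) in which the printer never emits a days field, so that a 3-value parser reads the value back unchanged. Both function names are hypothetical, and the zero padding is an assumption chosen to match the qrstat duration shown in the report.

```python
def print_time_three_fields(seconds):
    """Print as HRS:MIN:SECS only, letting the hours field exceed 23."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def parse_hms(text):
    """Read a 3-value time string as HRS:MIN:SECS."""
    hours, minutes, secs = (int(field) for field in text.split(":"))
    return hours * 3600 + minutes * 60 + secs


duration = 225 * 3600 + 59 * 60
text = print_time_three_fields(duration)
print(text, parse_hms(text) == duration)
```

With only three fields printed, the 225:59:00 duration survives the print and parse round trip, so an h_rt request is compared against the full reservation length.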
Any thoughts about the "correct" way to fix this one?
Mark