Opened 9 years ago

Closed 6 years ago

#788 closed defect (fixed)

IZ3250: Unable to submit job using advance reservation if h_rt is longer than 32999 seconds

Reported by: mhanby
Owned by: Dave Love <d.love@…>
Priority: normal
Milestone:
Component: sge
Version: 6.2u5
Severity: minor
Keywords: Linux scheduling
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3250]

Issue #:           3250
Summary:           Unable to submit job using advance reservation if h_rt is longer than 32999 seconds
Reporter:          mhanby (mhanby)
Component:         gridengine
Subcomponent:      scheduling
Platform:          All
OS:                Linux
Version:           6.2u5
Status:            NEW
Priority:          P3
Resolution:
Issue type:        DEFECT
Target milestone:  ---
Assigned to:       andreas (andreas)
QA Contact:        andreas
CC:                None defined
URL:               http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248081


   Opened: Fri Mar 12 14:37:00 -0700 2010 
------------------------


See the discussion here:
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248081

GE 6.2u5
Linux x86_64
CentOS 5.4

I have created an advance reservation with a duration of 225 hours and 59 minutes and 64 slots. The users in the ACL for the reservation
can submit jobs using the reservation ID so long as their h_rt is 32999 seconds or less (9 hours 9 minutes and 59 seconds). With any longer
h_rt the qsub fails with "Unable to run job: error: no suitable queues".

$ qrstat -ar 15
--------------------------------------------------------------------------------
id                             15
name                           testAR
owner                          mikeh
state                          r
start_time                     03/12/2010 13:00:00
end_time                       03/21/2010 23:59:00
duration                       225:59:00
submission_time                03/12/2010 11:36:36
group                          sge
account                        sge
granted_slots_list             all.q@compute-1-4.local=8,all.q@compute-0-8.local=8,all.q@compute-0-7.local=8,all.q@compute-0-3.local=8,all.q@compute-0-12.local=8,all.q@compute-0-10.local=8,all.q@compute-0-5.local=3,all.q@compute-0-6.local=8,all.q@compute-0-14.local=5
granted_parallel_environment   lam_loose_rsh slots 64
mail_options                   abe
acl_list                       mikeh,jdoe

Next we can see the 64 slots in the reservation:
$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q                             0.66    128     64     64    192      0      0

Now, try and submit two jobs using the reservation:

$ echo `/bin/hostname` | qsub -ar 15 -pe lam_loose_rsh 32 -l h_rt=09:09:59
Your job 111005 ("STDIN") has been submitted

$ echo `/bin/hostname` | qsub -ar 15 -pe lam_loose_rsh 32 -l h_rt=09:10:59
Unable to run job: error: no suitable queues.
Exiting.

It seems that I can submit jobs using this AR so long as the max runtime is 32999 seconds or less. Any job submission requesting 33000 seconds or
longer fails.

If I submit the same jobs without specifying a reservation, they will both submit and run properly.

Reuti also confirmed this behavior, although he found a slightly different max:

=========================================
> I also tried using "h_rt=32999" and "h_rt=33000" with the same
> results.

Yep, I must confirm this. But for me the limit is 9:09:00, i.e. 32940.

-- Reuti
=========================================

Change History (5)

comment:1 Changed 6 years ago by markdixon

(Attempt to add a note to issue #788. Already tried to get it in via the website - let's see if I can convince the email gateway to do this...)

This bug is triggered when an advance reservation is 24 hours or longer in duration and a job is submitted with an h_rt request.

It is because function double_print_time_to_dstring results in a string of the format DAYS:HRS:MIN:SECS for times greater than a day, whereas function sge_parse_num_val only reads the first three fields and always interprets them as HRS:MIN:SECS. I note that the man page for sge_types only refers to 3-value time strings.

The original bug report stated that an advance reservation of 225 hours 59 minutes will not accept jobs longer than 9 hours 9 minutes 59 seconds. This is because the advance reservation duration can be represented as 9:9:59:0, which is later interpreted as 9:9:59. Gridengine will think anything over this value exceeds the length of the reservation.
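To make the arithmetic concrete, here is a minimal sketch of the mismatch in Python. The two helpers are hypothetical stand-ins that only mimic the behaviour described above; they are not the Grid Engine C functions, and the exact formatting used internally is an assumption.

```python
# Illustrative only: these helpers mimic the described behaviour,
# they are not the actual Grid Engine implementation.

def format_with_days(seconds: int) -> str:
    """Format as D:H:M:S when there is at least one whole day, else H:M:S."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    if days > 0:
        return f"{days}:{hours}:{minutes}:{secs}"
    return f"{hours}:{minutes}:{secs}"


def parse_first_three_fields(text: str) -> int:
    """Read only the first three colon-separated fields and treat them as H:M:S.

    Assumes the input has at least three fields.
    """
    hours, minutes, secs = (int(field) for field in text.split(":")[:3])
    return hours * 3600 + minutes * 60 + secs


ar_duration = 225 * 3600 + 59 * 60           # 225:59:00 from the qrstat output
printed = format_with_days(ar_duration)       # days field is included
misread = parse_first_three_fields(printed)   # days field is read as hours

print(printed, misread)
```

With these assumptions the reservation length is read back as 32999 seconds, which matches the limit seen in the report: h_rt=09:09:59 fits and h_rt=09:10:59 does not.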

I guess this can be fixed by one of:

  1. Teach sge_parse_num_val to interpret 4-value time strings, update the documentation and test for any other craziness
  2. Stop double_print_time_to_dstring from printing 4-value time strings

Looks like far, far greater scope for breakage by doing (1). (2) looks easiest. I'm completely ignoring the question of why gridengine does so many conversions internally.
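As a rough sketch of what option (2) amounts to, the Python below prints durations with three fields only and lets the hours field grow past 23, so that a three-field parser reads the value back unchanged. The function names are hypothetical and the zero-padding is an assumption; this is not the actual patch.

```python
# Illustrative only: hypothetical helpers, not the Grid Engine C code.

def format_hms(seconds: int) -> str:
    """Format as H:M:S only; the hours field is allowed to exceed 23."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def parse_hms(text: str) -> int:
    """Parse a three-field H:M:S string into seconds."""
    hours, minutes, secs = (int(field) for field in text.split(":"))
    return hours * 3600 + minutes * 60 + secs


ar_duration = 225 * 3600 + 59 * 60   # the 225:59:00 reservation from the report
printed = format_hms(ar_duration)
round_trip = parse_hms(printed)

print(printed, round_trip == ar_duration)
```

The trade-off is that any human-readable output that used to show a days field would now show large hour counts instead.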

Any thoughts about the "correct" way to fix this one?

Mark

comment:2 Changed 6 years ago by dlove

SGE <sge-bugs@…> writes:

> I guess this can be fixed by one of:
>
>   1. Teach sge_parse_num_val to interpret 4-value time strings, update the documentation and test for any other craziness
>   2. Stop double_print_time_to_dstring from printing 4-value time strings
>
> Looks like far, far greater scope for breakage by doing (1).

Yes, will do. Thanks and well spotted! The days field in the
human-readable output had never registered; changing that will be a
minor incompatibility, but will make it agree with the doc.

comment:3 Changed 6 years ago by markdixon

  • Severity set to minor

test

comment:4 Changed 6 years ago by markdixon

On Thu, 24 Jan 2013, Dave Love wrote:
...

> Yes, will do. Thanks and well spotted! The days field in the
> human-readable output had never registered; changing that will be a
> minor incompatibility, but will make it agree with the doc.

...

I'm happy to prepare the patch for (2) and check that it fixes this bug, if that's helpful?

Mark

comment:5 Changed 6 years ago by Dave Love <d.love@…>

  • Owner set to Dave Love <d.love@…>
  • Resolution set to fixed
  • Status changed from new to closed

In 4437/sge:

Fix #788: Stop double_print_time_to_dstring printing 4-value time strings
Thanks to Mark Dixon.
Possible incompatibility due to changes in output of qstat etc. to
agree with sge_types.
