Custom Query (431 matches)

Results (79 - 81 of 431)

Ticket Resolution Summary Owner Reporter
#716 fixed IZ3132: Job validation behaviour changed since 6.0 / 6.1 Dave Love <d.love@…> ccaamad
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3132]

      Issue #:           3132
      Component:         gridengine
      Subcomponent:      clients
      Version:           6.2u3
      Platform:          All
      OS:                All
      Reporter:          ccaamad (ccaamad)
      CC:                None defined
      Status:            NEW
      Priority:          P3
      Resolution:
      Issue type:        DEFECT
      Target milestone:  ---
      Assigned to:       roland (roland)
      QA Contact:        roland
      URL:
      Summary:           Job validation behaviour changed since 6.0 / 6.1
      Status whiteboard:
      Attachments:

      Issue 3132 blocks:
      Votes for issue 3132:


   Opened: Fri Sep 11 08:57:00 -0700 2009 
------------------------


I'm trying to migrate from 6.0 to 6.2u3 and discovered that job validation behaviour has changed: it rejects
jobs if queues are disabled!

e.g.

$ qsub -w e test.sh
Your job 162 ("serial.sh") has been submitted
$ qmod -d '*'
$ qsub -w e test.sh
Unable to run job: error: no suitable queues.
Exiting.

Also, reuti commented:

> I can only confirm this, and IMO it is a bug, as submissions to
> calendar-disabled queues are handled the same way; hence -w e can no
> longer be used with calendars while the queue is disabled.

Many thanks,

Mark
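
A possible interim workaround (added here for illustration, not part of the report; it assumes the standard qsub -w e|w|n|v verification modes) is to relax submit-time validation so the job is accepted with a warning instead of being rejected:

$ qmod -d '*'
$ qsub -w w test.sh     # validation problems become warnings; the job is still submitted
$ qsub -w n test.sh     # or skip submit-time validation entirely

This does not restore the 6.0/6.1 semantics of -w e, but it avoids losing submissions while queues are disabled by an administrator or a calendar.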

   ------- Additional comments from petrik Mon Oct 5 03:03:28 -0700 2009 -------
Issue will be resolved as part of 3138.

   ------- Additional comments from joga Mon Oct 12 06:49:46 -0700 2009 -------
*** Issue 3138 has been marked as a duplicate of this issue. ***
#767 fixed IZ3220: exclusive host access prevents resource reservation for waiting jobs ccaamad
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3220]

      Issue #:           3220
      Component:         gridengine
      Subcomponent:      scheduling
      Version:           6.2u4
      Platform:          PC
      OS:                Linux
      Reporter:          ccaamad (ccaamad)
      CC:                None defined
      Status:            NEW
      Priority:          P2
      Resolution:
      Issue type:        DEFECT
      Target milestone:  ---
      Assigned to:       andreas (andreas)
      QA Contact:        andreas
      URL:
      Summary:           exclusive host access prevents resource reservation for waiting jobs
      Status whiteboard:
      Attachments:

      Issue 3220 blocks:
      Votes for issue 3220:


   Opened: Mon Jan 11 08:31:00 -0700 2010 
------------------------


If there are jobs running with exclusive=true set, the slots on their hosts are removed from consideration when resource reservations are computed for waiting jobs.
This makes the "exclusive" feature useless to me: I wanted to use it to pack each parallel job onto the minimum number of hosts.

Look at "qstat -g c". Add-up the numbers in the "TOTAL" column. Subtract the numbers in the "cdsuE" column. Subtract the number of slots
belonging to queue instances with a host-exclusive job in them. The number you are left with is the biggest parallel job which will have
resources reserved for it. Any bigger will be starved by any waiting smaller jobs.

e.g.

Create a test cluster with a single queue and four 8-slot exec hosts. Enable exclusive job scheduling on all hosts. For illustration
purposes, disable one of the queue instances:
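The disabling command itself is not shown in the report; something like the following is assumed, with the queue-instance name taken from the qstat output below:

$ qmod -d 'smp.q@smp4.arc1.leeds.ac.uk'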

$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
smp.q@smp1.arc1.leeds.ac.uk    BIP   0/0/8          0.00     lx24-amd64
---------------------------------------------------------------------------------
smp.q@smp2.arc1.leeds.ac.uk    BIP   0/0/8          0.00     lx24-amd64
---------------------------------------------------------------------------------
smp.q@smp3.arc1.leeds.ac.uk    BIP   0/0/8          0.00     lx24-amd64
---------------------------------------------------------------------------------
smp.q@smp4.arc1.leeds.ac.uk    BIP   0/0/8          0.00     lx24-amd64    d


Submit a 14-slot host-exclusive job, and an ordinary 1-slot job:

$ qsub -clear -cwd -l h_rt=1:0:0,exclusive=true -R y -pe mpi 14 wait.sh
Your job 45 ("wait.sh") has been submitted
$ qsub -clear -cwd -l h_rt=1:0:0 -R y wait.sh
Your job 49 ("wait.sh") has been submitted
$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
smp.q@smp1.arc1.leeds.ac.uk    BIP   0/1/8          0.00     lx24-amd64
     49 0.50500 wait.sh    issmcd       r     01/11/2010 15:02:48     1
---------------------------------------------------------------------------------
smp.q@smp2.arc1.leeds.ac.uk    BIP   0/8/8          0.00     lx24-amd64
     45 0.60500 wait.sh    issmcd       r     01/11/2010 14:59:24     8
---------------------------------------------------------------------------------
smp.q@smp3.arc1.leeds.ac.uk    BIP   0/6/8          0.00     lx24-amd64
     45 0.60500 wait.sh    issmcd       r     01/11/2010 14:59:24     6
---------------------------------------------------------------------------------
smp.q@smp4.arc1.leeds.ac.uk    BIP   0/0/8          0.00     lx24-amd64    d


Submit an 8-slot and a 9-slot job:

$ qsub -clear -cwd -l h_rt=1:0:0 -R y -pe mpi 8 wait.sh
Your job 50 ("wait.sh") has been submitted
$ qsub -clear -cwd -l h_rt=1:0:0 -R y -pe mpi 9 wait.sh
Your job 51 ("wait.sh") has been submitted

With MONITOR=true set in the scheduler configuration, I can see that only the 8-slot job has resources reserved for it; the 9-slot job
is left to starve.
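
Applying the arithmetic above to this cluster (a worked illustration using the report's own numbers): 4 hosts x 8 slots = 32 slots in total;
the disabled smp4 instance removes 8; the exclusive job 45 occupies smp2 and smp3, removing a further 16. That leaves 32 - 8 - 16 = 8, so an
8-slot reservation is the largest the scheduler will make, matching the observation that job 50 gets a reservation while the 9-slot job 51 starves.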
#795 fixed IZ3257: execd 'job exceeds job hard limit' message should include task id as well as job id. ccaamad
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3257]

      Issue #:           3257
      Component:         gridengine
      Subcomponent:      execution
      Version:           6.2u5
      Platform:          PC
      OS:                Linux
      Reporter:          ccaamad (ccaamad)
      CC:                None defined
      Status:            NEW
      Priority:          P3
      Resolution:
      Issue type:        ENHANCEMENT
      Target milestone:  ---
      Assigned to:       pollinger (pollinger)
      QA Contact:        pollinger
      URL:
      Summary:           execd 'job exceeds job hard limit' message should include task id as well as job id.
      Status whiteboard:
      Attachments:

      Issue 3257 blocks:
      Votes for issue 3257:


   Opened: Wed Mar 31 01:21:00 -0700 2010 
------------------------


Looking at the execd messages file is a valuable way to understand why a job has unexpectedly ended - in particular, messages similar to:

03/17/2010 17:02:37|  main|c3s0b11n0|W|job 10657 exceeds job hard limit "h_vmem" of queue "c3s0.q@c3s0b11n0.arc1.leeds.ac.uk" (4195127296.00000 > limit:4194304000.00000) - sending SIGKILL

However, these messages do not currently include the task id of the job, making it difficult to track down what has happened to array jobs.
As there may be several thousand tasks with the same job id, many of them running simultaneously on the same host, making it easy to parse
the logs and see what happened to them is rather useful!

Looking at the source, the following messages are defined in gridengine/source/daemons/execd/msg_execd.h, lines 217 and 218:

#define MSG_JOB_EXCEEDHLIM_USSFF      _MESSAGE(29126, _("job "sge_U32CFormat" exceeds job hard limit "SFQ" of queue "SFQ" (%8.5f > limit:%8.5f) - sending SIGKILL"))
#define MSG_JOB_EXCEEDSLIM_USSFF      _MESSAGE(29127, _("job "sge_U32CFormat" exceeds job soft limit "SFQ" of queue "SFQ" (%8.5f > limit:%8.5f) - sending SIGXCPU"))

And used in gridengine/source/daemons/execd/execd_ck_to_do.c lines 277-293.

At the point where these messages are generated, there is a "jataskid" variable in scope which looks like it holds what's needed.
Could the messages be extended to include this information, please?

Thanks,

Mark

   ------- Additional comments from reuti Wed Mar 31 03:12:21 -0700 2010 -------
For s_rt/h_rt it's already working this way. Looks like this message is created elsewhere.

   ------- Additional comments from ccaamad Wed Mar 31 03:58:14 -0700 2010 -------
That's right. h_rt/s_rt messages include the task id and are defined by lines 219/220 of msg_execd.h:

#define MSG_EXECD_EXCEEDHWALLCLOCK_UU _MESSAGE(29128, _("job "sge_U32CFormat"."sge_U32CFormat" exceeded hard wallclock time - initiate terminate method"))
#define MSG_EXECD_EXCEEDSWALLCLOCK_UU _MESSAGE(29129, _("job "sge_U32CFormat"."sge_U32CFormat" exceeded soft wallclock time - initiate soft notify method"))

And used by lines 455 and 474 of execd_ck_to_do.c.

We just need the 'exceeds job (hard|soft) limit' messages to include the task id as well. I'd include a simple patch, but I'm not yet
geared up to rebuild grid engine and I don't want to offer something that isn't tested.

This would really be a big help: some of our users submit task arrays where >95% of tasks need <1G of memory and <5% need >4G. It aids
throughput to ask them to request 1G for the job and then resubmit the tasks that fail. Changing the message would make it easier to
identify which tasks failed and why.

Thanks,

Mark
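
For reference, an untested sketch of what the extended messages might look like (the _UUSSFF names and the extra sge_U32CFormat field are an assumption modelled on the _UU wallclock macros quoted above; this is not a submitted patch):

/* msg_execd.h: add a second u32 field for the array task id, as the
   wallclock messages already do */
#define MSG_JOB_EXCEEDHLIM_UUSSFF     _MESSAGE(29126, _("job "sge_U32CFormat"."sge_U32CFormat" exceeds job hard limit "SFQ" of queue "SFQ" (%8.5f > limit:%8.5f) - sending SIGKILL"))
#define MSG_JOB_EXCEEDSLIM_UUSSFF     _MESSAGE(29127, _("job "sge_U32CFormat"."sge_U32CFormat" exceeds job soft limit "SFQ" of queue "SFQ" (%8.5f > limit:%8.5f) - sending SIGXCPU"))

The call sites in execd_ck_to_do.c would then pass the in-scope jataskid as the additional argument, the same way the wallclock messages do.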