[GE users] job suspension

Viktor Oudovenko udo at physics.rutgers.edu
Thu May 22 15:56:58 BST 2008


Hi, Ravi,

As you can see in the example from my previous e-mail, the job of user2
started at 22:13 and the job of user1 was started/submitted at 2:06 the next
day. The job in the _lp queue was running, and this morning one more user
submitted a job. The status shows that the _lp job was suspended, but in fact
it was not (I see it from the queue load, which is nearly "2" instead of "1").
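
A more direct check than the queue load is to look at the process states on
the node itself. Below is a minimal Python sketch, assuming Linux nodes with
/proc and assuming the suspension is delivered as SIGSTOP (which I believe is
the default suspend method), so that a really suspended process shows state
"T". The function names are my own, not part of SGE.

```python
import os


def parse_proc_state(stat_text):
    """Return the one-letter state field from the text of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split after the LAST ')' before reading the state field.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_states(uid):
    """Map pid -> state for every process owned by uid (Linux /proc only)."""
    states = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            if os.stat(f"/proc/{entry}").st_uid != uid:
                continue
            with open(f"/proc/{entry}/stat") as fh:
                states[int(entry)] = parse_proc_state(fh.read())
        except OSError:
            continue  # process exited (or is unreadable) while scanning
    return states


def not_stopped(states):
    """Return the pids whose state is anything other than 'T' (stopped)."""
    return sorted(pid for pid, state in states.items() if state != "T")
```

Run on the compute node, not_stopped(process_states(uid)) with the uid of the
job owner lists the processes that are still runnable. For a tightly
integrated parallel job this has to be repeated on every node the job spans.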

Any other ideas?

Regards,
v



> -----Original Message-----
> From: Ravichandra.Nallan at Sun.COM [mailto:Ravichandra.Nallan at Sun.COM] 
> Sent: Thursday, May 22, 2008 2:41
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] job suspension
> 
> Hi,
>     When the situation occurs, was the job in the _lp queue
> recently submitted?
> I want to know the time between the submission of the job in the _lp
> queue and the job submission in wparallel3, because there was a
> race condition when the job in _lp is just starting up and
> the _lp queue is suspended because of a job submitted in wparallel3.
> 
> check this:
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2478
> 
> regards,
> ~Ravi
> 
> Viktor Oudovenko wrote:
> > Hello to everybody,
> >  
> > any ideas why job suspension does not work?
> >  
> > I have SGE 6.0u4 running on a dual Athlon server.
> > The job is parallel (tight integration).
> > The queue status correctly changes to "S" but the job continues to run
> > (so both jobs continue to run).
> > Please see below:
> >  
> >  185157 5.00400 mpi_p1  user1  r  05/22/2008 02:06:11  wparallel1@sub04n103     64
> >  185081 2.01388 mpi_p2  user2  S  05/21/2008 22:13:02  wparallel1_lp@sub04n103  64
> >  
> > So, the queue "wparallel3_lp" (low priority) is defined as a
> > subordinate queue of wparallel3.
> >  
> > It seems to me that when I created the "_lp" queue and tested job
> > suspension under my account, it worked on the x86 architecture and did
> > not work on Opterons, but now it does not work even on x86 machines.
> >  
> > I found information on the net that 6.0u4 has a bug where jobs are not
> > suspended after an sgemaster restart, but I have not restarted the
> > master, only the compute nodes.
> >  
> > If you have any questions, please let me know.
> > Any ideas are welcome.
> >  
> > best,
> > vic
> > p.s.
> > CLUSTER QUEUE    CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
> > -----------------------------------------------------------
> > wparallel1         1.82     64      0     64      0      0
> > wparallel1_lp      1.82     64      0     64     64      0
> >  
> >  
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 





