[GE users] V6 scheduler woes

McCalla, Mac macmccalla at hess.com
Thu May 12 15:53:38 BST 2005


Attached are the output of the qconf -ssconf command and sections of the
schedd/messages and qmaster messages files covering the period from
about 08:00 to 08:30 this morning.

During this period, a top command run on the qmaster system showed:

  PID USER     PRI NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME
26563 sgeadmin  25  0  459M 428M  1748 R    24.9 11.0  619:
26560 sgeadmin  25  0 2958M 2.8G  2028 R    13.2 73.1  282:
26559 sgeadmin  25  0 2958M 2.8G  2028 S    13.8 73.1  253:

This is a dual Xeon 3.2 GHz system with HT on and 4GB of memory, running
Linux RHEL3 update 4.  %CPU maxes out at 25% for a single process.
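
The 25% ceiling is what one would expect if top on this system reports each process's CPU use as a share of all logical CPUs: two Xeons with HT on present four logical CPUs, so a single thread that saturates one of them shows 100/4 = 25%. A minimal Python sketch of that arithmetic follows; the normalization is an assumption about this top build, and the function name is made up for illustration.

```python
def max_cpu_percent_per_thread(physical_cpus: int, threads_per_cpu: int) -> float:
    """Ceiling top would show for one busy thread, ASSUMING %CPU is
    normalized over all logical CPUs (not verified for this top build)."""
    logical_cpus = physical_cpus * threads_per_cpu
    return 100.0 / logical_cpus

# Dual Xeon with HT on: 2 physical CPUs x 2 hardware threads each.
print(max_cpu_percent_per_thread(2, 2))  # 25.0
```

Under that assumption, the 24.9% reading for PID 26563 would mean the process is effectively pinned on one logical CPU.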

Overall, what I have been trying to do is simply duplicate the way our
SGE v5.3p6 system worked from a scheduling viewpoint, with user_sort set
to true.

Regards,

Mac McCalla  

-----Original Message-----
From: Stephan Grell - Sun Germany - SSG - Software Engineer
[mailto:stephan.grell at sun.com] 
Sent: Thursday, May 12, 2005 3:17 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] V6 scheduler woes

Hello,

I am still having trouble understanding the issue. Could you please post
your scheduling configuration and run the scheduler with profiling on?
Both the scheduler configuration and the profiling output would be very
useful for understanding what is going on.

You can enable profiling on the fly without restarting anything. Its
output will be written to the scheduler messages file.
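
For readers following along: as far as the sched_conf(5) man page describes it, profiling is switched on by adding PROFILE=1 to the params line of the scheduler configuration (edited with qconf -msconf). The Python sketch below parses text in the "name value" layout that qconf -ssconf prints and checks that setting together with the two flush values. The helper names are hypothetical, and the sample values are illustrative only; they are not the configuration attached to this thread.

```python
def parse_sched_conf(text: str) -> dict:
    """Parse 'name value' lines, as printed by qconf -ssconf, into a dict."""
    conf = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf

def profiling_enabled(conf: dict) -> bool:
    """True if the params line contains PROFILE=1 (case-insensitive)."""
    params = conf.get("params", "none").replace(",", " ").split()
    return any(p.lower() == "profile=1" for p in params)

# Illustrative values only; NOT the configuration attached to this thread.
sample = """\
schedule_interval   0:0:15
flush_submit_sec    0
flush_finish_sec    0
params              profile=1
"""
conf = parse_sched_conf(sample)
print(profiling_enabled(conf))                              # True
print(conf["flush_submit_sec"], conf["flush_finish_sec"])   # 0 0
```

On a live system the input text could come from running qconf -ssconf and capturing its standard output.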

Thanks,
Stephan

McCalla, Mac wrote:

> There are about 23000 jobs pending.  About 650 jobs have finished in
> the last 30 minutes.
> The idea that setting the flush* variables to zero causes the scheduler
> to run continuously originally came from the v5.3 man page sge_conf.
> Perhaps this has impeded the understanding of the settings in v6 for
> me....8~)
>
> I will open an issue if I duplicate the problem.
>
> Thanks
>
>-----Original Message-----
>From: Reuti [mailto:reuti at staff.uni-marburg.de] 
>Sent: Wednesday, May 11, 2005 2:57 PM
>To: users at gridengine.sunsource.net
>Subject: RE: [GE users] V6 scheduler woes
>
>Hi,
>
>Quoting "McCalla, Mac" <macmccalla at hess.com>:
>
>>Thanks for replying Reuti,
>>
>>I believe the setting of the 2 flush_* variables to zero causes the
>>scheduler to run continuously.  This was the intent since it is running
>
>This shouldn't be; it should behave as the documentation says: run x
>seconds after the events (submit/end of job), or not be triggered by
>these events when set to 0. Are there many jobs submitted and finishing?
>
>>on a dedicated server.  This setting, however, in turn apparently causes
>>the output from qconf -tsm to not terminate, a situation that might have
>>dire consequences if it fills up the file system.  BTW, cycling the
>>qmaster/schedd did have the desired effect of stopping the output from
>>qconf -tsm.  If this is expected behavior from qconf -tsm, then fine,
>
>No, it should trace one run of the scheduler. If you see it running
>forever, I would say it's an issue. - Reuti
>
>>this is just a cautionary note.  If not, I will be glad to open it as a
>>problem.
>>
>>Regards,
>>
>>Mac   
>>
>>-----Original Message-----
>>From: Reuti [mailto:reuti at staff.uni-marburg.de] 
>>Sent: Wednesday, May 11, 2005 1:11 PM
>>To: users at gridengine.sunsource.net
>>Subject: Re: [GE users] V6 scheduler woes
>>
>>Hi,
>>
>>I don't get the point you want to achieve. The details for
>>"flush_submit_sec" and "flush_finish_sec" are explained in
>>"man sched_conf"; they trigger a scheduler run after these events.
>>
>>CU- Reuti
>>
>>
>>Quoting "McCalla, Mac" <macmccalla at hess.com>:
>>
>>>Hello all,
>>>
>>>I am running v6.0u4beta, downloaded and built Apr 26 for RHEL 3
>>>update 4 Linux.  In trying to diagnose why some user jobs were not
>>>being scheduled, I issued the qconf -tsm command.  I also had
>>>flush_submit_sec and flush_finish_sec set to 0.  The result appears to
>>>be that the schedd_runlog file has been written to continuously for
>>>the last 1.5 hours.  I have set flush_submit_sec and flush_finish_sec
>>>to 30 now, but without apparent effect.  Does anyone know how to turn
>>>this off short of bouncing the qmaster (assuming that will do it)?
>>>Thanks.
>>> 
>>>Mac McCalla 
>>>Geoscience Systems Consultant
>>>Amerada Hess Corporation
>>>500 Dallas St. , Houston, Texas  77002
>>>
>>>
>>>      
>>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
>>
>>
>>    
>>
>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
>
>  
>
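
Reuti's description of flush_submit_sec and flush_finish_sec in the quoted thread (a run is triggered x seconds after a submit or finish event, and a value of 0 means the event triggers nothing) can be sketched as a small timing model. This is a simplified illustration in Python, not Grid Engine's actual scheduler logic; the function name and the single-event model are assumptions made for the example.

```python
from typing import Optional

def next_scheduler_run(last_run: float, schedule_interval: float,
                       flush_sec: int, event_time: Optional[float]) -> float:
    """Time of the next scheduling run under a simplified model.

    - A run always happens schedule_interval seconds after last_run.
    - If flush_sec > 0 and an event (submit or job end) occurred, a run
      is also triggered flush_sec seconds after that event.
    - flush_sec == 0 means events do not trigger a run at all.
    """
    periodic = last_run + schedule_interval
    if flush_sec > 0 and event_time is not None:
        return min(periodic, event_time + flush_sec)
    return periodic

# Last run at t=0 s, 15 s interval, job submitted at t=2 s.
print(next_scheduler_run(0, 15, 0, 2))  # 15: flush disabled, wait for the interval
print(next_scheduler_run(0, 15, 5, 2))  # 7: run 5 s after the submit event
```

In this model a setting of 0 never makes the scheduler run more often than schedule_interval, which matches Reuti's point that 0 should not cause continuous scheduling.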


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




    [ Part 2, "ssconf.log.gz"  Application/X-GZIP (Name: "ssconf.log.gz") ]
    [ 537 bytes. ]
    [ Unable to print this part. ]


    [ Part 3, "qmaster_messages.log.gz"  Application/X-GZIP (Name: ]
    [ "qmaster_messages.log.gz") 11 KB. ]
    [ Unable to print this part. ]


    [ Part 4, "schedd_messages.log.gz"  Application/X-GZIP (Name: ]
    [ "schedd_messages.log.gz") 1.8 KB. ]
    [ Unable to print this part. ]


    [ Part 5: "Attached Text" ]



