[GE users] accounting and parallel jobs

templedf dan.templeton at sun.com
Thu Nov 12 13:21:41 GMT 2009


The other way would be to submit an MPI job that burns CPU cycles for a 
minute or two.  When it's done, check its accounting record with 
qacct -j <jobid>.  If the tight integration is working, you should see 
that it consumed more CPU time than wallclock time.
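
For example, a minimal test job could look like this (the PE name
"mpich" matches the configuration below; cpuburn stands for any small
busy-loop program you compile yourself):

    #!/bin/sh
    #$ -pe mpich 4
    #$ -cwd
    # startmpi.sh has written the granted slots to $TMPDIR/machines
    mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./cpuburn

With a working tight integration, qacct -j <jobid> should then report
a cpu value of roughly $NSLOTS times ru_wallclock.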

Daniel

reuti wrote:
> Hi,
>
> On 12.11.2009, at 13:31, mlmersel wrote:
>
>   
>> I am using the MPICH parallel libs. I followed the directions in  
>> your how-to on tight integration.
>>
>>
>> My configuration looks like this:
>>
>> pe_name           mpich
>> slots             999
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /storage/SGE-6.1u4/mpi/startmpi.sh -unique -catch_rsh \
>
> Why are you using -unique here? With an uneven distribution of slots  
> this might result in a call to the wrong node, which SGE would then  
> block in the qrsh -inherit command.
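>
> For comparison, the stock line from the how-to, without -unique and
> with your installation path, would read:
>
>     start_proc_args   /storage/SGE-6.1u4/mpi/startmpi.sh -catch_rsh $pe_hostfile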
>
>   
>>                   $pe_hostfile
>> stop_proc_args    /storage/SGE-6.1u4/mpi/stopmpi.sh
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task TRUE
>> urgency_slots     min
>>     
>
> ps -e f
>
> (that's f without a leading dash) will give you a nice tree view of  
> the running job's processes.
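>
> As an illustration (hypothetical paths and PIDs), on a slave node a
> tightly integrated job hangs below the shepherd:
>
>  1234 ?  Sl   0:02 /usr/sge/bin/lx24-amd64/sge_execd
>  4711 ?  S    0:00  \_ sge_shepherd-281 -bg
>  4712 ?  Ss   0:00      \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter ...
>  4713 ?  R    1:59          \_ ./your_app
>
> Slave processes hanging below init or sshd instead would indicate a
> loose integration.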
>
> In case something like ssh is compiled into your application, you  
> will need:
>
> export P4_RSHCOMMAND=rsh
>
> in your jobscript.
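>
> For placement, a minimal sketch of such a jobscript (./your_app and
> the slot count are placeholders):
>
>     #!/bin/sh
>     # make MPICH's ch_p4 startup use rsh, which the -catch_rsh
>     # wrapper from startmpi.sh redirects to qrsh -inherit
>     export P4_RSHCOMMAND=rsh
>     mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./your_app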
>
> -- Reuti
>
>
>   
>> How can I check whether tight integration is really in effect?
>>
>>
>>          Thank you,
>>             Jerry
>>
>>
>> reuti wrote:
>>     
>>> Hi,
>>>
>>> On 10.11.2009, at 09:50, mlmersel wrote:
>>>
>>>       
>>>> Hi Reuti:
>>>>
>>>>  I am using 6.1U4, tight integration.
>>>>         
>>> Can you be more specific? What parallel lib are you using, with  
>>> which startup method, and what did you do to achieve a tight  
>>> integration? Did you monitor the running job on the nodes to  
>>> verify that all of its processes got the additional group id  
>>> attached? Did you also check a single job with "qacct -j <id>"?
>>>
>>> -- Reuti
>>>
>>>
>>>       
>>>>                          Best,
>>>>                            Jerry
>>>>
>>>> reuti wrote:
>>>>         
>>>>> On 09.11.2009, at 12:59, mlmersel wrote:
>>>>>
>>>>>           
>>>>>> and the cpu time?
>>>>>>             
>>>>> For tightly integrated jobs you will get several entries in  
>>>>> `qacct`, unless you specify "accounting_summary TRUE" in the PE  
>>>>> configuration.
>>>>>
>>>>> The cpu value is the recorded CPU usage. This can be changed to  
>>>>> record reserved time instead (execd_params ACCT_RESERVED_USAGE  
>>>>> in `qconf -mconf`).
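>>>>>
>>>>> E.g., with a hypothetical job id 4711:
>>>>>
>>>>>    qacct -j 4711 | egrep 'hostname|ru_wallclock|cpu'
>>>>>
>>>>> with "accounting_summary FALSE" you have to sum the cpu values  
>>>>> of the master and all slave records yourself; with TRUE there is  
>>>>> only one combined record.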
>>>>>
>>>>> There was a bug in 6.2, fixed in 6.2u1, where the builtin  
>>>>> startup method killed the slaves too early and their entries  
>>>>> were missing completely from the accounting. Which version are  
>>>>> you using, and which method do you use to invoke the slaves?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>           
>>>>>> reuti wrote:
>>>>>>             
>>>>>>> On 09.11.2009, at 09:20, mlmersel wrote:
>>>>>>>
>>>>>>>               
>>>>>>>> It is tightly integrated.
>>>>>>>>
>>>>>>>> fy wrote:
>>>>>>>>                 
>>>>>>>>> Jerry
>>>>>>>>>
>>>>>>>>> Is your parallel environment tightly integrated?
>>>>>>>>> Loose integration is one reason for low CPU usage in the  
>>>>>>>>> accounting.
>>>>>>>>> see:
>>>>>>>>> http://gridengine.sunsource.net/howto/howto.html#Tight%20Integration%20of%20Parallel%20Libraries
>>>>>>>>>                   
>>>>>>> Wallclock is just the wallclock w/o multiplication by slots.
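>>>>>>>
>>>>>>> E.g., an 8-slot job running for 10 hours contributes only 10 h
>>>>>>> of wallclock although it occupied 80 core-hours, so a
>>>>>>> wallclock-based utilization understates usage by the slot
>>>>>>> count. A rough sketch of summing CPU time instead:
>>>>>>>
>>>>>>>    qacct -o -d 30
>>>>>>>
>>>>>>> lists per-owner CPU seconds for the last 30 days, which you
>>>>>>> can divide by (secs in month * cores).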
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>>> cheers
>>>>>>>>> Fred Youhanaie
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 08/11/09 13:09, mlmersel wrote:
>>>>>>>>>                   
>>>>>>>>>> Hi:
>>>>>>>>>>
>>>>>>>>>>   We have a group of users who have their own queue and run  
>>>>>>>>>> almost exclusively parallel jobs. The problem is that when  
>>>>>>>>>> I calculate the utilization per month (wall clock time /  
>>>>>>>>>> (secs in month * cores)) I get ridiculously small numbers:  
>>>>>>>>>> 1%, 2%, 3%. I know this can't be correct.
>>>>>>>>>>
>>>>>>>>>> Is there a problem with the accounting when running parallel
>>>>>>>>>> jobs?
>>>>>>>>>> I am using gridengine 6.1u4.
>>>>>>>>>>
>>>>>>>>>>                         Thanks,
>>>>>>>>>>                           Jerry
>>>>>>>>>>
