[GE users] intensive job

Reuti reuti at staff.uni-marburg.de
Sun Oct 26 21:15:53 GMT 2008


On 26.10.2008 at 20:57, Mag Gam wrote:

> Reuti:
>
> You are right! I did have a memory limit. I removed it and his
> application works! Thank you very much.
>
> Since these are intensive processes, we want to run only one process
> per host. To be safe, we could even wait for one process to complete
> before submitting the next subtask. Is it possible to do that?

There are two options:

-) If you have only this type of job, you could define the queue with
just one slot per machine (entry "slots" in the queue definition).
This way all jobs can be submitted at once, but on each machine they
will start only one after another.
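
For illustration (assuming your queue is named myq), you would edit
the queue with:

qconf -mq myq

and set:

slots                 1

so that each execution host offers exactly one slot in this queue.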

-) If other jobs should also run there: make virtual_free or h_vmem
consumable and request the proper amount, like I mentioned in my first
reply. When the memory on a host is used up, no further jobs will be
scheduled there. All jobs must request either virtual_free or h_vmem,
so you will have to define a sensible default for it in the complex
configuration (qconf -mc).
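
For illustration, the h_vmem entry in the complex configuration
(qconf -mc) could look like this, with a default of e.g. 2G (adjust
to your site):

#name    shortcut  type    relop requestable consumable default urgency
h_vmem   h_vmem    MEMORY  <=    YES         YES        2G      0

You would also attach the installed memory to each execution host
(qconf -me node01, with node01 being an example hostname):

complex_values        h_vmem=128G

A job submitted with "qsub -l h_vmem=40g ..." then subtracts 40G from
the host's bookkeeping; jobs without a request consume the 2G default.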

-- Reuti


> I am asking this because I am getting random out-of-memory messages.
>
> On Sun, Oct 26, 2008 at 1:44 PM, Reuti <reuti at staff.uni-marburg.de>  
> wrote:
>> On 26.10.2008 at 18:08, Mag Gam wrote:
>>
>>> I am certain I don't have any quotas regarding this.
>>>
>>>
>>> qconf -srqs
>>> {
>>>   name         cpu_limit
>>>   description  NONE
>>>   enabled      TRUE
>>>   limit        users mathprof to slots=8
>>> }
>>
>> Not the resource quotas, the queue configuration (qconf -sq myq).
>> But it seems that there are some limits defined, as stack and
>> virtual memory are set to 15G.
>>
>> Only the soft limits are in effect, so what does an interactive
>> "ulimit -aS" show in addition?
>>
>> The user is only allowed to change the limit in effect (i.e. the
>> soft limit) between the hard limit and zero. He can also lower the
>> hard limit. But once it's lowered, it can't be raised again (unless
>> root is executing these commands).
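>>
>> For illustration (arbitrary example values; -v is the virtual
>> memory limit in kB):
>>
>> ulimit -Hv 2097152   # lower the hard limit to 2 GB
>> ulimit -Sv 1048576   # the soft limit can be set anywhere up to the hard limit
>> ulimit -Sv 2097152   # raising the soft limit back to the hard limit works
>> ulimit -Hv 4194304   # fails for a normal user: the hard limit can't be raised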
>>
>> -- Reuti
>>
>>
>>>
>>>
>>> Here is the output from the job (first the hard limits, then the
>>> soft limits):
>>>
>>> core file size          (blocks, -c) unlimited
>>> data seg size           (kbytes, -d) unlimited
>>> scheduling priority             (-e) 0
>>> file size               (blocks, -f) unlimited
>>> pending signals                 (-i) 530431
>>> max locked memory       (kbytes, -l) 32
>>> max memory size         (kbytes, -m) unlimited
>>> open files                      (-n) 1024
>>> pipe size            (512 bytes, -p) 8
>>> POSIX message queues     (bytes, -q) 819200
>>> real-time priority              (-r) 0
>>> stack size              (kbytes, -s) unlimited
>>> cpu time               (seconds, -t) unlimited
>>> max user processes              (-u) 530431
>>> virtual memory          (kbytes, -v) unlimited
>>> file locks                      (-x) unlimited
>>>
>>> core file size          (blocks, -c) 0
>>> data seg size           (kbytes, -d) 15625000
>>> scheduling priority             (-e) 0
>>> file size               (blocks, -f) unlimited
>>> pending signals                 (-i) 530431
>>> max locked memory       (kbytes, -l) 32
>>> max memory size         (kbytes, -m) unlimited
>>> open files                      (-n) 1024
>>> pipe size            (512 bytes, -p) 8
>>> POSIX message queues     (bytes, -q) 819200
>>> real-time priority              (-r) 0
>>> stack size              (kbytes, -s) 15625000
>>> cpu time               (seconds, -t) unlimited
>>> max user processes              (-u) 530431
>>> virtual memory          (kbytes, -v) 15625000
>>> file locks                      (-x) unlimited
>>>
>>>
>>> See anything else?
>>>
>>>
>>> On Sun, Oct 26, 2008 at 12:37 PM, Reuti <reuti at staff.uni-marburg.de>
>>> wrote:
>>>>
>>>> On 26.10.2008 at 16:16, Mag Gam wrote:
>>>>
>>>>> Thanks Reuti as usual!
>>>>>
>>>>> I have run into this problem now. My Java application is giving
>>>>> me this error:
>>>>>
>>>>> Error occurred during initialization of VM
>>>>> Could not reserve enough space for object heap
>>>>>
>>>>> All of the servers have plenty of free memory, so there is no
>>>>> memory contention.
>>>>>
>>>>> I am submitting the job as qsub script.sh (without any -l options)
>>>>>
>>>>> However, if I run it via ssh I get the correct results. I am  
>>>>> not sure
>>>>> why I am getting this error.
>>>>>
>>>>> I tried to look into this, and it seems you gave some replies
>>>>> there, but they still didn't help :-(
>>>>>
>>>>>
>>>>>
>>>>> http://fossplanet.com/clustering.gridengine.users/message-1123088-strange-consequence-changing-n1ge/
>>>>
>>>> Mag,
>>>>
>>>> this could well be related. Can you please post your queue
>>>> configuration - did you define any limits there?
>>>>
>>>> Another hint would be to submit a job that lists the limits in
>>>> effect inside the job, i.e.:
>>>>
>>>> #!/bin/sh
>>>> ulimit -aH
>>>> echo
>>>> ulimit -aS
>>>>
>>>> -- Reuti
>>>>
>>>>>
>>>>> Any ideas?
>>>>>
>>>>>
>>>>> On Sun, Oct 26, 2008 at 9:57 AM, Reuti
>>>>> <reuti at staff.uni-marburg.de> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 26.10.2008 at 14:10, Mag Gam wrote:
>>>>>>
>>>>>>> Hello Reuti:
>>>>>>>
>>>>>>> Would it help if I started at 10 instead of 1?
>>>>>>
>>>>>> sure, in this case you would just need the files *.10 to *.19
>>>>>> if you want to avoid computing canonical names for *.01 to *.10.
>>>>>>
>>>>>> qsub -t 10-19 ...
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> #!/bin/sh
>>>>>>> echo "I'm $SGE_TASK_ID and will read 10000.$SGE_TASK_ID to  
>>>>>>> produce
>>>>>>> out.$SGE_TASK_ID"
>>>>>>> sleep 60
>>>>>>> exit 0
>>>>>>>
>>>>>>> and start it with:
>>>>>>> qsub -t 10 script.sh
>>>>>>>
>>>>>>> Works.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Oct 25, 2008 at 1:30 PM, Reuti
>>>>>>> <reuti at staff.uni-marburg.de> wrote:
>>>>>>>>
>>>>>>>> On 25.10.2008 at 16:20, Mag Gam wrote:
>>>>>>>>
>>>>>>>>> Reuti:
>>>>>>>>>
>>>>>>>>> As usual, thank you! This is very helpful, but perhaps I
>>>>>>>>> should back up a little.
>>>>>>>>>
>>>>>>>>> "qsub -l virtual_free=40g" does that reserve space or does  
>>>>>>>>> it wait
>>>>>>>>> for
>>>>>>>>> that space?
>>>>>>>>
>>>>>>>> As long as only SGE jobs are running on the machines: both.
>>>>>>>>
>>>>>>>>> Also, what if a (non-grid) user is using the servers? I
>>>>>>>>> assume SGE will not account for that, or will it?
>>>>>>>>
>>>>>>>> This is always unpredictable. Can you force your interactive
>>>>>>>> users to go through SGE by requesting an interactive job? Then
>>>>>>>> you would need h_vmem instead of virtual_free to enforce the
>>>>>>>> limits for both types of jobs.
>>>>>>>>
>>>>>>>>> My intention is this:
>>>>>>>>> I have a file with 1000000 records.
>>>>>>>>>
>>>>>>>>> I split it into 10 blocks
>>>>>>>>> 100000.a
>>>>>>>>> 100000.b
>>>>>>>>> 100000.c
>>>>>>>>> ....
>>>>>>>>> 100000.j
>>>>>>>>
>>>>>>>> when you have split them already, you will need to rename  
>>>>>>>> them to
>>>>>>>> 100000.1
>>>>>>>> ... 100000.10
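>>>>>>>>
>>>>>>>> A minimal sketch of the renaming (assuming the ten
>>>>>>>> letter-suffixed files are in the current directory):
>>>>>>>>
>>>>>>>> i=1
>>>>>>>> for f in 100000.[a-j]; do
>>>>>>>>   mv "$f" "100000.$i"
>>>>>>>>   i=$((i+1))
>>>>>>>> done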
>>>>>>>>
>>>>>>>>> I also have a wrapper script like this.
>>>>>>>>>
>>>>>>>>> #!/bin/ksh
>>>>>>>>> #wrapper script -- wrapper.sh <filename>
>>>>>>>>> #$ -cwd
>>>>>>>>> #$ -V
>>>>>>>>> #$ -N fluid
>>>>>>>>> #$ -S /bin/ksh
>>>>>>>>>
>>>>>>>>> file=$1
>>>>>>>>> cat "$file" | java -Xmx40000m fluid0 > out.$SGE_TASK_ID.dat
>>>>>>>>>
>>>>>>>>> I invoke the script like this:
>>>>>>>>> qsub -l virtual_free=40g ./wrapper.sh 10000.a
>>>>>>>>> qsub -l virtual_free=40g ./wrapper.sh 10000.b
>>>>>>>>> ...
>>>>>>>>> qsub -l virtual_free=40g ./wrapper.sh 10000.j
>>>>>>>>
>>>>>>>> Please try first a simple job, to see how array jobs are  
>>>>>>>> handled:
>>>>>>>>
>>>>>>>> #!/bin/sh
>>>>>>>> echo "I'm $SGE_TASK_ID and will read 10000.$SGE_TASK_ID to  
>>>>>>>> produce
>>>>>>>> out.$SGE_TASK_ID"
>>>>>>>> sleep 60
>>>>>>>> exit 0
>>>>>>>>
>>>>>>>> and start it with:
>>>>>>>>
>>>>>>>> qsub -t 10 script.sh
>>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have tried to use the -t option for an array job, but it  
>>>>>>>>> was not
>>>>>>>>> working for some reason.
>>>>>>>>>
>>>>>>>>> Any thoughts about this method?
>>>>>>>>>
>>>>>>>>> TIA
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Oct 25, 2008 at 7:14 AM, Reuti
>>>>>>>>> <reuti at staff.uni-marburg.de> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Mag,
>>>>>>>>>>
>>>>>>>>>> On 25.10.2008 at 02:40, Mag Gam wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello All.
>>>>>>>>>>>
>>>>>>>>>>> We have a professor who is notorious for bringing down our
>>>>>>>>>>> engineering grid (64 servers) with his direct numerical
>>>>>>>>>>> simulations. He basically runs a Java program with
>>>>>>>>>>> -Xmx40000m (40 GB). This preallocates 40 GB of memory and
>>>>>>>>>>> then crashes the box because there
>>>>>>>>>>
>>>>>>>>>> this looks more like you have to set up SGE to manage the
>>>>>>>>>> memory, request the necessary amount of memory for the job,
>>>>>>>>>> and submit it with "qsub -l virtual_free=40g ..."
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=15079
>>>>>>>>>>
>>>>>>>>>>> are other processes running on the box. Each box has 128G
>>>>>>>>>>> of physical memory. He runs the application like this:
>>>>>>>>>>> cat series | java -Xmx40000m fluid0 > out.dat
>>>>>>>>>>>
>>>>>>>>>>> the "series" file has over 10 million records.
>>>>>>>>>>>
>>>>>>>>>>> I was thinking of something like this: split the 10 million
>>>>>>>>>>> records into 10 files (each file with 1 million records),
>>>>>>>>>>> submit 10 array jobs, and then output to out.dat. But the
>>>>>>>>>>> order for 'out.dat' matters! I would like to run these 10
>>>>>>>>>>> jobs independently, but how can I maintain order? Or is
>>>>>>>>>>> there a better way to do this?
>>>>>>>>>>>
>>>>>>>>>>> Letting him submit his current job as-is would not be
>>>>>>>>>>> wise...
>>>>>>>>>>
>>>>>>>>>> You mean: one array job with 10 tasks - right? So "qsub -t
>>>>>>>>>> 1-10 my_job".
>>>>>>>>>>
>>>>>>>>>> In each job script you can use something like the following
>>>>>>>>>> (note the +1 to handle the usual off-by-one problem at the
>>>>>>>>>> beginning of each block):
>>>>>>>>>>
>>>>>>>>>> sed -n -e "$[(SGE_TASK_ID-1)*1000000+1],$[SGE_TASK_ID*1000000]p" series | java -Xmx40000m fluid0 > out${SGE_TASK_ID}.dat
>>>>>>>>>>
>>>>>>>>>> hence each task of the array job reads only the necessary
>>>>>>>>>> lines of the input file and creates a unique output file.
>>>>>>>>>> Also, maybe it's not necessary to concatenate the output
>>>>>>>>>> files into one file, as you can sometimes use a construct
>>>>>>>>>> like:
>>>>>>>>>>
>>>>>>>>>> cat out*.dat | my_pgm
>>>>>>>>>>
>>>>>>>>>> for further processing. With more than 9 tasks this would
>>>>>>>>>> lead to the wrong order 1, 10, 2, 3, ..., so you need a
>>>>>>>>>> variant of the above command:
>>>>>>>>>>
>>>>>>>>>> sed -n -e "$[(SGE_TASK_ID-1)*1000000+1],$[SGE_TASK_ID*1000000]p" series | java -Xmx40000m fluid0 > out$(printf "%02d" $SGE_TASK_ID).dat
>>>>>>>>>>
>>>>>>>>>> for having leading zeros for the index in the name of the
>>>>>>>>>> output file.
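>>>>>>>>>>
>>>>>>>>>> With the zero-padded names the shell glob will then expand
>>>>>>>>>> in the correct order, e.g.:
>>>>>>>>>>
>>>>>>>>>> cat out??.dat | my_pgm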
>>>>>>>>>>
>>>>>>>>>> -- Reuti