[GE users] intensive job

Mag Gam magawake at gmail.com
Sun Oct 26 19:57:25 GMT 2008



Reuti:

You are right! I did have a memory limit. I removed it and his
application works! Thank you very much.

Since these are memory-intensive processes, we want to run only one
process per host. To be safe, we could even wait for each process to
complete before submitting the next subtask. Is it possible to do that?

I am asking because I am getting intermittent out-of-memory messages.
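One way to get the wait-for-completion behaviour asked about above is to chain the submissions with job dependencies. A sketch (the script and file names are placeholders; -terse and -hold_jid are standard SGE qsub options, and the memory request follows Reuti's earlier virtual_free suggestion):

```shell
# Submit the chunks as a chain: each job holds until the previous one
# has finished. "qsub -terse" prints only the job id, which the next
# submission waits on via -hold_jid.
jid=$(qsub -terse -l virtual_free=40g ./wrapper.sh 10000.1)
for i in 2 3 4 5 6 7 8 9 10; do
  jid=$(qsub -terse -hold_jid "$jid" -l virtual_free=40g ./wrapper.sh 10000.$i)
done
```

Alternatively, limiting each host's queue to a single slot (slots 1 in the queue configuration) keeps it to one job per host while still running chunks in parallel across hosts.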







On Sun, Oct 26, 2008 at 1:44 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Am 26.10.2008 um 18:08 schrieb Mag Gam:
>
>> I am certain I don't have any quotas regarding this.
>>
>>
>> qconf -srqs
>> {
>>   name         cpu_limit
>>   description  NONE
>>   enabled      TRUE
>>   limit        users mathprof to slots=8
>> }
>
> Not the resource quotas, the queue configuration (qconf -sq myq). But it
> seems that some limits are defined, as stack size and virtual memory are
> set to 15G.
>
> Only the soft limits are in effect; so what does an interactive "ulimit
> -aS" show in addition?
>
> The user is only allowed to change the limits in effect (i.e. the
> soft limit) between zero and the hard limit. He can also lower the
> hard limit, but once it's lowered, it can't be raised again (unless root is
> executing these commands).
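The soft/hard behaviour described above can be checked interactively. A small sketch (bash, using the open-files limit -n; the values 512 and 1024 are arbitrary and assume the hard limit is at least 1024):

```shell
# A subshell may lower its soft limit and raise it again later, as long
# as it stays at or below the hard limit (shown here for open files, -n).
bash -c '
  echo "hard: $(ulimit -Hn)"
  ulimit -S -n 512            # lower the soft limit
  echo "soft now: $(ulimit -Sn)"
  ulimit -S -n 1024           # raise it again, still <= hard limit
  echo "soft back: $(ulimit -Sn)"
'
```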
>
> -- Reuti
>
>
>>
>>
>> Here is the output for the job (ulimit -aH first, then ulimit -aS):
>>
>> core file size          (blocks, -c) unlimited
>> data seg size           (kbytes, -d) unlimited
>> scheduling priority             (-e) 0
>> file size               (blocks, -f) unlimited
>> pending signals                 (-i) 530431
>> max locked memory       (kbytes, -l) 32
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 1024
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> real-time priority              (-r) 0
>> stack size              (kbytes, -s) unlimited
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) 530431
>> virtual memory          (kbytes, -v) unlimited
>> file locks                      (-x) unlimited
>>
>> core file size          (blocks, -c) 0
>> data seg size           (kbytes, -d) 15625000
>> scheduling priority             (-e) 0
>> file size               (blocks, -f) unlimited
>> pending signals                 (-i) 530431
>> max locked memory       (kbytes, -l) 32
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 1024
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> real-time priority              (-r) 0
>> stack size              (kbytes, -s) 15625000
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) 530431
>> virtual memory          (kbytes, -v) 15625000
>> file locks                      (-x) unlimited
>>
>>
>> See anything else?
>>
>>
>> On Sun, Oct 26, 2008 at 12:37 PM, Reuti <reuti at staff.uni-marburg.de>
>> wrote:
>>>
>>> Am 26.10.2008 um 16:16 schrieb Mag Gam:
>>>
>>>> Thanks Reuti as usual!
>>>>
>>>> I have now come across this problem. My java application is giving me this
>>>> error:
>>>>
>>>> Error occurred during initialization of VM
>>>> Could not reserve enough space for object heap
>>>>
>>>> All of the servers have plenty of free memory, so there is no memory contention.
>>>>
>>>> I am submitting the job as qsub script.sh (without any -l options)
>>>>
>>>> However, if I run it via ssh I get the correct results. I am not sure
>>>> why I am getting this error.
>>>>
>>>> I tried looking into this, and it seems you gave some replies here,
>>>> but they were still not helpful :-(
>>>>
>>>>
>>>>
>>>> http://fossplanet.com/clustering.gridengine.users/message-1123088-strange-consequence-changing-n1ge/
>>>
>>> Mag,
>>>
>>> this can really be related. Can you please post your queue configuration
>>> - did you define any limits there?
>>>
>>> Another hint would be to submit a job that lists its own limits from
>>> inside, i.e.:
>>>
>>> #!/bin/sh
>>> ulimit -aH
>>> echo
>>> ulimit -aS
>>>
>>> -- Reuti
>>>
>>>>
>>>> Any ideas?
>>>>
>>>>
>>>> On Sun, Oct 26, 2008 at 9:57 AM, Reuti <reuti at staff.uni-marburg.de>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Am 26.10.2008 um 14:10 schrieb Mag Gam:
>>>>>
>>>>>> Hello Reuti:
>>>>>>
>>>>>> Would it help if I started at 10 instead of 1?
>>>>>
>>>>> Sure, in this case you would just need the files *.10 to *.19, if you
>>>>> want to avoid computing zero-padded names for *.01 to *.10.
>>>>>
>>>>> qsub -t 10-19 ...
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> #!/bin/sh
>>>>>> echo "I'm $SGE_TASK_ID and will read 10000.$SGE_TASK_ID to produce
>>>>>> out.$SGE_TASK_ID"
>>>>>> sleep 60
>>>>>> exit 0
>>>>>>
>>>>>> and start it with:
>>>>>> qsub -t 10 script.sh
>>>>>>
>>>>>> Works.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Oct 25, 2008 at 1:30 PM, Reuti <reuti at staff.uni-marburg.de>
>>>>>> wrote:
>>>>>>>
>>>>>>> Am 25.10.2008 um 16:20 schrieb Mag Gam:
>>>>>>>
>>>>>>>> Reuti:
>>>>>>>>
>>>>>>>> As usual, thank you! This is very helpful, but perhaps I should back up
>>>>>>>> a little.
>>>>>>>>
>>>>>>>> "qsub -l virtual_free=40g" - does that reserve the space, or does it
>>>>>>>> wait for that space to become available?
>>>>>>>
>>>>>>> As long as only SGE jobs are running: both.
>>>>>>>
>>>>>>>> Also, what if a user (outside the grid) is using the servers? I
>>>>>>>> assume SGE will not account for that, or will it?
>>>>>>>
>>>>>>> This is always unpredictable. Can you force your interactive users to
>>>>>>> go through SGE by requesting an interactive job? Then you would need
>>>>>>> h_vmem instead of virtual_free to enforce the limits for both types of
>>>>>>> jobs.
>>>>>>>
>>>>>>>> My intention is this:
>>>>>>>> I have a 1000000-record file
>>>>>>>>
>>>>>>>> I split it into 10 blocks
>>>>>>>> 100000.a
>>>>>>>> 100000.b
>>>>>>>> 100000.c
>>>>>>>> ....
>>>>>>>> 100000.j
>>>>>>>
>>>>>>> When you have split them already, you will need to rename them to
>>>>>>> 100000.1 ... 100000.10.
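The renaming can be scripted. A sketch (the scratch directory and file names here are made up for the demo):

```shell
# Create the ten letter-suffixed chunks in a scratch directory, then
# rename 100000.a .. 100000.j to 100000.1 .. 100000.10 so that the
# suffix matches $SGE_TASK_ID.
mkdir -p /tmp/splitdemo && cd /tmp/splitdemo
for l in a b c d e f g h i j; do touch "100000.$l"; done
i=1
for l in a b c d e f g h i j; do
  mv "100000.$l" "100000.$i"
  i=$((i+1))
done
ls 100000.*
```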
>>>>>>>
>>>>>>>> I also have a wrapper script like this.
>>>>>>>>
>>>>>>>> #!/bin/ksh
>>>>>>>> #wrapper script -- wrapper.sh <filename>
>>>>>>>> #$ -cwd
>>>>>>>> #$ -V
>>>>>>>> #$ -N fluid
>>>>>>>> #$ -S /bin/ksh
>>>>>>>>
>>>>>>>> file=$1
>>>>>>>> cat "$file" | java -Xmx40000m fluid0 > out.$SGE_TASK_ID.dat
>>>>>>>>
>>>>>>>> I invoke the script like this:
>>>>>>>> qsub -l virtual_free=40g ./wrapper.sh 10000.a
>>>>>>>> qsub -l virtual_free=40g ./wrapper.sh 10000.b
>>>>>>>> ...
>>>>>>>> qsub -l virtual_free=40g ./wrapper.sh 10000.j
>>>>>>>
>>>>>>> Please try first a simple job, to see how array jobs are handled:
>>>>>>>
>>>>>>> #!/bin/sh
>>>>>>> echo "I'm $SGE_TASK_ID and will read 10000.$SGE_TASK_ID to produce
>>>>>>> out.$SGE_TASK_ID"
>>>>>>> sleep 60
>>>>>>> exit 0
>>>>>>>
>>>>>>> and start it with:
>>>>>>>
>>>>>>> qsub -t 10 script.sh
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I have tried to use the -t option for an array job, but it was not
>>>>>>>> working for some reason.
>>>>>>>>
>>>>>>>> Any thoughts about this method?
>>>>>>>>
>>>>>>>> TIA
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Oct 25, 2008 at 7:14 AM, Reuti <reuti at staff.uni-marburg.de>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Mag,
>>>>>>>>>
>>>>>>>>> Am 25.10.2008 um 02:40 schrieb Mag Gam:
>>>>>>>>>
>>>>>>>>>> Hello All.
>>>>>>>>>>
>>>>>>>>>> We have a professor who is notorious for bringing down our
>>>>>>>>>> engineering GRID (64 servers) with his direct numerical simulations.
>>>>>>>>>> He basically runs a Java program with -Xmx40000m (40 gigs). This
>>>>>>>>>> preallocates 40 gigs of memory and then crashes the box because
>>>>>>>>>> there
>>>>>>>>>
>>>>>>>>> this looks more like you have to set up SGE to manage the memory:
>>>>>>>>> request the necessary amount of memory for the job and submit it
>>>>>>>>> with "qsub -l virtual_free=40g ..."
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=15079
>>>>>>>>>
>>>>>>>>>> are other processes running on the box. Each box has 128G of physical
>>>>>>>>>> memory. He runs the application like this:
>>>>>>>>>> cat series | java -Xmx40000m fluid0 > out.dat
>>>>>>>>>>
>>>>>>>>>> the "series" file has over 10 million records.
>>>>>>>>>>
>>>>>>>>>> I was thinking of something like this: split the 10 million
>>>>>>>>>> records
>>>>>>>>>> into 10 files (each file has 1 million record), submit 10 array
>>>>>>>>>> jobs,
>>>>>>>>>> and then output to out.dat. But the order for 'out.dat' matters! I
>>>>>>>>>> would like to run these 10 jobs independently, but how can I
>>>>>>>>>> maintain
>>>>>>>>>> order?  Or is there a better way to do this?
>>>>>>>>>>
>>>>>>>>>> Letting him submit his current job as-is would not be wise...
>>>>>>>>>
>>>>>>>>> You mean: one array job with 10 tasks - right? So "qsub -t 1-10
>>>>>>>>> my_job".
>>>>>>>>>
>>>>>>>>> In each jobscript you can then use (note the +1, so that task 1 starts
>>>>>>>>> at line 1 rather than line 0):
>>>>>>>>>
>>>>>>>>> sed -n -e "$[(SGE_TASK_ID-1)*1000000+1],$[SGE_TASK_ID*1000000]p" series |
>>>>>>>>> java -Xmx40000m fluid0 > out${SGE_TASK_ID}.dat
>>>>>>>>>
>>>>>>>>> hence each task of the array job outputs only the necessary lines of
>>>>>>>>> the input file and creates a unique output file. For the output files,
>>>>>>>>> maybe it's not even necessary to concatenate them into one file, as
>>>>>>>>> you can sometimes use a construct like:
>>>>>>>>>
>>>>>>>>> cat out*.dat | my_pgm
>>>>>>>>>
>>>>>>>>> for further processing. With more than 9 tasks this would lead to the
>>>>>>>>> wrong glob order 1, 10, 2, 3, ..., so you need a variant of the above
>>>>>>>>> command:
>>>>>>>>>
>>>>>>>>> sed -n -e "$[(SGE_TASK_ID-1)*1000000+1],$[SGE_TASK_ID*1000000]p" series |
>>>>>>>>> java -Xmx40000m fluid0 > out$(printf "%02d" $SGE_TASK_ID).dat
>>>>>>>>>
>>>>>>>>> which puts leading zeros into the task index in the name of the output
>>>>>>>>> file.
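The slicing and the zero-padded naming can be tried at toy scale without SGE by replacing $SGE_TASK_ID with a loop variable. A sketch (the scratch directory and file names are made up for the demo, and $((...)) arithmetic is used instead of the older $[...] form):

```shell
# A 100-line stand-in for "series", cut into 10 chunks of 10 lines;
# zero-padded names make the glob out*.dat sort in task order.
mkdir -p /tmp/slicedemo && cd /tmp/slicedemo
seq 1 100 > series
for SGE_TASK_ID in 1 2 3 4 5 6 7 8 9 10; do
  sed -n -e "$(( (SGE_TASK_ID - 1) * 10 + 1 )),$(( SGE_TASK_ID * 10 ))p" series \
    > "out$(printf '%02d' "$SGE_TASK_ID").dat"
done
cat out*.dat > recombined
cmp -s series recombined && echo "order preserved"
```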
>>>>>>>>>
>>>>>>>>> -- Reuti
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>>>> For additional commands, e-mail:
>>>>>>>>> users-help at gridengine.sunsource.net
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
