[GE users] Tight integration with PVM

JONATHAN SELANDER S026655 at utb.hb.se
Fri Apr 15 15:34:04 BST 2005



So what you're saying is that there should be a separate /tmp/pvmd.1000 (for instance) on each node? I thought that every node had to access the same one. It makes sense if data is supposed to be sent between the pvmds on the different nodes.

It all seems to work now. I get no error messages so far and the big for loop is putting a lot of load on the nodes.

However, the start and stop scripts had lines like "export PVM_TMP=$TMPDIR", which aren't valid in the plain Bourne /bin/sh, so I had to change them (a sketch of that change follows after the script below). Also, my tester_tight.sh looks like this:

---

#!/bin/sh

#$ -S /bin/sh

PVM_TMP=$TMPDIR
export PVM_TMP

./hello

exit 0

---
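
For reference, the change I made in the start/stop scripts was roughly the following (just a sketch; the exact lines in startpvm.sh/stoppvm.sh may look a bit different):

---

# Rejected by the classic Bourne /bin/sh (e.g. on Solaris):
#   export PVM_TMP=$TMPDIR

# Bourne-compatible replacement: assign first, then export separately.
PVM_TMP=$TMPDIR
export PVM_TMP

---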

I'm grateful you made this howto; otherwise I would've been completely lost.

-----Original Message-----
From: Reuti <reuti at staff.uni-marburg.de>
To: users at gridengine.sunsource.net
Date: Fri, 15 Apr 2005 16:18:43 +0200
Subject: Re: [GE users] Tight integration with PVM

JONATHAN SELANDER wrote:
> I compiled and installed the modified sources from the howto, and made some progress. However, I have a problem with setting $TMPDIR correctly. I'm not really sure where I should set it if it's not an environment variable. Each node starts pvmd with /tmp as TMPDIR, instead of /opt/sge/tmp, which I put here and there in the scripts.

The idea of the tight integration is to put the PVM-created files into the 
SGE-created directory as well, so they are removed after the job for sure.

What exactly do you mean by:

"/opt/sge/tmp which I put here and there in the scripts"?

The rsh-wrapper will start the daemons with their files in $TMPDIR (as set 
by SGE). If you can't use SGE's $TMPDIR for the PVM files for some reason, 
you'll have to adjust the rsh-wrapper as well. But AFAIK it need not be a 
shared directory, as the pvmd.* files only reflect the user, not the node.
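
To make that concrete, a small sketch (the uid 1000 and the directory name 
are only examples; SGE names the per-job directory after job id, task id 
and queue):

---

# on each execution host, while the job is running:
echo $TMPDIR    # e.g. /tmp/103.1.all.q, created per job and per node by sge_execd
ls $TMPDIR      # pvmd.1000 - the per-user pvmd file, local to this node

---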

I'm not sure whether I'm interpreting your statement correctly: with the 
tight integration there is no need to start any pvmd by hand. The scripts 
will start the pvmds for one job, the job runs, and all the pvmds are 
killed again after the job.

If you want to run PVM by hand for testing, you can of course set it to the 
usual /tmp (or not set it at all; /tmp is the default).
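
For such a manual test outside of SGE, a rough sketch (assuming PVM_ROOT is 
already set in your shell) would be:

---

# nothing sets PVM_TMP here, so the daemon files end up in the default /tmp
$PVM_ROOT/lib/pvm      # starts a local pvmd and drops you at the pvm> console
pvm> conf              # list the configured hosts
pvm> halt              # shut the pvmd down again when you are done

---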

CU - Reuti

> 
> -----Original Message-----
> From: Reuti <reuti at staff.uni-marburg.de>
> To: users at gridengine.sunsource.net
> Date: Fri, 15 Apr 2005 15:42:00 +0200
> Subject: Re: [GE users] Tight integration with PVM
> 
> JONATHAN SELANDER wrote:
> 
>>I just realized that the home dir of the user I ran the job as wasn't mounted on the nodes via NFS, so I have .po and .pe files on them. Their contents on the failing node are:
>>
>>--
>>
>># cat /tester_tight.sh.pe103
>>startpvm.sh: can't execute brasnod-2/lib/pvmgetarch
>>libpvm [pid2601] /tmp/pvmd.0: No such file or directory
>>libpvm [pid2601] /tmp/pvmd.0: No such file or directory
>>libpvm [pid2601]: pvm_halt(): Can't contact local daemon
>>
>>--
>>
>># cat /tester_tight.sh.po103
>>-catch_rsh /opt/sge/default/spool/brasnod-2/active_jobs/103.1/pe_hostfile brasnod-2 /opt/sge/pvm
> 
> 
> For some reason "brasnod-2" is taken as the third parameter, so it seems 
> the -catch_rsh was not removed. Is brasnod-2 using the new version of 
> startpvm.sh from the Howto? You will need the scripts and programs from 
> the Howto, not the ones included in SGE.
> 
> One thing to note: the last argument of start_proc_args must be the 
> location of pvm3 itself, not the location of the PVM scripts for SGE. If 
> you copied all the pvm3 stuff to that location for easier maintenance it 
> may be okay, but I never tried it. By default I'd put the pvm3 stuff 
> somewhere under /opt or /usr.
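> 
> As a sketch (assuming the pvm3 installation lives under /opt/pvm3; adjust 
> the path to your setup), the relevant PE lines would then read:
> 
> --
> 
> start_proc_args   /opt/sge/pvm/startpvm.sh -catch_rsh $pe_hostfile $host \
>                   /opt/pvm3
> stop_proc_args    /opt/sge/pvm/stoppvm.sh -catch_rsh $pe_hostfile $host
> 
> --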
> 
> CU - Reuti
> 
> 
>>-catch_rsh /opt/sge/default/spool/brasnod-2/active_jobs/103.1/pe_hostfile brasnod-2
>>
>>--
>>
>>-----Original Message-----
>>From: "JONATHAN SELANDER" <S026655 at utb.hb.se>
>>To: users at gridengine.sunsource.net
>>Date: Fri, 15 Apr 2005 15:13:30 +0200
>>Subject: Re: Re: [GE users] Tight integration with PVM
>>
>>The nodes have SGE_ROOT mounted over NFS. I don't have any .pe or .po output files in the home dir, which is where I ran it from. What happens is that a node fails and the job stays in state qw until I reset the queue with "qmod -cq all.q", after which the same thing happens again.
>>
>>-----Original Message-----
>>From: Reuti <reuti at staff.uni-marburg.de>
>>To: users at gridengine.sunsource.net
>>Date: Fri, 15 Apr 2005 15:07:34 +0200
>>Subject: Re: [GE users] Tight integration with PVM
>>
>>At least the .po file should exist in your home directory (or wherever 
>>you submitted the job from), as the granted nodes are listed there by the 
>>start script. Is $SGE_ROOT shared, and is the new version of the 
>>start/stop scripts available on the nodes?
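>>
>>A quick way to check after a run (just a sketch; I'm assuming the PE is 
>>requested on the command line, and the job id 103 is only an example):
>>
>>--
>>
>>qsub -pe pvm 3 tester_tight.sh
>>ls ~/tester_tight.sh.po* ~/tester_tight.sh.pe*   # e.g. tester_tight.sh.po103
>>
>>--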
>>
>>JONATHAN SELANDER wrote:
>>
>>
>>>I don't have any files like that in SGE_ROOT or the TMPDIR
>>>
>>>---
>>>
>>># cat tester_tight.sh
>>>#!/bin/sh
>>>
>>>export PVM_TMP=/opt/sge/tmp
>>
>>
>>Please use:
>>
>>export PVM_TMP=$TMPDIR
>>
>>$TMPDIR will be set by SGE to the created temporary job directory during 
>>execution.
>>
>>CU - Reuti
>>
>>
>>
>>>./hello
>>>
>>>exit 0
>>>
>>>---
>>>
>>># ls -ld /opt/sge/tmp
>>>drwxrwxrwt   2 root     root         512 Apr 15 13:21 /opt/sge/tmp
>>>
>>>---
>>>
>>>-----Original Message-----
>>>From: Reuti <reuti at staff.uni-marburg.de>
>>>To: users at gridengine.sunsource.net
>>>Date: Fri, 15 Apr 2005 14:44:01 +0200
>>>Subject: Re: [GE users] Tight integration with PVM
>>>
>>>Is there anything in the .po or .pe files, or don't they exist at all?
>>>
>>>JONATHAN SELANDER wrote:
>>>
>>>
>>>
>>>>Adding the PE to a queue fixed that error message. However, one node seems to fail each time I run the job (it shows state E when I do qstat -f), and it's not the same node that fails each time.
>>>>
>>>>---
>>>>
>>>># tail -2 /opt/sge/default/spool/brasnod-2/messages
>>>>04/15/2005 22:08:06|execd|brasnod-2|E|shepherd of job 102.1 exited with exit status = 10
>>>>04/15/2005 22:08:06|execd|brasnod-2|W|reaping job "102" ptf complains: Job does not exist
>>>>
>>>>---
>>>>
>>>># qstat -explain E
>>>>queuename                      qtype used/tot. load_avg arch          states
>>>>----------------------------------------------------------------------------
>>>>all.q@brasnod-2                   BIP   0/1       0.02     sol-sparc64   E
>>>>      queue all.q marked QERROR as result of job 102's failure at host brasnod-2
>>>>----------------------------------------------------------------------------
>>>>all.q@brasnod-3                   BIP   0/1       0.02     sol-sparc64
>>>>----------------------------------------------------------------------------
>>>>all.q@brasnod-4                   BIP   0/1       0.01     sol-sparc64
>>>>
>>>>############################################################################
>>>>- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>>>############################################################################
>>>>  102 0.55500 tester_tig root         qw    04/15/2005 14:08:35     3
>>>>
>>>>
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Reuti <reuti at staff.uni-marburg.de>
>>>>To: users at gridengine.sunsource.net
>>>>Date: Fri, 15 Apr 2005 14:02:23 +0200
>>>>Subject: Re: [GE users] Tight integration with PVM
>>>>
>>>>Hi,
>>>>
>>>>did you add the PE to the queue definition (qconf -mq <queue>) like:
>>>>
>>>>pe_list    pvm
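>>>>
>>>>For example, a quick sketch (assuming the queue is named all.q; adjust the name):
>>>>
>>>>qconf -mq all.q                   # opens the queue definition in an editor;
>>>>                                  # set:  pe_list    pvm
>>>>qconf -sq all.q | grep pe_list    # verify the PE is listed afterwards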
>>>>
>>>>CU - Reuti
>>>>
>>>>
>>>>JONATHAN SELANDER wrote:
>>>>
>>>>
>>>>
>>>>
>>>>>I followed the howto at http://gridengine.sunsource.net/howto/pvm-integration/pvm-integration.html for setting up PVM integration with SGE after I had compiled pvm 3 and installed/compiled the utilities in the SGE_ROOT/pvm dir (aimk and install.sh)
>>>>>
>>>>>However, when I try the example tester_tight.sh from the howto, I get these scheduling errors in the logs:
>>>>>
>>>>>---
>>>>>
>>>>>cannot run in queue instance "all.q@brasnod-2" because PE "pvm" is not in pe list
>>>>>cannot run in queue instance "all.q@brasnod-4" because PE "pvm" is not in pe list
>>>>>cannot run because resources requested are not available for parallel job
>>>>>cannot run because available slots combined under PE "pvm" are not in range of job
>>>>>
>>>>>---
>>>>>
>>>>># qconf -sp pvm
>>>>>pe_name           pvm
>>>>>slots             100
>>>>>user_lists        NONE
>>>>>xuser_lists       NONE
>>>>>start_proc_args   /opt/sge/pvm/startpvm.sh -catch_rsh $pe_hostfile $host \
>>>>>               /opt/sge/pvm
>>>>>stop_proc_args    /opt/sge/pvm/stoppvm.sh -catch_rsh $pe_hostfile $host
>>>>>allocation_rule   1
>>>>>control_slaves    TRUE
>>>>>job_is_first_task FALSE
>>>>>urgency_slots     min
>>>>>
>>>>>---
>>>>>
>>>>>
>>>>>What does this mean? brasnod-2, -3 and -4 are execution hosts that work correctly when I run ordinary jobs.
>>>>>
>>>>>J


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net