[GE users] Tight integration with PVM

Reuti reuti at staff.uni-marburg.de
Fri Apr 15 14:42:00 BST 2005


JONATHAN SELANDER wrote:
> I just realized that the home directory of the user I ran the job as wasn't mounted on the nodes via NFS, so I have .po and .pe files on them. The contents on the failing node are:
> 
> --
> 
> # cat /tester_tight.sh.pe103
> startpvm.sh: can't execute brasnod-2/lib/pvmgetarch
> libpvm [pid2601] /tmp/pvmd.0: No such file or directory
> libpvm [pid2601] /tmp/pvmd.0: No such file or directory
> libpvm [pid2601]: pvm_halt(): Can't contact local daemon
> 
> --
> 
> # cat /tester_tight.sh.po103
> -catch_rsh /opt/sge/default/spool/brasnod-2/active_jobs/103.1/pe_hostfile brasnod-2 /opt/sge/pvm

For some reason "brasnod-2" is taken as the third parameter, so it 
seems the -catch_rsh was not removed. Is brasnod-2 using the new 
version of startpvm.sh from the Howto? You will need the scripts and 
programs from the Howto, not the ones included in SGE.

One thing to note: the last option in start_proc_args must be the 
location of the pvm3 installation, not the location of the PVM scripts 
for SGE. If you copied all the pvm3 files to this location for easier 
maintenance, it may be okay, but I never tried that. By default I 
would put the pvm3 installation somewhere under /opt or /usr.
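
For example, assuming pvm3 is installed under /opt/pvm3 (adjust the 
path to your installation), the relevant lines of the PE would read:

---

start_proc_args   /opt/sge/pvm/startpvm.sh -catch_rsh $pe_hostfile $host \
                  /opt/pvm3
stop_proc_args    /opt/sge/pvm/stoppvm.sh -catch_rsh $pe_hostfile $host

---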

CU - Reuti

> -catch_rsh /opt/sge/default/spool/brasnod-2/active_jobs/103.1/pe_hostfile brasnod-2
> 
> --
> 
> -----Original Message-----
> From: "JONATHAN SELANDER" <S026655 at utb.hb.se>
> To: users at gridengine.sunsource.net
> Date: Fri, 15 Apr 2005 15:13:30 +0200
> Subject: Re: Re: [GE users] Tight integration with PVM
> 
> The nodes have SGE_ROOT mounted over NFS. I don't have any .pe or .po output files in the home directory, which is where I ran it from. What happens is that a node fails and the job stays in state qw until I reset the queue with "qmod -cq all.q", which causes the same event to happen again.
> 
> -----Original Message-----
> From: Reuti <reuti at staff.uni-marburg.de>
> To: users at gridengine.sunsource.net
> Date: Fri, 15 Apr 2005 15:07:34 +0200
> Subject: Re: [GE users] Tight integration with PVM
> 
> At least the .po file should exist in your home directory (or wherever 
> you submitted your job from), as the granted nodes are listed there by 
> the start script. Is $SGE_ROOT shared, and is the new version of the 
> start/stop scripts available on the nodes?
> 
> JONATHAN SELANDER wrote:
> 
>>I don't have any files like that in SGE_ROOT or the TMPDIR
>>
>>---
>>
>># cat tester_tight.sh
>>#!/bin/sh
>>
>>export PVM_TMP=/opt/sge/tmp
> 
> 
> Please use:
> 
> export PVM_TMP=$TMPDIR
> 
> SGE sets $TMPDIR to the temporary job directory it creates for each 
> job during execution.
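> 
> For example, the whole job script would then read (a sketch based on 
> your script; ./hello is your PVM test program):
> 
> #!/bin/sh
> # SGE exports $TMPDIR as the per-job temporary directory on each node
> export PVM_TMP=$TMPDIR
> ./hello
> exit 0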
> 
> CU - Reuti
> 
> 
>>./hello
>>
>>exit 0
>>
>>---
>>
>># ls -ld /opt/sge/tmp
>>drwxrwxrwt   2 root     root         512 Apr 15 13:21 /opt/sge/tmp
>>
>>---
>>
>>-----Original Message-----
>>From: Reuti <reuti at staff.uni-marburg.de>
>>To: users at gridengine.sunsource.net
>>Date: Fri, 15 Apr 2005 14:44:01 +0200
>>Subject: Re: [GE users] Tight integration with PVM
>>
>>Is there anything in the .po or .pe files, or do they not exist at all?
>>
>>JONATHAN SELANDER wrote:
>>
>>
>>>Adding the PE to a queue fixed that error message. However, one node seems to fail each time I run the job (it has state E when I do qstat -f). It's not the same node that fails each time, either.
>>>
>>>---
>>>
>>># tail -2 /opt/sge/default/spool/brasnod-2/messages
>>>04/15/2005 22:08:06|execd|brasnod-2|E|shepherd of job 102.1 exited with exit status = 10
>>>04/15/2005 22:08:06|execd|brasnod-2|W|reaping job "102" ptf complains: Job does not exist
>>>
>>>---
>>>
>>># qstat -explain E
>>>queuename                      qtype used/tot. load_avg arch          states
>>>----------------------------------------------------------------------------
>>>all.q at brasnod-2                BIP   0/1       0.02     sol-sparc64   E
>>>       queue all.q marked QERROR as result of job 102's failure at host brasnod-2
>>>----------------------------------------------------------------------------
>>>all.q at brasnod-3                BIP   0/1       0.02     sol-sparc64
>>>----------------------------------------------------------------------------
>>>all.q at brasnod-4                BIP   0/1       0.01     sol-sparc64
>>>
>>>############################################################################
>>>- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>>############################################################################
>>>   102 0.55500 tester_tig root         qw    04/15/2005 14:08:35     3
>>>
>>>
>>>
>>>
>>>-----Original Message-----
>>>From: Reuti <reuti at staff.uni-marburg.de>
>>>To: users at gridengine.sunsource.net
>>>Date: Fri, 15 Apr 2005 14:02:23 +0200
>>>Subject: Re: [GE users] Tight integration with PVM
>>>
>>>Hi,
>>>
>>>did you add the PE to the queue definition (qconf -mq <queue>) like:
>>>
>>>pe_list    pvm
>>>
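>>>You can check what is currently set for the queue with, for example:
>>>
>>># qconf -sq all.q | grep pe_list
>>>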
>>>CU - Reuti
>>>
>>>
>>>JONATHAN SELANDER wrote:
>>>
>>>
>>>
>>>>I followed the Howto at http://gridengine.sunsource.net/howto/pvm-integration/pvm-integration.html for setting up PVM integration with SGE, after I had compiled PVM 3 and installed/compiled the utilities in the SGE_ROOT/pvm dir (aimk and install.sh).
>>>>
>>>>However, when I try the example tester_tight.sh from the Howto, I get these scheduling errors in the logs:
>>>>
>>>>---
>>>>
>>>>cannot run in queue instance "all.q at brasnod-2" because PE "pvm" is not in pe list
>>>>cannot run in queue instance "all.q at brasnod-4" because PE "pvm" is not in pe list
>>>>cannot run because resources requested are not available for parallel job
>>>>cannot run because available slots combined under PE "pvm" are not in range of job
>>>>
>>>>---
>>>>
>>>># qconf -sp pvm
>>>>pe_name           pvm
>>>>slots             100
>>>>user_lists        NONE
>>>>xuser_lists       NONE
>>>>start_proc_args   /opt/sge/pvm/startpvm.sh -catch_rsh $pe_hostfile $host \
>>>>                /opt/sge/pvm
>>>>stop_proc_args    /opt/sge/pvm/stoppvm.sh -catch_rsh $pe_hostfile $host
>>>>allocation_rule   1
>>>>control_slaves    TRUE
>>>>job_is_first_task FALSE
>>>>urgency_slots     min
>>>>
>>>>---
>>>>
>>>>
>>>>What does this mean? brasnod-2, 3 and 4 are execution hosts which work correctly when I run ordinary jobs.
>>>>
>>>>J


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list