[GE users] Tight integration with PVM

JONATHAN SELANDER S026655 at utb.hb.se
Fri Apr 15 14:19:00 BST 2005



I just realized that the home directory of the user I ran the job as wasn't mounted on the nodes via NFS, so the .po and .pe files ended up locally on them. Their contents on the failing node are:

--

# cat /tester_tight.sh.pe103
startpvm.sh: can't execute brasnod-2/lib/pvmgetarch
libpvm [pid2601] /tmp/pvmd.0: No such file or directory
libpvm [pid2601] /tmp/pvmd.0: No such file or directory
libpvm [pid2601]: pvm_halt(): Can't contact local daemon

--

# cat /tester_tight.sh.po103
-catch_rsh /opt/sge/default/spool/brasnod-2/active_jobs/103.1/pe_hostfile brasnod-2 /opt/sge/pvm
-catch_rsh /opt/sge/default/spool/brasnod-2/active_jobs/103.1/pe_hostfile brasnod-2

--
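The relative path in that .pe error ("brasnod-2/lib/pvmgetarch") looks as if startpvm.sh received the hostname where it expected the PVM root directory. A hedged diagnostic to run on each node (the /opt/sge/pvm path is the one used elsewhere in this thread; adjust for your site):

```shell
#!/bin/sh
# Hedged check: the PVM root passed to startpvm.sh must be a directory
# containing an executable lib/pvmgetarch, not a bare hostname.

check_pvm_root() {
    root=$1
    if [ -x "$root/lib/pvmgetarch" ]; then
        echo "ok: $root/lib/pvmgetarch"
        return 0
    fi
    echo "bad PVM root: $root (no executable lib/pvmgetarch)"
    return 1
}

# /opt/sge/pvm is an assumption taken from this thread.
check_pvm_root "${PVM_ROOT:-/opt/sge/pvm}" || true
```

If this prints "bad PVM root" on a node, the argument order or value handed to startpvm.sh is worth re-checking.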

-----Original Message-----
From: "JONATHAN SELANDER" <S026655 at utb.hb.se>
To: users at gridengine.sunsource.net
Date: Fri, 15 Apr 2005 15:13:30 +0200
Subject: Re: Re: [GE users] Tight integration with PVM

The nodes have SGE_ROOT mounted over NFS. I don't have any .pe or .po output files in the home directory, which is where I submitted the job from. What happens is that a node fails and the job stays in state qw until I clear the queue error with "qmod -cq all.q", after which the same failure happens again.

-----Original Message-----
From: Reuti <reuti at staff.uni-marburg.de>
To: users at gridengine.sunsource.net
Date: Fri, 15 Apr 2005 15:07:34 +0200
Subject: Re: [GE users] Tight integration with PVM

At least the .po file should exist in your home directory (or wherever you 
submitted your job from), as the granted nodes are listed there by the 
start script. Is $SGE_ROOT shared, and are the new versions of the 
start/stop scripts available on the nodes?
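Those two questions can be checked directly on an execution host. A minimal sketch, assuming the /opt/sge root and the startpvm.sh/stoppvm.sh names from the howto setup described in this thread:

```shell
#!/bin/sh
# Hedged check on a compute node: are the PVM start/stop scripts
# visible under the (presumably NFS-shared) $SGE_ROOT?

check_scripts() {
    root=$1
    missing=0
    for s in startpvm.sh stoppvm.sh; do
        if [ -r "$root/pvm/$s" ]; then
            echo "found: $root/pvm/$s"
        else
            echo "MISSING: $root/pvm/$s"
            missing=1
        fi
    done
    return $missing
}

# /opt/sge is an assumption taken from this thread.
check_scripts "${SGE_ROOT:-/opt/sge}" || true
```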

JONATHAN SELANDER wrote:
> I don't have any files like that in SGE_ROOT or the TMPDIR
> 
> ---
> 
> # cat tester_tight.sh
> #!/bin/sh
> 
> export PVM_TMP=/opt/sge/tmp

Please use:

export PVM_TMP=$TMPDIR

$TMPDIR will be set by SGE to the created temporary job directory during 
execution.
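Applied to the submit script quoted above, that would look like the following sketch (./hello is the PVM example binary from the original script; the /tmp fallback is an assumption for running the script interactively, outside SGE):

```shell
#!/bin/sh
# tester_tight.sh, revised per the advice above: use the per-job
# scratch directory that SGE exports as $TMPDIR instead of a fixed
# shared path like /opt/sge/tmp.
PVM_TMP=${TMPDIR:-/tmp}   # fall back to /tmp outside SGE (assumption)
export PVM_TMP

# Run the PVM example binary if present; the guard lets the sketch
# run even where ./hello does not exist.
[ -x ./hello ] && ./hello || true
```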

CU - Reuti

> 
> ./hello
> 
> exit 0
> 
> ---
> 
> # ls -ld /opt/sge/tmp
> drwxrwxrwt   2 root     root         512 Apr 15 13:21 /opt/sge/tmp
> 
> ---
> 
> -----Original Message-----
> From: Reuti <reuti at staff.uni-marburg.de>
> To: users at gridengine.sunsource.net
> Date: Fri, 15 Apr 2005 14:44:01 +0200
> Subject: Re: [GE users] Tight integration with PVM
> 
Is there anything in the .po or .pe files, or don't they exist at all?
> 
> JONATHAN SELANDER wrote:
> 
>>Adding the PE to a queue fixed that error message. However, one node seems to fail each time I run the job (it has state E when I do qstat -f), and it's not the same node that fails each time.
>>
>>---
>>
>># tail -2 /opt/sge/default/spool/brasnod-2/messages
>>04/15/2005 22:08:06|execd|brasnod-2|E|shepherd of job 102.1 exited with exit status = 10
>>04/15/2005 22:08:06|execd|brasnod-2|W|reaping job "102" ptf complains: Job does not exist
>>
>>---
>>
>># qstat -explain E
>>queuename                      qtype used/tot. load_avg arch          states
>>----------------------------------------------------------------------------
>>all.q at brasnod-2                BIP   0/1       0.02     sol-sparc64   E
>>        queue all.q marked QERROR as result of job 102's failure at host brasnod-2
>>----------------------------------------------------------------------------
>>all.q at brasnod-3                BIP   0/1       0.02     sol-sparc64
>>----------------------------------------------------------------------------
>>all.q at brasnod-4                BIP   0/1       0.01     sol-sparc64
>>
>>############################################################################
>> - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>############################################################################
>>    102 0.55500 tester_tig root         qw    04/15/2005 14:08:35     3
>>
>>
>>
>>
>>-----Original Message-----
>>From: Reuti <reuti at staff.uni-marburg.de>
>>To: users at gridengine.sunsource.net
>>Date: Fri, 15 Apr 2005 14:02:23 +0200
>>Subject: Re: [GE users] Tight integration with PVM
>>
>>Hi,
>>
>>did you add the PE to the queue definition (qconf -mq <queue>) like:
>>
>>pe_list    pvm
>>
>>CU - Reuti
>>
>>
>>JONATHAN SELANDER wrote:
>>
>>
>>>I followed the howto at http://gridengine.sunsource.net/howto/pvm-integration/pvm-integration.html for setting up PVM integration with SGE after I had compiled PVM 3 and installed/compiled the utilities in the SGE_ROOT/pvm dir (aimk and install.sh).
>>>
>>>However, when I try the example tester_tight.sh from the howto, I get these scheduling errors in the logs:
>>>
>>>---
>>>
>>>cannot run in queue instance "all.q at brasnod-2" because PE "pvm" is not in pe list
>>>cannot run in queue instance "all.q at brasnod-4" because PE "pvm" is not in pe list
>>>cannot run because resources requested are not available for parallel job
>>>cannot run because available slots combined under PE "pvm" are not in range of job
>>>
>>>---
>>>
>>># qconf -sp pvm
>>>pe_name           pvm
>>>slots             100
>>>user_lists        NONE
>>>xuser_lists       NONE
>>>start_proc_args   /opt/sge/pvm/startpvm.sh -catch_rsh $pe_hostfile $host \
>>>                 /opt/sge/pvm
>>>stop_proc_args    /opt/sge/pvm/stoppvm.sh -catch_rsh $pe_hostfile $host
>>>allocation_rule   1
>>>control_slaves    TRUE
>>>job_is_first_task FALSE
>>>urgency_slots     min
>>>
>>>---
>>>
>>>
>>>What does this mean? brasnod-2, 3 and 4 are execution hosts that work correctly when I run ordinary jobs.
>>>
>>>J
>>>
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
> 
> 
> 
> 
> 
> 
> 
> 
> 


More information about the gridengine-users mailing list