[GE users] prevent users from executing jobs on nodes except via sungrid

Reuti reuti at staff.uni-marburg.de
Fri Mar 31 13:10:57 BST 2006


On 31.03.2006 at 14:04, Jerry Mersel wrote:

> I rebuilt it, but it didn't help:
>
> here are the results:
>
> error file:
>
> connect to address 192.168.1.3: Connection refused
> connect to address 192.168.1.3: Connection refused
> trying normal rsh (/usr/bin/rsh)

The question is whether the directory with the rsh wrapper was set up
correctly on the slave node. Just submit a parallel job, put a
sleep 600 or so in the job script (instead of any mpirun command),
and check on the master node of the parallel job whether
/wiccusers/mlmersel/mlmersel/mpi/startmpi.sh created the correct
machinefile and the correct link to the rsh wrapper
/wiccusers/mlmersel/mlmersel/mpi/rsh.
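
A minimal sketch of such a test job, under some assumptions: the slot
count of 4 is only an example, and the names $TMPDIR/machines and
$TMPDIR/rsh are what the stock startmpi.sh template uses with
-catch_rsh, so a modified copy of the script may put them elsewhere.
The helper check_pe_setup is a made-up name for this illustration.

```shell
#!/bin/sh
#$ -pe mlmersel 4
#$ -cwd

# Report what the PE start script left in the given directory.
# Assumes the stock layout: a "machines" file and an "rsh" symlink.
check_pe_setup() {
    dir=$1
    if [ -s "$dir/machines" ]; then
        echo "machinefile: present"
        cat "$dir/machines"
    else
        echo "machinefile: MISSING"
    fi
    if [ -L "$dir/rsh" ]; then
        echo "rsh wrapper: link present"
        ls -l "$dir/rsh"
    else
        echo "rsh wrapper: MISSING"
    fi
}

check_pe_setup "${TMPDIR:-/tmp}"

# Keep the job alive only when running under the grid engine (JOB_ID set),
# so there is time to log in to the master node and look around.
if [ -n "${JOB_ID:-}" ]; then
    sleep 600
fi
```

Submit it with qsub and look at the job's output file, or log in to
the master node while the job sleeps and inspect the directory by hand.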

BTW: Any firewall on the slave nodes?

-- Reuti


> wiccopt-3.weizmann.ac.il: Connection refused
>
> standard output file:
>
> p0_11663:  p4_error: Child process exited while making connection to
> remote process on wiccopt-3: 0
> p0_11663: (33.023438) net_send: could not write to fd=4, errno = 32
>
>                           Regards,
>                              Jerry
>
>
>> Hi,
>>
>> On 30.03.2006 at 16:34, Jerry Mersel wrote:
>>
>>> Thanks Reuti:
>>>
>>>   You could probably tell I'm getting kinda desperate with this,
>>>   and so I am.
>>>
>>>   I stopped running the system rshd.
>>>   I tried it with a conventional switch, and with MPICH that doesn't
>>>   come from voltaire, and built it with
>>>   <SGEROOT>/utilbin/lx24-amd64/rsh as the RSHCOMMAND.
>>>
>>
>> no, this way the wrapper won't work. Please recompile it with a
>> simple switch:
>>
>> -rsh=rsh
>>
>> in the MPICH ./configure - although it's deprecated.
>>
>> Cheers - Reuti
>>
>>
>>>   If I run on one node it works, (doesn't run on master), on 2 or more
>>>   I get Connection refused.
>>>
>>>   I have looked into <sge>/mpi.
>>>
>>>   The PE that I am using:
>>>
>>> pe_name           mlmersel
>>> slots             999
>>> user_lists        NONE
>>> xuser_lists       NONE
>>> start_proc_args   /wiccusers/mlmersel/mlmersel/mpi/startmpi.sh -catch_rsh $pe_hostfile
>>> stop_proc_args    /wiccusers/mlmersel/mlmersel/mpi/stopmpi.sh
>>> allocation_rule   $round_robin
>>> control_slaves    TRUE
>>> job_is_first_task FALSE
>>> urgency_slots     min
>>>
>>>
>>> The queue config:
>>>
>>>   qname                 all.q
>>> hostlist              @allhosts
>>> seq_no                0
>>> load_thresholds       np_load_avg=1.75
>>> suspend_thresholds    NONE
>>> nsuspend              1
>>> suspend_interval      00:05:00
>>> priority              0
>>> min_cpu_interval      00:05:00
>>> processors            UNDEFINED
>>> qtype                 BATCH INTERACTIVE
>>> ckpt_list             NONE
>>> pe_list               jerry mlmersel mpi mymake
>>> rerun                 FALSE
>>> slots                 2,[wiccopt-2.weizmann.ac.il=2], \
>>>                       [wiccopt-3.weizmann.ac.il=2], \
>>>       %2
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net




