[GE users] prevent users from executing jobs on nodes except via sungrid

Jerry Mersel jerry.mersel at weizmann.ac.il
Mon Apr 3 14:09:58 BST 2006



Hi:


I really appreciate all the time and effort that Reuti and the other users
put into solving my problem.

My boss (not my real boss, who is my wife) wants to get users working on
this grid, so we're going to use ssh and I'll probably write a script
to make sure that the processes are children of SGE.
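
A minimal sketch of what such a check could look like, assuming Linux
nodes with /proc and assuming the SGE daemons show up under the process
names sge_execd and sge_shepherd (both names are assumptions to verify on
the nodes; has_sge_ancestor is only an illustrative helper). It walks up
the parent chain of a process and reports whether it descends from SGE:

```python
def read_proc_stat(pid):
    """Return (command_name, parent_pid) for pid from Linux /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split around the first '(' and the last ')'.
    name = data[data.index("(") + 1 : data.rindex(")")]
    fields = data[data.rindex(")") + 2 :].split()
    # fields[0] is the process state, fields[1] is the parent PID.
    return name, int(fields[1])


def has_sge_ancestor(pid, sge_names=("sge_shepherd", "sge_execd"),
                     lookup=read_proc_stat):
    """True if pid or any of its ancestors has a name starting with one
    of sge_names (assumed SGE daemon names)."""
    seen = set()
    while pid > 1 and pid not in seen:
        seen.add(pid)
        try:
            name, ppid = lookup(pid)
        except (OSError, ValueError):
            return False  # process vanished or its stat file is unreadable
        if any(name.startswith(n) for n in sge_names):
            return True
        pid = ppid
    return False
```

A cron job on each node could call has_sge_ancestor() for the PIDs of
ordinary users and flag the ones for which it returns False.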

But I gave it another try. The PE is the same and the only difference is
in the job script: the one using ssh calls mpirun_ssh and the one using
SGE's rsh calls mpirun_rsh. With mpirun_rsh I get permission denied on
the nodes.

                             Thank you very much,
                                  Jerry



> On 02.04.2006 at 19:36, Jerry Mersel wrote:
>
>> No, the job isn't running as root.
>>
>> I have a strong suspicion that SGE's rshd daemon isn't running.
>>
>> Do I have to enable it in some way? I just assumed when I stopped
>> running the system's rshd daemon that SGE would take care of running
>> its own rshd daemon. But maybe I was mistaken.
>
> SGE's rshd will not run all the time; instead, one instance is started
> on a randomly chosen port for every qrsh call you make. Are these three
> programs in utilbin owned by root, and do they have the suid bit set:
>
> -r-s--x--x  1 root root  26K 2005-12-09 13:41 rlogin
> -r-s--x--x  1 root root  20K 2005-12-09 13:41 rsh
> ...
> -r-s--x--x  1 root root  22K 2005-12-09 13:41 testsuidroot
>
> -- Reuti
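
The ownership and suid check asked about above can be sketched as
follows, assuming the binaries live in a directory such as
<SGEROOT>/utilbin/lx24-amd64. The directory argument and the helper names
is_suid_root and suid_root_report are illustrative and not part of SGE:

```python
import os
import stat


def is_suid_root(st):
    """True if a stat result describes a file owned by root with the suid bit."""
    return st.st_uid == 0 and bool(st.st_mode & stat.S_ISUID)


def suid_root_report(directory, names=("rlogin", "rsh", "testsuidroot")):
    """Map each program name to True if it is owned by root and suid."""
    report = {}
    for name in names:
        path = os.path.join(directory, name)
        try:
            report[name] = is_suid_root(os.stat(path))
        except OSError:
            report[name] = False  # a missing or unreadable file counts as a failure
    return report
```

Calling suid_root_report() with the utilbin directory of each execution
host should give True for all three names if the permissions match the
listing above.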
>
>
>> When I was not using SGE and was running the system's rshd daemon, it
>> was necessary to set up <HOME>/.rhosts so the users could run on
>> parallel machines without using a password.
>>
>>                             Best Regards,
>>                                Jerry
>>
>>
>> P.S. It was working with rshd (system) and .rhosts and/or
>> sshd/authorized_keys2. But I don't think that counts as tight
>> integration.
>>
>> P.P.S. Perhaps it would help if I ran it without mpirun?
>>
>> P.P.P.S. Just babbling at the moment.
>>
>>> On 02.04.2006 at 14:57, Jerry Mersel wrote:
>>>
>>>> I looked into <tmp/some_dir/>; the links and the machinefile there
>>>> have been set up correctly.
>>>>
>>>> What is causing the trouble is "Connection refused".
>>>>
>>>> When I was using the system rsh (along with .rhosts) I was able to
>>>> make the connection.
>>>>
>>>> Now, using SGE's rshd, with or without .rhosts, I get Connection refused.
>>>
>>> Are you running the jobs as root? I never had to put anything
>>> into .rhosts for each of my individual users.
>>>
>>> -- Reuti
>>>
>>>
>>>>                          Regards,
>>>>                            Jerry
>>>>
>>>>
>>>>
>>>>> On 31.03.2006 at 14:04, Jerry Mersel wrote:
>>>>>
>>>>>> I rebuilt it, but it didn't help:
>>>>>>
>>>>>> here are the results:
>>>>>>
>>>>>> error file:
>>>>>>
>>>>>> connect to address 192.168.1.3: Connection refused
>>>>>> connect to address 192.168.1.3: Connection refused
>>>>>> trying normal rsh (/usr/bin/rsh)
>>>>>
>>>>> The question is whether the directory with the rsh-wrapper was
>>>>> correctly set up on the slave node. Just submit a parallel job, put a
>>>>> sleep 600 or so in the job script (instead of any mpirun command), and
>>>>> check whether /wiccusers/mlmersel/mlmersel/mpi/startmpi.sh created the
>>>>> correct machinefile and the correct link to the rsh-wrapper,
>>>>> /wiccusers/mlmersel/mlmersel/mpi/rsh, on the master node of the
>>>>> parallel job.
>>>>>
>>>>> BTW: Any firewall on the slave nodes?
>>>>>
>>>>> -- Reuti
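
A sketch of that check, to be run on the master node while the sleep job
is still active. It assumes the behaviour of the stock SGE MPI template,
where startmpi.sh -catch_rsh writes a file named machines and a link
named rsh into the job's temporary directory; both names are assumptions
and can be overridden. check_pe_setup is a hypothetical helper:

```python
import os


def check_pe_setup(tmpdir, wrapper, machinefile="machines", link="rsh"):
    """Return a list of problems found in the job's temporary directory."""
    problems = []

    mf = os.path.join(tmpdir, machinefile)
    if not os.path.isfile(mf):
        problems.append(f"missing machinefile: {mf}")
    else:
        with open(mf) as f:
            hosts = [line.strip() for line in f if line.strip()]
        if not hosts:
            problems.append(f"machinefile is empty: {mf}")

    lk = os.path.join(tmpdir, link)
    if not os.path.islink(lk):
        problems.append(f"no symlink at {lk}")
    elif os.path.realpath(lk) != os.path.realpath(wrapper):
        # The link exists but resolves to something other than the wrapper.
        problems.append(f"{lk} points to {os.path.realpath(lk)}, expected {wrapper}")

    return problems
```

An empty list means the machinefile and the link look as expected; the
wrapper argument would be /wiccusers/mlmersel/mlmersel/mpi/rsh here.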
>>>>>
>>>>>
>>>>>> wiccopt-3.weizmann.ac.il: Connection refused
>>>>>>
>>>>>> standard output file:
>>>>>>
>>>>>> p0_11663:  p4_error: Child process exited while making connection to remote process on wiccopt-3: 0
>>>>>> p0_11663: (33.023438) net_send: could not write to fd=4, errno = 32
>>>>>>
>>>>>>                           Regards,
>>>>>>                              Jerry
>>>>>>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 30.03.2006 at 16:34, Jerry Mersel wrote:
>>>>>>>
>>>>>>>> Thanks Reuti:
>>>>>>>>
>>>>>>>>   You could probably tell I'm getting kinda desperate with this,
>>>>>>>>   and so I am.
>>>>>>>>
>>>>>>>>   I stopped running the system rshd.
>>>>>>>>   I tried it with a conventional switch and with an MPICH that doesn't
>>>>>>>>   come from Voltaire, and built it with
>>>>>>>>   <SGEROOT>/utilbin/lx24-amd64/rsh as the RSHCOMMAND.
>>>>>>>>
>>>>>>>
>>>>>>> No, this way the wrapper won't work. Please recompile it with a
>>>>>>> simple switch:
>>>>>>>
>>>>>>> -rsh=rsh
>>>>>>>
>>>>>>> in the MPICH ./configure, although it's deprecated.
>>>>>>>
>>>>>>> Cheers - Reuti
>>>>>>>
>>>>>>>
>>>>>>>>   If I run on one node it works (it doesn't run on the master); on 2 or
>>>>>>>>   more I get Connection refused.
>>>>>>>>
>>>>>>>>   I have looked into <sge>/mpi.
>>>>>>>>
>>>>>>>>   The PE that I am using:
>>>>>>>>
>>>>>>>> pe_name           mlmersel
>>>>>>>> slots             999
>>>>>>>> user_lists        NONE
>>>>>>>> xuser_lists       NONE
>>>>>>>> start_proc_args   /wiccusers/mlmersel/mlmersel/mpi/startmpi.sh -catch_rsh $pe_hostfile
>>>>>>>> stop_proc_args    /wiccusers/mlmersel/mlmersel/mpi/stopmpi.sh
>>>>>>>> allocation_rule   $round_robin
>>>>>>>> control_slaves    TRUE
>>>>>>>> job_is_first_task FALSE
>>>>>>>> urgency_slots     min
>>>>>>>>
>>>>>>>>
>>>>>>>> The queue config:
>>>>>>>>
>>>>>>>> qname                 all.q
>>>>>>>> hostlist              @allhosts
>>>>>>>> seq_no                0
>>>>>>>> load_thresholds       np_load_avg=1.75
>>>>>>>> suspend_thresholds    NONE
>>>>>>>> nsuspend              1
>>>>>>>> suspend_interval      00:05:00
>>>>>>>> priority              0
>>>>>>>> min_cpu_interval      00:05:00
>>>>>>>> processors            UNDEFINED
>>>>>>>> qtype                 BATCH INTERACTIVE
>>>>>>>> ckpt_list             NONE
>>>>>>>> pe_list               jerry mlmersel mpi mymake
>>>>>>>> rerun                 FALSE
>>>>>>>> slots                 2,[wiccopt-2.weizmann.ac.il=2], \
>>>>>>>>                       [wiccopt-3.weizmann.ac.il=2], \
>>>>>>>>       %2
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net

