[GE users] BLCR with SGE?

Rayson Ho raysonho at eseenet.com
Tue Feb 22 18:33:58 GMT 2005



>One thing with MPICH: how do you send a synchronous "do a checkpoint now!"
>signal to all nodes? When the daemon in LAM/MPI supports it, that's fine.

When an MPICH job runs in tight PE integration mode, we can send a chkpnt
signal to all the tasks...

And you are right, if the user runs mpirun on the command line, not under
SGE (or other kinds of daemons to start the tasks), then it is a bit hard
to send signals to the tasks on remote machines...

>It's not SGE related. Myrinet also states on their website that
>checkpointing is not supported.

Good, then I don't need to worry about it at all :)


>Rayson: It's good that you are working on a TM. I was just thinking of
>digging into LAM/MPI to get an SGE module, but this way I will wait,
>because it's of course the cleaner solution.

Exactly! Also, we don't need a setuid-root rsh if we use TM/mpiexec...
(good news to NFS users)

But the TM lib won't come out anytime soon. (see below)

>It was also just on the LAM
>list (or where?), that the TM of OpenPBS is not exactly working
>according to the docs, and that they programmed around it a little bit.
>So the LAM/MPI with TM will not work when you implement TM according to
>the docs - only to warn you - I can't find it again, but I read it
>somewhere.

I'm sure there are complaints about TM... It's an old API that hasn't been
updated much in years :(

Maybe you read that in my email?
http://gridengine.sunsource.net/servlets/ReadMsg?msgId=24882&listName=users

I have been reading the mpiexec source to find out what it expects from the
TM lib, and currently I am trying to find out what we need to modify in the
SGE source...

First thing: instead of starting rshd as root, we need to start the real
slave tasks as the job owner -- this is a bigger change than forwarding the
stdio stuff.
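
To make the "start the task as the job owner" part concrete, here is a
minimal sketch in Python. It is not how SGE does it; start_task_as_owner is a
hypothetical helper, and it relies on the user= and group= arguments of
subprocess.run, which need Python 3.9 or later on a POSIX system.

```python
import subprocess


def start_task_as_owner(argv, uid, gid):
    """Run argv under the given uid/gid and return the CompletedProcess."""
    # Switching to a different uid/gid only works when the caller is
    # privileged (e.g. a daemon running as root). A real daemon would also
    # set the owner's supplementary groups (the extra_groups argument).
    return subprocess.run(
        argv,
        user=uid,
        group=gid,
        capture_output=True,
        text=True,
        check=True,
    )
```

The point is that the privileged daemon drops to the owner's identity before
the task starts, so no root-owned rshd sits between the daemon and the task.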

Rayson


>
>Maybe you can switch/add to the LAM/MPI-dev list at a later point in
>time for this point.
>
>Cheers - Reuti