[GE users] Long delay when submitting large jobs

Rayson Ho raysonho at eseenet.com
Tue Feb 15 18:05:48 GMT 2005



>I'm curious whether PSCHED API actually did specify how task
>stdin/stdout/stderr shall be handled. Anyone knowing if there is
>a more recent version of PSCHED API than v0.1?
>
>   http://www.jrac.com/people/jjones/psched-api-report.ps

That's also the version I have...

However, I've read the PBS and mpiexec source, and the interface they
actually use differs from the document in some minor ways. Also, a few of
the APIs defined in the document above are not implemented in PBS/Torque.


>with mpiexec I encounter a process for reading and writing tasks
>stdin/stdout/stderr is forked-off. Though this is a solution but
>I'd prefer this be transparently done by an API for the task.

But I hope there is a way to turn that off. In some cases users don't need
stdio at all, or maybe need it only for the first MPI task. (Also, LAM
doesn't need it.)

And again, to use mpiexec's stdio method with SGE, we just need to add a
few lines of code to the shepherd to connect stdin/stdout/stderr back to
mpiexec.
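
To make the idea concrete, here is a minimal sketch in Python (not SGE or
mpiexec code) of starting a task with its stdout/stderr attached directly
to a socket that leads back to the launcher. The names
run_with_forwarded_stdio and demo are hypothetical, socketpair() stands in
for the real connection back to mpiexec, and the forward flag shows how the
forwarding could be turned off for tasks that don't need it. It assumes a
POSIX system.

```python
import socket
import subprocess
import sys


def run_with_forwarded_stdio(argv, sock, forward=True):
    """Run argv and return its exit code.

    With forward=True the child's stdout and stderr are the socket itself,
    so no relay process is needed on the execution host. With forward=False
    the output is discarded.
    """
    target = sock.fileno() if forward else subprocess.DEVNULL
    proc = subprocess.Popen(argv, stdin=subprocess.DEVNULL,
                            stdout=target, stderr=target)
    return proc.wait()


def demo():
    # Assumption: socketpair() replaces the TCP connection to the launcher.
    launcher_end, task_end = socket.socketpair()
    code = run_with_forwarded_stdio(
        [sys.executable, "-c", "print('hello from task')"], task_end)
    task_end.close()  # drop our copy so the reader sees end-of-file
    data = b""
    while True:
        chunk = launcher_end.recv(4096)
        if not chunk:
            break
        data += chunk
    launcher_end.close()
    return code, data.decode()
```

Calling demo() returns the task's exit code together with whatever it
wrote; a real launcher would read from many such sockets at once.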

(BTW, PBSPro didn't support connecting stdin/stdout/stderr back to mpiexec.
That was discovered 2 years ago; I'm not sure whether it has been fixed
since.)


I think the point of providing a TM lib is not to add any extra PE
functionality, but rather to support TM-based PBS parallel environments,
and so let more PBS users migrate to SGE? :)

There are still many government labs and university supercomputer centers
using PBS/Torque, but I believe most compute farm users are using SGE (this
is good!).


(Back to the extra functionality stuff)
We could, however, modify the protocol to start the rshds from mpiexec and
offload qmaster. *However*, since it now takes qmaster less than 2 seconds
to contact 400 execds, this is not a problem anymore. (Not to mention that
it is a lot more secure to have qmaster contact the execds and tell them to
accept a parallel slave job (TAG_SLAVE_ALLOW).)
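
The security argument can be illustrated with a toy model in Python. This
is only a sketch of the general allow-list idea, not how SGE implements
TAG_SLAVE_ALLOW: the class ExecdSketch and its method names are
hypothetical. The execution daemon starts a slave task only for jobs the
master has announced beforehand, so a launcher cannot start tasks on a host
that was never granted to its job.

```python
class ExecdSketch:
    """Toy stand-in for an execution daemon; not SGE code."""

    def __init__(self):
        self.allowed_jobs = set()

    def on_slave_allow(self, job_id):
        # Assumed to arrive only from the trusted master.
        self.allowed_jobs.add(job_id)

    def on_task_request(self, job_id, command):
        # Task requests come from the job's launcher, which is less trusted.
        if job_id not in self.allowed_jobs:
            return "rejected"
        return "started: " + command


def demo_allow():
    execd = ExecdSketch()
    before = execd.on_task_request(42, "hostname")  # not yet announced
    execd.on_slave_allow(42)
    after = execd.on_task_request(42, "hostname")   # now permitted
    return before, after
```

If the launcher itself told the daemons which jobs to accept, the check
in on_task_request would no longer protect anything.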

Rayson


>
>Andreas
>
