[GE users] nodes overloaded: processes placed on already full nodes

steve_s elcortogm at googlemail.com
Tue Dec 21 18:21:48 GMT 2010


On Dec 21 18:22 +0100, reuti wrote:
> >> What does:
> >> 
> >> ps -e f
> >> 
> >> (f w/o -) show on such a node? Are all the processes bound to an
> >> sge_shepherd, or did some jump out of the processes tree and weren't
> >> killed?
> > 
> > There are no sge_shepherd's on the nodes. I did not set up SGE on the
> > machine but what I understand from the documentation is that
> > sge_shepherd is only used in the case of "tight integration" of PEs.
> > In our case, the PE starts the MPI processes.
> 
> Well, even with a loose integration, you have to honor the list of
> granted machines for your job. What do you mean in detail by "the PE
> starts the MPI processes"? You will need at least an sge_execd on the
> nodes, so that SGE is aware of their existence and can make a suitable
> slot allocation for your job. (The sge_execd will then start the
> shepherd in case of a tight integration.)

Yes, sge_execd is present on each node, as well as sge_shepherd-$JOB_ID
on the master node, where the job-script is executed:

 4693 ?        Sl    33:32 /cm/shared/apps/sge/current/bin/lx26-amd64/sge_execd
12165 ?        S      0:00  \_ sge_shepherd-60013 -bg
12389 ?        S      0:00                  \_ python /cm/shared/apps/intel/impi/3.2.2.006/bin64/mpiexec ....


Apparently, we have tight integration then. I had looked for sge_shepherd
on the wrong node (not the master node). This is the first time I have
taken a closer look at these daemons, hence the confusion (we got the
machine pre-configured, and getting familiar with the system always
takes a factor of pi longer than expected). Sorry for the noise.

Now that we know what to look for, we can search for jobs which do not
behave.
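
A rough sketch of how such a search could look on a single node, in
Python. It assumes that a process which jumped out of the process tree no
longer has an sge_shepherd among its ancestors (typically because it was
reparented to init). The function names and the user name in the usage
note are made up for illustration; this is not part of SGE.

```python
import os
import subprocess

def snapshot():
    """Return the output of 'ps -e -o pid,ppid,user,args' as text."""
    return subprocess.run(
        ["ps", "-e", "-o", "pid,ppid,user,args"],
        capture_output=True, text=True, check=True,
    ).stdout

def parse_ps(text):
    """Parse 'pid ppid user args' lines into {pid: (ppid, user, args)}."""
    procs = {}
    for line in text.strip().splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4 or not parts[0].isdigit():
            continue  # skip the header and malformed lines
        pid, ppid, user, args = parts
        procs[int(pid)] = (int(ppid), user, args)
    return procs

def under_shepherd(pid, procs):
    """True if pid or one of its ancestors is an sge_shepherd process."""
    seen = set()
    while pid in procs and pid not in seen:
        seen.add(pid)  # guard against cycles in a racy snapshot
        ppid, _, args = procs[pid]
        if os.path.basename(args.split()[0]).startswith("sge_shepherd"):
            return True
        pid = ppid
    return False

def strays(procs, users):
    """Sorted PIDs owned by 'users' that have no sge_shepherd ancestor."""
    return sorted(
        pid for pid, (_, user, _) in procs.items()
        if user in users and not under_shepherd(pid, procs)
    )
```

On a node this would be called as strays(parse_ps(snapshot()), {"steve"}),
with the set holding the cluster users of interest. Interactive logins of
those users will show up as well, so the result is only a list of
candidates to inspect by hand.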

best,
Steve

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307950