[GE users] Issues with sge_commd

Bernard Li bli at bcgsc.ca
Thu Jul 15 18:30:38 BST 2004


Hi list:

Recently we have been seeing high load on our head node, with sge_commd
stuck at 99% CPU.  Occasionally I can do a softstop and bring up rcsge
again, and that solves the problem.  Most of the time, however,
sge_commd remains stuck.

We have been running SGE 5.3p5 happily for quite a while now without
many problems (except for the stale file-handle issue).  We recently
added about 70 more nodes to our cluster, and stability has been spotty
since then.

SGE is installed on the local disk instead of on an NFS share, and we
suspect that this might be the cause, but we are not entirely sure.
Another note is that most of our jobs are array jobs: we have written a
script to convert batch jobs into array jobs, so users submit a lot of
array jobs (100+ jobs each).
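
For readers unfamiliar with this kind of conversion, here is a rough
sketch of how a list of batch commands could be turned into a single
array job. It is an illustration only, not the script mentioned above.
The function name build_array_job and the file names commands.txt and
run_task.sh are hypothetical; only qsub -t, the "#$" directive prefix,
-cwd, and the SGE_TASK_ID variable come from Grid Engine itself.

```python
import shlex


def build_array_job(commands, commands_file="commands.txt",
                    wrapper="run_task.sh"):
    """Return (wrapper_script_text, qsub_argv) for a list of shell commands.

    Hypothetical helper: the caller is assumed to write one command per
    line to commands_file and to save the wrapper text as `wrapper`.
    """
    if not commands:
        raise ValueError("need at least one command")

    # Each array task runs the line of commands_file whose line number
    # equals its SGE_TASK_ID (tasks are numbered from 1).
    wrapper_text = (
        "#!/bin/sh\n"
        "#$ -cwd\n"
        f'cmd=$(sed -n "${{SGE_TASK_ID}}p" {shlex.quote(commands_file)})\n'
        'eval "$cmd"\n'
    )

    # One submission with N tasks instead of N separate submissions.
    qsub_argv = ["qsub", "-t", f"1-{len(commands)}", wrapper]
    return wrapper_text, qsub_argv


if __name__ == "__main__":
    text, argv = build_array_job(["echo a", "echo b", "echo c"])
    print(" ".join(argv))
    print(text)
```

The sketch only builds the wrapper text and the qsub argument list; it
does not submit anything, so it can be tried without a running cluster.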

We have tried doing sgecommdcntl -d but didn't notice anything out of
the ordinary.

If you have any ideas, please let us know.

Thanks,

Bernard


