[GE users] Problems with large array job

Bernard Li bli at bcgsc.ca
Thu Aug 26 01:39:39 BST 2004


Hi list:

A user submitted an array job with 26,000 tasks, and sge_commd's CPU
usage kept climbing until it was stuck at 99%, at which point the
cluster became unresponsive.

We are not sure what the problem is. Does anyone have experience
running large array jobs?

In the /opt/sge/default/spool/qmaster/messages file, we noticed a lot
of warnings like this one:

Wed Aug 25 12:54:04 2004|qmaster|headnode|W|failed to deliver job 157125.1598 to queue "client1.q"
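
A rough sketch for tallying these warnings per queue, in Python. It
assumes each entry in the messages file sits on a single line in the
pipe-delimited layout shown above (timestamp|daemon|host|level|message).
The function name count_failed_deliveries and the regular expression are
illustrative only and are not part of Grid Engine.

```python
import re
from collections import Counter

# Assumed message text, modelled on the one warning quoted above.
FAILED = re.compile(r'failed to deliver job (\d+)\.(\d+) to queue "([^"]+)"')


def count_failed_deliveries(lines):
    """Return a Counter mapping queue name -> number of failed deliveries."""
    per_queue = Counter()
    for line in lines:
        # Assumed layout: timestamp|daemon|host|level|message
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # skip anything that does not match the assumed layout
        match = FAILED.search(fields[4])
        if match:
            per_queue[match.group(3)] += 1
    return per_queue


# The sample is the warning from this message, joined onto one line.
sample = [
    'Wed Aug 25 12:54:04 2004|qmaster|headnode|W|'
    'failed to deliver job 157125.1598 to queue "client1.q"',
]
print(count_failed_deliveries(sample))
```

To run it against the real file, pass an open file handle for
/opt/sge/default/spool/qmaster/messages instead of the sample list. A
count concentrated on one or two queues would point at particular nodes
rather than at the array job as a whole.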

We are not using shared NFS for $SGE_ROOT; instead, each client node
has its own installation in /opt/sge.

Thanks in advance for any advice/suggestions.

Cheers,

Bernard
