[GE users] Strange issue with hanging jobs
aaron at cs.york.ac.uk
Thu Nov 20 17:21:51 GMT 2008
I have a strange issue that currently affects one user and his jobs
but has the potential to affect more.
Our cluster is running 6.1u4 (we have some scheduled downtime soon, so
we have the opportunity to upgrade). It's a fairly ordinary installation
on Scientific Linux 4.5 with fair sharing and nothing particularly
special as far as I can see.
However, one user's MATLAB 2008 job, which uses just the runtime, hangs
on our cluster, but only when run through Sun Grid Engine.
By way of debugging I tried a vanilla install
of 6.1u4 in a virtual machine using the same OS, and I have no problems
with the job completing as intended. Unfortunately I can't seem to
replicate our qmaster's exact installation in a virtual machine (it
fails to start the master) to go further with the debugging. I'd prefer
to tweak settings in a VM copy of the install rather than on the live
system, of course.
A parallel cluster with essentially the same software set but currently
with 6.1u5 installed seems to also work as intended. The SGE
configuration is virtually out-of-the-box on this system as users
haven't really used it (and SGE will not be the main method for running
jobs anyway) and so it hasn't been tuned until I get more of an idea of
what users will want from this system.
The MATLAB code is essentially a 'Hello World' piece of code. I can give
an indication of what line the C code stops at if required. The code
would ordinarily start up a GUI, but on the systems where it works it
has the sense not to bother when it finds there is no X environment. On
those systems it completes in about 1 minute of elapsed time. On our
main cluster it runs until the queuing system kills it after its
allotted time in the queue has ended.
Has anyone seen anything similar, and can anyone offer hints as to what
settings need to be tweaked, or on how I can get a copy of the main
environment running in a VM to aid debugging?
Ultimately we may well have more users running MATLAB runtime-based
programs, so it would be good for us to get this sorted out before that
happens.