[GE users] Strange issue with hanging jobs

Aaron Turner aaron at cs.york.ac.uk
Thu Nov 20 17:21:51 GMT 2008


Hello everyone,

I have a strange issue that currently affects only one user's jobs but 
could end up affecting more.

Our cluster is running 6.1u4 (we have some scheduled downtime soon, so 
we have the opportunity to upgrade). It's a fairly ordinary installation 
on Scientific Linux 4.5 with fair sharing and nothing particularly 
special as far as I can see.

However, one user's MATLAB 2008 job, which uses only the MATLAB runtime, 
hangs on our cluster, but only when it is run through Sun Grid Engine. 
To debug this I tried a vanilla install of 6.1u4 in a virtual machine 
using the same OS, and there the job completes as intended. 
Unfortunately I can't seem to replicate our exact qmaster installation 
in a virtual machine (the master fails to start), so I can't take the 
debugging further. I'd rather tweak settings in a VM copy of the install 
than on the live system, of course.

A parallel cluster with essentially the same software set but currently 
with 6.1u5 installed seems to also work as intended. The SGE 
configuration is virtually out-of-the-box on this system as users 
haven't really used it (and SGE will not be the main method for running 
jobs anyway) and so it hasn't been tuned until I get more of an idea of 
what users will want from this system.

The MATLAB code is essentially a 'Hello World' program. I can give an 
indication of which line the C code stops at if required. The code would 
ordinarily start up a GUI, but on the systems where it works it has the 
sense not to bother when it finds there is no X environment. On those 
systems it completes in about 1 minute of elapsed time. On our main 
cluster it runs until the queuing system kills it once its allotted time 
in the queue has ended.

Has anyone seen anything similar, and can anyone offer hints as to which 
settings need to be tweaked, or how I can get a copy of the main 
environment running in a VM to aid debugging?

Ultimately we may well have more users running MATLAB runtime-based 
programs, so it would be good for us to get this sorted out before that 
happens!

Many thanks

   Aaron Turner

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89248
