[GE users] eqw problem.

Wilfred Li wilfred at sdsc.edu
Fri Sep 7 07:50:25 BST 2007


I've seen this problem occur when the home directories fail to mount on
the compute nodes, which can happen for various reasons.

 

Check that "/home/.../my-script-folder" is accessible from the nodes
listed in the machinefile, or simply check all the compute nodes.
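
A minimal sketch of that check, assuming passwordless ssh to the compute
nodes and a machinefile with one host name at the start of each line. The
function names check_dir and check_nodes are made up for illustration; the
test simply tries to chdir into the directory, which is the step the job
reported as failing.

```shell
#!/bin/sh

# check_dir DIR: succeed only if the current host can chdir into DIR.
check_dir() {
    if ( cd "$1" ) 2>/dev/null; then
        echo "ok: $1"
    else
        echo "cannot chdir to $1" >&2
        return 1
    fi
}

# check_nodes MACHINEFILE DIR: try the same chdir on every listed host.
# Assumes the host name is the first field on each line and that DIR
# contains no single quotes.
check_nodes() {
    machinefile=$1
    dir=$2
    while read -r host _rest; do
        [ -n "$host" ] || continue
        # -n keeps ssh from swallowing the rest of the machinefile on stdin
        if ssh -n -o BatchMode=yes "$host" "cd '$dir'" 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: cannot chdir to $dir"
        fi
    done < "$machinefile"
}
```

Running check_nodes with the machinefile and the script folder as arguments
prints one line per host; any host reported as failing is a candidate for a
missing home directory mount.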



Regards,

 

Wilfred

From: jiangfan shi [mailto:jiangfan.shi at gmail.com] 
Sent: Thursday, September 06, 2007 9:47 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] eqw problem.

 

My Eqw problem was still not solved after I tried choosing a specific
queue to run in. Running qstat -j jobid gave me the following:


script_file:                /home/.../my.sh
error reason    1:          09/06/2007 23:41:25 [7026:7382]: error: can't chdir to /home/.../my-script-folder

What is this error? Can anyone help me?

Thanks.

Jiangfan



On 9/6/07, Nicholas Senedzuk <nicholas.senedzuk at gmail.com> wrote:

Eqw means that the job is in an error state (that's the E) while it is
queued and waiting (that's the qw). The jobs will retry themselves after a
certain amount of time if you have them configured to. What is most likely
happening is that one system has a problem, so when a job attempts to run
on that system it errors out into the Eqw state and another job is
dispatched to that system. When you rerun these jobs, they end up on
another host that is not having a problem and go into the r state.

What I would recommend is finding the system or systems that are having
the problem, disabling those nodes, and then running all the jobs again to
see what happens. If no jobs go into the Eqw state, you have found the
problem and just need to work out why jobs are not running correctly on
those nodes.
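
A rough sketch of the first step, assuming the default qstat column order
(job ID, priority, name, user, state, ...). The helper name eqw_jobs is made
up for illustration; it only filters text, so it can be tried on saved qstat
output as well as on a live cluster.

```shell
#!/bin/sh

# eqw_jobs: read a qstat listing on stdin and print the job IDs whose
# state column (assumed to be the fifth field) is exactly "Eqw".
eqw_jobs() {
    awk '$5 == "Eqw" { print $1 }'
}

# Possible usage on a cluster (queue and host names are hypothetical):
#   qstat -u jfshi | eqw_jobs | while read -r id; do
#       qstat -j "$id" | grep 'error reason'
#   done
#   qmod -d 'all.q@node07'   # disable the queue instance on a suspect host
#   qmod -cj 201036          # clear the job's error state so it is retried
```

The error reason lines show what failed for each job; once the suspect host
is disabled, clearing the error state lets the scheduler place the job on one
of the remaining hosts.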





On 9/6/07, jiangfan shi <jiangfan.shi at gmail.com> wrote:

Hi,

I get an "Eqw" state when I use qstat to check the status of my jobs. Some
jobs go into the "r" state successfully, but some go into "Eqw". When I
run those jobs again, sometimes all of them go into the "r" state, but
most of the time 3 or 8 of them end up in "Eqw".

For the ex.out log information, I got the following:

/bin/bash: /root/.bashrc: Permission denied 
/home/grad/jfshi/sandbox

/threshold/mini-threshold/maetg: error while loading shared libraries:
libstdc++.so.6: cannot open shared object file: No such file or directory


Originally I used the -V flag with qsub to resolve that problem, and it
worked at the time. But now I get the "Eqw" problem.

The following is the job information:

201036 0.00000 reuse-mini jfshi        Eqw   09/03/2007 21:28:30     1
201044 0.00000 reuse-mini jfshi        Eqw   09/03/2007 21:28:31

Can anyone tell me the solution?

Thanks.

Jiangfan

 

 



