[GE users] eqw problem.

jiangfan shi jiangfan.shi at gmail.com
Fri Sep 7 21:03:57 BST 2007



The problem is solved now. Some of the nodes had been dropping the /home
folder, and that is why changing directory failed.
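
For anyone who hits the same "can't chdir" error, a small check run on each execution host can confirm whether the job's working directory is visible there. This is only a rough sketch: the function name check_workdir is my own invention, and the path to pass in is whatever directory your job runs from (the path in the error above is elided, so substitute your own).

```python
import os


def check_workdir(path):
    """Return (ok, reason) describing whether this host could chdir into path."""
    if not os.path.exists(path):
        return False, "missing: %s does not exist on this host" % path
    if not os.path.isdir(path):
        return False, "not a directory: %s" % path
    # chdir needs search (execute) permission on the directory itself.
    if not os.access(path, os.X_OK):
        return False, "no permission to enter %s" % path
    return True, "ok"


# Example call against the current directory, which should always pass.
print(check_workdir(os.getcwd()))
```

Running this on a host that has lost /home should report the directory as missing, which matches what the error reason from qstat -j was saying.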

Thanks to everybody for the hints.

Have a good weekend.

Jiangfan

On 9/7/07, Beadles, Jeff <jeff_beadles at mentor.com> wrote:
>
>  It means that the execution host couldn't see /home/.../my-script-folder
>
> Run qacct -j job# to see what host the job failed on, and go check it to
> see what's wrong.
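
If there are many failed jobs, the host lookup can be scripted. Below is a minimal sketch that parses the key/value text printed by qacct -j and pulls out the hostname for each record. The function name parse_qacct is mine, and the sample text is illustrative only: the host name and field values are made up, not taken from this cluster.

```python
def parse_qacct(text):
    """Parse `qacct -j` output into a list of dicts, one per accounting record."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Records are separated by a line made up only of '=' characters.
        if set(line) == {"="}:
            if current:
                records.append(current)
                current = {}
            continue
        parts = line.split(None, 1)
        current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    if current:
        records.append(current)
    return records


# Illustrative sample only; the host name and values are made up.
SAMPLE = """\
==============================================================
qname        all.q
hostname     node07.example.com
owner        jfshi
jobname      reuse-mini
jobnumber    201036
failed       1
exit_status  0
"""

for rec in parse_qacct(SAMPLE):
    print(rec["jobnumber"], rec["hostname"], rec["failed"])
```

In practice the text would come from running qacct -j with the job number, for example through subprocess.run with capture_output=True and text=True, rather than from a hard-coded string.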
>
> (Is it just me, or is it odd that qstat tells you everything you want
> to know except where the failed job tried to run?)
>
> Jeff
>
> ------------------------------
> *From:* jiangfan shi [mailto:jiangfan.shi at gmail.com]
> *Sent:* Thu 9/6/2007 9:47 PM
> *To:* users at gridengine.sunsource.net
> *Subject:* Re: [GE users] eqw problem.
>
> Again, my Eqw problem was not solved after I tried choosing a specific queue
> to run in. From qstat -j jobid, I got the following:
>
>
> script_file:                /home/.../my.sh
> error reason    1:          09/06/2007 23:41:25 [7026:7382]: error: can't chdir to /home/.../my-script-folder
>
> What is this error? Can anyone help me?
>
> Thanks.
>
> Jiangfan
>
>
> On 9/6/07, Nicholas Senedzuk < nicholas.senedzuk at gmail.com> wrote:
> >
> > What Eqw means is that the job is in an error state (that's the E) and is
> > queued and waiting (that's the qw). The jobs will retry themselves after a
> > certain amount of time if you have them configured to. What most likely is
> > happening is that one system has a problem: when a job attempts to run on
> > that system, it errors out into the Eqw state, and another job is then
> > dispatched to the same system. When you rerun these jobs, they end up
> > running on another host that is not having a problem and go into the r state.
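
To see at a glance which jobs are stuck, the state column of a plain qstat listing can be filtered for the E flag. This is a sketch under one assumption: the default column order (job-ID, prior, name, user, state, date, time). The function names are my own. The two Eqw lines in the sample are the ones quoted further down in this thread; the "r" line is a made-up example for contrast.

```python
def parse_qstat_line(line):
    """Split one plain `qstat` job line into its leading columns.

    Assumes the default column order: job-ID, prior, name, user, state,
    then the submit/start date and time.
    """
    fields = line.split()
    if len(fields) < 7 or not fields[0].isdigit():
        return None  # header, separator, or wrapped fragment
    return {
        "job_id": fields[0],
        "prior": fields[1],
        "name": fields[2],
        "user": fields[3],
        "state": fields[4],
        "date": fields[5],
        "time": fields[6],
    }


def jobs_in_error(qstat_text):
    """Return the IDs of jobs whose state contains 'E' (error), e.g. Eqw."""
    ids = []
    for line in qstat_text.splitlines():
        job = parse_qstat_line(line)
        if job and "E" in job["state"]:
            ids.append(job["job_id"])
    return ids


# The third line (state "r", queue all.q@node01) is a made-up example.
LISTING = """\
201036 0.00000 reuse-mini jfshi        Eqw   09/03/2007 21:28:30                                     1
201044 0.00000 reuse-mini jfshi        Eqw   09/03/2007 21:28:31
201050 0.55500 reuse-mini jfshi        r     09/03/2007 21:28:40 all.q@node01                       1
"""

print(jobs_in_error(LISTING))  # ['201036', '201044']
```

Once the underlying problem on the host is fixed, the error state on each listed job can be cleared with qmod -cj followed by the job ID, so that the job becomes eligible for scheduling again.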
> >
> >
> > So what I would recommend is finding the system or systems that are
> > having the problem, disabling those nodes, and then running all the jobs
> > to see what happens. If no jobs go into the Eqw state, then you have found
> > the problem and you just need to find out why the jobs are not running
> > correctly on those nodes.
> >
> >
> > On 9/6/07, jiangfan shi < jiangfan.shi at gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I get an "Eqw" error state when I use qstat to see the status of my jobs.
> > > Some jobs successfully go into the "r" state, but some go into "Eqw".
> > > When I run those jobs again, sometimes all of them go into the "r"
> > > state, but most of the time there are always 3 or 8 that go into "Eqw".
> > >
> > > In the ex.out log, I got the following:
> > >
> > > /bin/bash: /root/.bashrc: Permission denied
> > > /home/grad/jfshi/sandbox/threshold/mini-threshold/maetg: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory
> > >
> > >
> > > Originally I used the -V flag with qsub to resolve this kind of problem.
> > > It worked at that time, but now it gives me the "Eqw" problem.
> > >
> > > The following is the job information:
> > >
> > > 201036 0.00000 reuse-mini jfshi        Eqw   09/03/2007 21:28:30                                     1
> > > 201044 0.00000 reuse-mini jfshi        Eqw   09/03/2007 21:28:31
> > >
> > > Can anyone tell me the solution?
> > >
> > > Thanks.
> > >
> > > Jiangfan
> > >
> >
> >
>


