[GE users] JSV scripts running unreliably

ah_sunsource ahaupt at ifh.de
Wed Jun 10 11:15:48 BST 2009


Hi,

I'm experimenting a bit with the new JSV feature in SGE 6.2u2. I've
written a server-side JSV that checks whether the user requests at least
256M for h_vmem (below that, the prolog script might die for lack of
memory, leaving the queue in an error state).
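
For illustration only, here is a minimal sketch of the kind of check described above. It is not the actual job_verifier script; the helper names mem_to_bytes and h_vmem_ok are hypothetical, and the sketch assumes plain decimal values with an optional k/K/m/M/g/G suffix (lowercase as powers of 1000, uppercase as powers of 1024).

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the real job_verifier script.

# Smallest h_vmem request that should be accepted: 256M in bytes.
MIN_H_VMEM_BYTES=$((256 * 1024 * 1024))

# Convert a value such as "128M" to bytes and print it.
# Assumption: a decimal integer without leading zeros, optionally
# followed by one of k, K, m, M, g, G. Returns 1 for anything else.
mem_to_bytes() {
    case "$1" in
        *k) mult=1000;       number=${1%k} ;;
        *K) mult=1024;       number=${1%K} ;;
        *m) mult=1000000;    number=${1%m} ;;
        *M) mult=1048576;    number=${1%M} ;;
        *g) mult=1000000000; number=${1%g} ;;
        *G) mult=1073741824; number=${1%G} ;;
        *)  mult=1;          number=$1 ;;
    esac
    case "$number" in
        ''|*[!0-9]*) return 1 ;;   # empty or not purely digits
    esac
    echo $((number * mult))
}

# Succeed only if the requested h_vmem is at least the minimum.
h_vmem_ok() {
    bytes=$(mem_to_bytes "$1") || return 1
    [ "$bytes" -ge "$MIN_H_VMEM_BYTES" ]
}

# Inside a shell JSV built on jsv_include.sh, the verify callback could
# presumably use it along these lines (untested assumption):
#   if [ "$(jsv_sub_is_param l_hard h_vmem)" = "true" ]; then
#       h_vmem_ok "$(jsv_sub_get_param l_hard h_vmem)" ||
#           jsv_reject "Do not require less than 256M for h_vmem."
#   fi
```

With these helpers, `h_vmem_ok 128M` fails and `h_vmem_ok 256M` succeeds. Note that a lowercase request such as 256m is 256000000 bytes and would therefore fall below the limit.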

Unfortunately the jsv feature is not reliable:

[oreade38] ~ % for i in {1..5}; do
echo hostname | qsub -l h_vmem=128M
done
Unable to run job: Do not require less than 256M for h_vmem.
Exiting.
Unable to run job: Do not require less than 256M for h_vmem.
Exiting.
Unable to run job: master got unknown command from JSV: "ERROR".
Exiting.
Unable to run job: master got unknown command from JSV: "ERROR".
Exiting.
Unable to run job: Do not require less than 256M for h_vmem.
Exiting.

In the server logs I see messages like this:

06/10/2009 11:30:35|worker|lolek-vm1|I|JSV modification time in "worker001" has changed
06/10/2009 11:30:36|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been stopped
06/10/2009 11:30:36|worker|lolek-vm1|I|JSV modification time in "worker001" has changed
06/10/2009 11:30:36|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been started
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker001" rejected job 921
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV modification time in "worker000" has changed
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV modification time in "worker000" has changed
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been started
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker000" rejected job 922
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker001" rejected job 923
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker001" will be restarted.
06/10/2009 11:30:38|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been stopped
06/10/2009 11:30:38|worker|lolek-vm1|I|JSV "worker000" rejected job 924
06/10/2009 11:30:38|worker|lolek-vm1|I|JSV "worker000" will be restarted.
06/10/2009 11:30:39|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been stopped
06/10/2009 11:30:39|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been started
06/10/2009 11:30:40|worker|lolek-vm1|I|JSV "worker001" rejected job 925

It looks like the success of the script is oscillating. Could this be a bug?

Cheers,
Andreas
-- 
| Andreas Haupt             | E-Mail: andreas.haupt at desy.de
|  DESY Zeuthen             | WWW:    http://www-zeuthen.desy.de/~ahaupt
|  Platanenallee 6          | Phone:  +49/33762/7-7359
|  D-15738 Zeuthen          | Fax:    +49/33762/7-7216

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=201408
