[GE users] abaqus issues on ge 6.2u5

kasper_fischer kasper.fischer at ruhr-uni-bochum.de
Fri May 28 14:23:30 BST 2010


Hi Len Zaifman,

I am running Abaqus on an SGE cluster very successfully. I followed the instructions on the Abaqus Support pages to create a custom abaqus_v6.env file adapted to our cluster setup, which includes the "Olesen" FLEXlm setup to manage the licenses.

Of course, it is very hard to tell what is going wrong without any error messages. I would qrsh or qlogin into the machine and run a small Abaqus job manually, then check whether it runs and can contact the license server. Depending on your setup, the error messages might appear either in the SGE output files (*.e*, *.o*) or in the Abaqus log files (*.log, *.msg, *.dat), in the working directory or in your home directory.
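
The log check can be scripted. What follows is only a minimal Python sketch under stated assumptions: all output files sit directly in one directory, SGE uses its default output names (<jobname>.o<jobid> and <jobname>.e<jobid>), and the names `scan_logs`, `is_log_file` and the keyword list are hypothetical choices for illustration, not anything Abaqus or SGE prescribes.

```python
import os
import re

# SGE's default job output files are named <jobname>.o<jobid> and
# <jobname>.e<jobid> (array tasks append .<taskid>). Requiring digits
# avoids picking up binary Abaqus files such as *.odb.
SGE_OUTPUT = re.compile(r"\.[eo]\d+(\.\d+)?$")
ABAQUS_SUFFIXES = (".log", ".msg", ".dat")
# Illustrative keywords only; adjust to what your Abaqus version prints.
KEYWORDS = ("error", "license", "abort")


def is_log_file(name):
    """True for Abaqus text logs and SGE stdout/stderr files."""
    return name.endswith(ABAQUS_SUFFIXES) or SGE_OUTPUT.search(name) is not None


def scan_logs(directory, keywords=KEYWORDS):
    """Return (filename, line_number, text) for every line with a keyword."""
    hits = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and is_log_file(name)):
            continue
        # errors="replace" keeps the scan going if a log has odd bytes.
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                lowered = line.lower()
                if any(word in lowered for word in keywords):
                    hits.append((name, number, line.rstrip()))
    return hits
```

Calling `scan_logs(".")` in the job's working directory, and again in the home directory, lists candidate lines to read first. An empty result does not prove the job was healthy, only that none of the assumed keywords appeared.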

Greetings

Kasper




On 27.05.10 18:52, leonardz wrote:
> We are trying to see why an abaqus job is failing.
>
> We submit the job requesting 36 GB of memory with -l h_vmem=36g. This will run on our 128 GB node, so we should be okay.
>
> The job runs through an initialisation phase and then fails without writing any error output, not even to the job's .e and .o files. The only thing we have to go on is the qacct record, which is below.
>
> When I look up exit status=1 on the ge webpage (
> http://wikis.sun.com/display/gridengine62u5/Error+Messages)
>
> I see
>
> 1       Presumably before job   f       Job could not be started
>
> As you can see below, the job ran for about 29 minutes of wall-clock time (about 16 minutes of user CPU time) and used only 4.27 GB, about 12% of the requested amount.
>
> We don't get any indication from Abaqus of what went wrong. The messages file on the compute node shows no messages.
>
> The qacct output below doesn't seem to indicate what the problem is (that I can see). Error 1 seems to indicate the job did not start, but it ran for about 29 minutes.
>
> Does anybody run SGE and Abaqus? Any ideas on where to look?
> Or, alternatively, can anyone spot a clue in the qacct record?
> Thanks.
>
> qacct -j 157741
> qname        abaqus.q
> hostname     cn-r5-34
> .....
> jobnumber    157741
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Thu May 27 10:46:47 2010
> start_time   Thu May 27 10:46:55 2010
> end_time     Thu May 27 11:15:36 2010
> granted_pe   NONE
> slots        1
> failed       0
> exit_status  1
> ru_wallclock 1721
> ru_utime     980.909
> ru_stime     150.449
> ru_maxrss    0
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    6855375
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   0
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     232019
> ru_nivcsw    146249
> cpu          1131.359
> mem          1690.376
> io           0.000
> iow          0.000
> maxvmem      4.273G
> arid         undefined
>
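
The figures in this record can be checked with a few lines of Python. This is a rough sketch, not a general qacct parser: `parse_qacct` is a hypothetical helper, the sample contains only the fields quoted above, and the 36 GB request is taken from the h_vmem value in the message. Note also that qacct lists `failed` and `exit_status` as separate fields. As far as accounting(5) describes them, `failed` is Grid Engine's own failure code and `exit_status` is the exit code of the job script, so `failed 0` with `exit_status 1` would mean the job started and the script itself returned 1; treat that reading as an assumption to verify against the man page.

```python
def parse_qacct(text):
    """Parse 'key value' lines of qacct -j output into a dict of strings."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


# Fields copied from the qacct record quoted above.
SAMPLE = """\
failed       0
exit_status  1
ru_wallclock 1721
ru_utime     980.909
ru_stime     150.449
maxvmem      4.273G
"""

rec = parse_qacct(SAMPLE)
wall_min = float(rec["ru_wallclock"]) / 60   # seconds to minutes
user_min = float(rec["ru_utime"]) / 60
vmem_gb = float(rec["maxvmem"].rstrip("G"))  # assumes a value in G, as here
requested_gb = 36                            # from -l h_vmem=36g

print(f"wall clock {wall_min:.1f} min, user CPU {user_min:.1f} min")
# prints: wall clock 28.7 min, user CPU 16.3 min
print(f"peak vmem is {vmem_gb / requested_gb:.0%} of the h_vmem request")
# prints: peak vmem is 12% of the h_vmem request
```

The wall-clock value agrees with the start_time and end_time fields (10:46:55 to 11:15:36), and the memory figure confirms that the h_vmem limit was nowhere near being reached.
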
>
>
> Len Zaifman
> Systems Manager, High Performance Systems
> The Centre for Computational Biology
> The Hospital for Sick Children
> 555 University Ave.
> Toronto, Ont M5G 1X8
>
> tel: 416-813-5513
> email: leonardz at sickkids.ca
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=259062
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=259358