[GE users] abaqus issues on ge 6.2u5
kasper.fischer at ruhr-uni-bochum.de
Fri May 28 14:23:30 BST 2010
Hi Len Zaifman,
I am running Abaqus on an SGE cluster very successfully. I followed the instructions on the Abaqus Support pages to create a custom abaqus_v6.env file adapted to our cluster setup, which includes the "Olesen" flexlm setup to manage the licenses.
Of course, it is very hard to tell what's going wrong without any error messages. I would qrsh or qlogin into the machine and run a small Abaqus job manually, then check whether it runs and can contact the license server. Depending on your setup, the error messages might appear either in the SGE job output files (*.e*, *.o*) or in the Abaqus log files (*.log, *.msg, *.dat), in the working directory or in your home directory.
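As a rough aid for that search, here is a minimal Python sketch that lists SGE job output files and Abaqus .log/.msg/.dat files in a directory and prints lines containing a few keywords. The keyword list is a hypothetical heuristic, and the file-name pattern assumes SGE's default <job_name>.o<job_id> / <job_name>.e<job_id> naming; adjust both for your site. It complements, rather than replaces, running a small job by hand via qrsh or qlogin.

```python
import re
import sys
from pathlib import Path

# Assumed SGE default naming: <job_name>.o<job_id> / <job_name>.e<job_id>
# (array jobs append .<task_id>). Requiring digits avoids matching files
# such as abaqus_v6.env, which a plain "*.e*" glob would pick up.
SGE_OUTPUT = re.compile(r"\.[eo]\d+(\.\d+)?$")
# Abaqus files mentioned above
ABAQUS_SUFFIXES = {".log", ".msg", ".dat"}
# Hypothetical keyword list: a heuristic, not an official error format
KEYWORDS = ("error", "license", "killed", "abort")


def candidate_logs(directory):
    """Return SGE output files and Abaqus log files in `directory`, sorted."""
    found = []
    for path in Path(directory).iterdir():  # non-recursive
        if not path.is_file():
            continue
        if SGE_OUTPUT.search(path.name) or path.suffix in ABAQUS_SUFFIXES:
            found.append(path)
    return sorted(found)


def suspicious_lines(path, keywords=KEYWORDS):
    """Yield (line_number, text) for lines containing a keyword (case-insensitive)."""
    with open(path, errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            lowered = line.lower()
            if any(k in lowered for k in keywords):
                yield number, line.rstrip("\n")


def main(directories):
    for directory in directories:
        for path in candidate_logs(directory):
            for number, text in suspicious_lines(path):
                print(f"{path}:{number}: {text}")


if __name__ == "__main__":
    # Default: the current working directory and the home directory
    main(sys.argv[1:] or [".", str(Path.home())])
```

Run it from the job's working directory, optionally passing other directories as arguments. An empty result only means none of the chosen keywords appeared, not that the files are clean.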
On 27.05.10 18:52, leonardz wrote:
> We are trying to see why an abaqus job is failing.
> We submit the job requesting 36 GB of memory with -l h_vmem=36g; this will run on our 128 GB node, so we should be okay.
> The job runs through an initialisation phase and then fails without writing any error logs, including the job's .e and .o files. The only thing we have to go on is the qacct record, which is below.
> When I look up exit status=1 on the ge webpage (
> I see
> 1   Presumably before job   Job could not be started
> As you can see below, the job ran for about 29 minutes of wall-clock time (ru_wallclock 1721 s, about 16 minutes of it user CPU time) and used only 4.27 GB of virtual memory, about 12% of the requested amount.
> Abaqus gives us no indication of what went wrong, and the messages file on the compute node shows nothing either.
> As far as I can see, the qacct output below doesn't indicate what the problem is. Status 1 seems to indicate that the job did not start, yet it ran for about 29 minutes.
> Does anybody run SGE and Abaqus? Any ideas on where to look?
> Or, alternatively:
> Does anyone spot a clue in the qacct record?
> qacct -j 157741
> qname abaqus.q
> hostname cn-r5-34
> jobnumber 157741
> taskid undefined
> account sge
> priority 0
> qsub_time Thu May 27 10:46:47 2010
> start_time Thu May 27 10:46:55 2010
> end_time Thu May 27 11:15:36 2010
> granted_pe NONE
> slots 1
> failed 0
> exit_status 1
> ru_wallclock 1721
> ru_utime 980.909
> ru_stime 150.449
> ru_maxrss 0
> ru_ixrss 0
> ru_ismrss 0
> ru_idrss 0
> ru_isrss 0
> ru_minflt 6855375
> ru_majflt 0
> ru_nswap 0
> ru_inblock 0
> ru_oublock 0
> ru_msgsnd 0
> ru_msgrcv 0
> ru_nsignals 0
> ru_nvcsw 232019
> ru_nivcsw 146249
> cpu 1131.359
> mem 1690.376
> io 0.000
> iow 0.000
> maxvmem 4.273G
> arid undefined
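One detail worth checking in this record: the Grid Engine failure-code table quoted above describes the `failed` field, which is 0 here, whereas `exit_status` is the exit code of the job script itself (presumably the Abaqus driver in this case). Values above 128 usually indicate a kill by signal (exit_status - 128), so 1 does not look like an h_vmem kill. Below is a minimal Python sketch that pulls these numbers out of `qacct -j` text. It assumes the "key value" layout shown above, the timestamp format of this record, and Grid Engine's memory multipliers (lowercase k/m/g as powers of 1000, uppercase K/M/G as powers of 1024). The h_vmem request is passed in by hand because this record does not contain it.

```python
from datetime import datetime

# Timestamp format as printed in the record above, e.g. "Thu May 27 10:46:55 2010"
QACCT_TIME = "%a %b %d %H:%M:%S %Y"
# Grid Engine memory multipliers: lowercase = powers of 1000, uppercase = powers of 1024
MULTIPLIERS = {"k": 1000, "m": 1000**2, "g": 1000**3,
               "K": 1024, "M": 1024**2, "G": 1024**3}


def to_bytes(spec):
    """Convert a memory value such as '36g' or '4.273G' to bytes."""
    spec = spec.strip()
    if spec and spec[-1] in MULTIPLIERS:
        return float(spec[:-1]) * MULTIPLIERS[spec[-1]]
    return float(spec)


def parse_qacct(text):
    """Parse the 'key value' lines of `qacct -j` output into a dict of strings."""
    record = {}
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()  # also tolerates e-mail quoting
        if not line or line.startswith("="):
            continue
        key, _, value = line.partition(" ")
        record[key] = value.strip()
    return record


def diagnose(record, requested):
    """Summarise a qacct record; `requested` is the h_vmem request, e.g. '36g'."""
    start = datetime.strptime(record["start_time"], QACCT_TIME)
    end = datetime.strptime(record["end_time"], QACCT_TIME)
    exit_status = int(record["exit_status"])
    return {
        # Grid Engine's own failure code; some versions append a description,
        # so keep only the leading number. 0 means no GE-side failure.
        "grid_engine_failed": int(record["failed"].split()[0]),
        # Exit code of the job script; > 128 usually means killed by a signal.
        "exit_status": exit_status,
        "signal": exit_status - 128 if exit_status > 128 else None,
        "wallclock_min": (end - start).total_seconds() / 60,
        "user_cpu_min": float(record["ru_utime"]) / 60,
        "vmem_fraction": to_bytes(record["maxvmem"]) / to_bytes(requested),
    }


if __name__ == "__main__":
    sample = """\
start_time Thu May 27 10:46:55 2010
end_time Thu May 27 11:15:36 2010
failed 0
exit_status 1
ru_utime 980.909
maxvmem 4.273G
"""
    print(diagnose(parse_qacct(sample), "36g"))
```

On the values from this record it reports failed 0, exit_status 1 with no signal, about 28.7 minutes of wall-clock time, about 16.3 minutes of user CPU time and a vmem fraction of roughly 0.13.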
> Len Zaifman
> Systems Manager, High Performance Systems
> The Centre for Computational Biology
> The Hospital for Sick Children
> 555 University Ave.
> Toronto, Ont M5G 1X8
> tel: 416-813-5513
> email: leonardz at sickkids.ca
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].