[GE users] errors and usage

craffi dag at sonsorol.org
Wed Nov 26 17:23:53 GMT 2008


Hi Robert,

Looks like there is an issue with the MPI environment and/or the  
integration with SGE. Debugging MPI problems within SGE tends to be  
complicated, even for experienced SGE admins - it makes sense to  
compartmentalize and make things simple when trying to track down  
problems.

My first recommendation is to verify 100% that your MPI environment  
and application work well outside of Grid Engine. What this means in  
practice is that you should be using the "mpirun" script that came  
with your MPI install and you should be able to launch and run your  
job against many machines by simply feeding it a handcrafted
machine/hosts file.
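
As a minimal sketch of that standalone check, the script below builds a
handcrafted hosts file and prints the mpirun command to try by hand,
outside of Grid Engine. The node names (n025, n028) and the 8 slots per
node are taken from the output quoted below; the file name hosts.manual
is made up for illustration, and the -machinefile flag is an assumption:
many MPI installs accept it, but check what your own mpirun expects.

```shell
#!/bin/sh
# Build a handcrafted hosts file with one line per MPI slot.
# hosts.manual is a hypothetical file name; override it via HOSTFILE.
HOSTFILE=${HOSTFILE:-./hosts.manual}
: > "$HOSTFILE"

# Node names and slot count come from the quoted pe_hostfile listing;
# adjust both for your cluster.
for node in n025 n028; do
    i=0
    while [ "$i" -lt 8 ]; do
        echo "$node" >> "$HOSTFILE"
        i=$((i + 1))
    done
done

# One process per line in the hosts file.
NP=$(wc -l < "$HOSTFILE" | tr -d ' ')

# Print the command instead of running it, so it can be inspected first.
# -machinefile is an assumption; your mpirun may use a different flag.
echo "mpirun -np $NP -machinefile $HOSTFILE ./charmm < my.stdin"
```

Run the printed command from a login shell on the head node. If it fails
there, the problem is in MPI, the network or SSH rather than in SGE.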

Once you know your MPI environment works and that your application  
runs successfully within it you can then join it up with Grid Engine.  
Should problems occur at that point, the root cause is often easier to
find because you will have ruled out most of the common MPI, network or
SSH related issues.

Regards,
Chris


On Nov 26, 2008, at 11:31 AM, Robert Fenwick wrote:

> Hi,
>
> I am new to the list and have been trying for a day to set things up,
> however I cannot manage to get my head around the grid engine and how
> it works. I have managed to compile my code on what I think is the head
> node and now I want to run my job. The errors that I am getting are:
>
> xecluster tests/sys32rep_0.01> more error
> error: 1: can't open environment file: No such file or directory
>
> [n025:28765] ERROR: A daemon on node n025 failed to start as expected.
> [n025:28765] ERROR: There may be more information available from
> [n025:28765] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [n025:28765] ERROR: If the problem persists, please restart the
> [n025:28765] ERROR: Grid Engine PE job
> [n025:28765] ERROR: The daemon exited unexpectedly with status 1.
> can't open environment file: No such file or directory
> [n025:28765] ERROR: A daemon on node n028 failed to start as expected.
> [n025:28765] ERROR: There may be more information available from
> [n025:28765] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [n025:28765] ERROR: If the problem persists, please restart the
> [n025:28765] ERROR: Grid Engine PE job
> [n025:28765] ERROR: The daemon exited unexpectedly with status 1.
>
> and:
>
> -catch_rsh /opt/sge/XEC/spool/n025/active_jobs/25176.1/pe_hostfile
> n025
> n025
> n025
> n025
> n025
> n025
> n025
> n025
> n028
> n028
> n028
> n028
> n028
> n028
> n028
> n028
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> [1] 28765
>
> My input script looks like:
>
> #!/bin/csh
> #$ -N sys32rep_test
> #$ -pe mpi 16
> #$ -q lmb.q
> #$ -cwd
> #$ -i my.stdin
> #$ -o test.out
> #$ -e error
>
> mpirun -np 32 ./charmm &
>
>
> Any help or suggestions for further places to look would be most  
> welcome.
>
> Thank you in advance,
>
>
> Bryn
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89996
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90004
