[GE users] qrsh in MPICH PE problem

Alessandro Federico alessandro.federico at caspur.it
Wed Nov 24 15:59:25 GMT 2004


Hi,

When I run an MPICH job on two nodes (dual Opteron) with the
rsh-wrapper (which uses qrsh), I get the following errors.

The qmaster logs the messages:

Wed Nov 24 12:48:27 2004|qmaster|poseidon|I|task 1.slacs16 at slacs16.caspur.it of job 3460.1 died through signal HUP
Wed Nov 24 12:48:27 2004|qmaster|poseidon|E|task 1.slacs16 of job 3460 failed - killing job
Wed Nov 24 12:48:33 2004|qmaster|poseidon|I|task 1.slacs01 at slacs01.caspur.it of job 3460.1 died through signal KILL
Wed Nov 24 12:48:33 2004|qmaster|poseidon|W|job 3460.1 failed on host slacs01.caspur.it assumedly after job because: job 3460.1 died through signal KILL (9)

The node slacs16 logs:

Wed Nov 24 12:48:29 2004|execd|slacs16|E|reaping job "3460" ptf complains: Job does not exist

but the directory
$SGE_ROOT/cell_name/spool/slacs16/active_jobs/3460.1/1.slacs16/
exists.

If I run the same job on one node everything is OK.
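
To line up the qmaster and execd messages in time, the pipe-delimited
format shown above can be split into fields. This is only a minimal
Python sketch, assuming each message sits on a single line (messages
wrapped in transit must be rejoined first) and that the English day and
month names parse under the current locale. parse_sge_message is a
hypothetical helper name, not part of Grid Engine.

```python
from datetime import datetime


def parse_sge_message(line):
    """Split one 'time|daemon|host|level|text' line into a dict.

    Hypothetical helper for illustration; the five-field layout is
    inferred from the messages quoted in this post.
    """
    # maxsplit=4 keeps any '|' inside the message text intact.
    stamp, daemon, host, level, text = line.split("|", 4)
    return {
        "time": datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"),
        "daemon": daemon,
        "host": host,
        "level": level,
        "text": text,
    }


# Three of the messages quoted above, rejoined onto single lines.
LINES = [
    'Wed Nov 24 12:48:29 2004|execd|slacs16|E|reaping job "3460" ptf '
    "complains: Job does not exist",
    "Wed Nov 24 12:48:27 2004|qmaster|poseidon|I|task 1.slacs16 at "
    "slacs16.caspur.it of job 3460.1 died through signal HUP",
    "Wed Nov 24 12:48:33 2004|qmaster|poseidon|I|task 1.slacs01 at "
    "slacs01.caspur.it of job 3460.1 died through signal KILL",
]

# Merge both logs into one timeline, oldest message first.
timeline = sorted((parse_sge_message(l) for l in LINES),
                  key=lambda rec: rec["time"])

for rec in timeline:
    print(rec["time"].time(), rec["daemon"], rec["host"],
          rec["level"], rec["text"])
```

Sorted this way, the HUP for task 1.slacs16 comes first, then the execd
"Job does not exist" complaint on slacs16, then the KILL on slacs01.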

Thanks.

ale



