[GE users] Sporadic errors in array tasks with a PE

kisielk kamil at zymeworks.com
Thu Apr 8 17:26:23 BST 2010


Yesterday I noticed that some of my users' array tasks that use a PE were failing at random. Each failure puts the queue instance into an error state and generates an email from SGE like the following:

Job 60497 caused action: Queue "batch.q@node077.cluster" set to ERROR
 User        = rosalia
 Queue       = batch.q@node077.cluster
 Start Time  = <unknown> 
 End Time    = <unknown> 
failed before job:04/07/2010 21:44:03 [1049:32249]: unable to find job file "/opt/sge/cluster/spool/node077/job_script 
Shepherd trace: 
04/07/2010 21:44:03 [0:32247]: shepherd called with uid = 0, euid = 0 
04/07/2010 21:44:03 [0:32247]: starting up 6.2u5 
04/07/2010 21:44:03 [0:32247]: setpgid(32247, 32247) returned 0 
04/07/2010 21:44:03 [0:32247]: do_core_binding: "binding" parameter not found in config file 
04/07/2010 21:44:03 [0:32247]: no prolog script to start 
04/07/2010 21:44:03 [0:32247]: /bin/true 
04/07/2010 21:44:03 [0:32247]: /bin/true 
04/07/2010 21:44:03 [0:32247]: parent: forked "pe_start" with pid 32248 
04/07/2010 21:44:03 [0:32247]: using signal delivery delay of 120 seconds 
04/07/2010 21:44:03 [0:32247]: parent: pe_start-pid: 32248 
04/07/2010 21:44:03 [0:32248]: child: starting son(pe_start, /bin/true, 0); 
04/07/2010 21:44:03 [0:32248]: pid=32248 pgrp=32248 sid=32248 old pgrp=32247 getlogin()=<no login set> 
04/07/2010 21:44:03 [0:32248]: reading passwd information for user 'rosalia' 
04/07/2010 21:44:03 [0:32248]: setting limits 
04/07/2010 21:44:03 [0:32248]: setting environment 
04/07/2010 21:44:03 [0:32248]: Initializing error file 
04/07/2010 21:44:03 [0:32248]: switching to intermediate/target user 
04/07/2010 21:44:03 [1049:32248]: closing all filedescriptors 
04/07/2010 21:44:03 [1049:32248]: further messages are in "error" and "trace" 
04/07/2010 21:44:03 [1049:32248]: using "/bin/bash" as shell of user "rosalia" 
04/07/2010 21:44:03 [1049:32248]: now running with uid=1049, euid=1049 
04/07/2010 21:44:03 [1049:32248]: execvp(/bin/true, "/bin/true") 
04/07/2010 21:44:03 [0:32247]: wait3 returned 32248 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0) 
04/07/2010 21:44:03 [0:32247]: pe_start exited with exit status 0 
04/07/2010 21:44:03 [0:32247]: reaped "pe_start" with pid 32248 
04/07/2010 21:44:03 [0:32247]: pe_start exited not due to signal 
04/07/2010 21:44:03 [0:32247]: pe_start exited with status 0 
04/07/2010 21:44:03 [0:32247]: parent: forked "job" with pid 32249 
04/07/2010 21:44:03 [0:32247]: parent: job-pid: 32249 
04/07/2010 21:44:03 [0:32249]: child: starting son(job, /opt/sge/cluster/spool/node077/job_scripts/60497, 0); 
04/07/2010 21:44:03 [0:32249]: pid=32249 pgrp=32249 sid=32249 old pgrp=32247 getlogin()=<no login set> 
04/07/2010 21:44:03 [0:32249]: reading passwd information for user 'rosalia' 
04/07/2010 21:44:03 [0:32249]: setosjobid: uid = 0, euid = 0 
04/07/2010 21:44:03 [0:32249]: setting limits 
04/07/2010 21:44:03 [0:32249]: RLIMIT_CPU setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY) 
04/07/2010 21:44:03 [0:32249]: RLIMIT_FSIZE setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY) 
04/07/2010 21:44:03 [0:32249]: RLIMIT_DATA setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY) 
04/07/2010 21:44:03 [0:32249]: RLIMIT_STACK setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY) 
04/07/2010 21:44:03 [0:32249]: RLIMIT_CORE setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY) 
04/07/2010 21:44:03 [0:32249]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 4294967296 hard 4294967296) resulting: (soft 4294967296 hard 4294967296) 
04/07/2010 21:44:03 [0:32249]: RLIMIT_RSS setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY) 
04/07/2010 21:44:03 [0:32249]: setting environment 
04/07/2010 21:44:03 [0:32249]: Initializing error file 
04/07/2010 21:44:03 [0:32249]: switching to intermediate/target user 
04/07/2010 21:44:03 [1049:32249]: closing all filedescriptors 
04/07/2010 21:44:03 [1049:32249]: further messages are in "error" and "trace" 
04/07/2010 21:44:03 [1049:32249]: now running with uid=1049, euid=1049 
04/07/2010 21:44:03 [1049:32249]: unable to find job file "/opt/sge/cluster/spool/node077/job_scripts/60497" 
04/07/2010 21:44:03 [0:32247]: wait3 returned 32249 (status: 2816; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11) 
04/07/2010 21:44:03 [0:32247]: job exited with exit status 11 
04/07/2010 21:44:03 [0:32247]: reaped "job" with pid 32249 
04/07/2010 21:44:03 [0:32247]: job exited not due to signal 
04/07/2010 21:44:03 [0:32247]: job exited with status 11 
04/07/2010 21:44:03 [0:32247]: now sending signal KILL to pid -32249 
04/07/2010 21:44:03 [0:32247]: no tasker to notify 
04/07/2010 21:44:03 [0:32247]: failed starting job 
04/07/2010 21:44:03 [0:32247]: /bin/true 
04/07/2010 21:44:03 [0:32247]: /bin/true 
04/07/2010 21:44:03 [0:32247]: parent: forked "pe_stop" with pid 32250 
04/07/2010 21:44:03 [0:32247]: using signal delivery delay of 120 seconds 
04/07/2010 21:44:03 [0:32247]: parent: pe_stop-pid: 32250 
04/07/2010 21:44:03 [0:32250]: child: starting son(pe_stop, /bin/true, 0); 
04/07/2010 21:44:03 [0:32250]: pid=32250 pgrp=32250 sid=32250 old pgrp=32247 getlogin()=<no login set> 
04/07/2010 21:44:03 [0:32250]: reading passwd information for user 'rosalia' 
04/07/2010 21:44:03 [0:32250]: setting limits 
04/07/2010 21:44:03 [0:32250]: setting environment 
04/07/2010 21:44:03 [0:32250]: Initializing error file 
04/07/2010 21:44:03 [0:32250]: switching to intermediate/target user 
04/07/2010 21:44:03 [1049:32250]: closing all filedescriptors 
04/07/2010 21:44:03 [1049:32250]: further messages are in "error" and "trace" 
04/07/2010 21:44:03 [1049:32250]: using "/bin/bash" as shell of user "rosalia" 
04/07/2010 21:44:03 [1049:32250]: now running with uid=1049, euid=1049 
04/07/2010 21:44:03 [1049:32250]: execvp(/bin/true, "/bin/true") 
04/07/2010 21:44:03 [0:32247]: wait3 returned 32250 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0) 
04/07/2010 21:44:03 [0:32247]: pe_stop exited with exit status 0 
04/07/2010 21:44:03 [0:32247]: reaped "pe_stop" with pid 32250 
04/07/2010 21:44:03 [0:32247]: pe_stop exited not due to signal 
04/07/2010 21:44:03 [0:32247]: pe_stop exited with status 0 
04/07/2010 21:44:03 [0:32247]: no tasker to notify 
04/07/2010 21:44:03 [0:32247]: no epilog script to start 

Shepherd error: 
04/07/2010 21:44:03 [1049:32249]: unable to find job file "/opt/sge/cluster/spool/node077/job_scripts/60497" 

Shepherd pe_hostfile: 
node077.cluster 2 batch.q@node077.cluster UNDEFINED
node069.cluster 1 batch.q@node069.cluster UNDEFINED
node074.cluster 1 batch.q@node074.cluster UNDEFINED


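The odd part is that only some of the array tasks fail. Other tasks of the same kind, including ones that previously ran on the same node, complete normally, so it does not look like a problem with a particular host. In the trace above, pe_start and pe_stop both exit with status 0; only the job step fails (exit status 11), because the shepherd cannot find the job script in the node's spool directory. Can someone more familiar with SGE internals help me work out why the job script would be missing?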
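
For anyone sifting through a long shepherd trace like the one above, here is a minimal Python sketch that lists the child steps that exited with a non-zero status. It is not part of SGE: the function name failed_children and both regular expressions are my own assumptions, based only on the 6.2u5 trace lines shown in this message, and other versions may format these lines differently.

```python
import re

# Assumed line formats, taken from the 6.2u5 trace quoted in this message.
FORK_RE = re.compile(r'parent: forked "(?P<name>[^"]+)" with pid (?P<pid>\d+)')
WAIT3_RE = re.compile(
    r"wait3 returned (?P<child>\d+) "
    r"\(status: \d+; WIFSIGNALED: \d+,\s+WIFEXITED: \d+, "
    r"WEXITSTATUS: (?P<code>\d+)\)"
)


def failed_children(trace_text):
    """Return (step_name, pid, exit_status) for each child that exited non-zero."""
    names = {}      # child pid -> step name ("pe_start", "job", ...)
    failures = []
    for line in trace_text.splitlines():
        fork = FORK_RE.search(line)
        if fork:
            names[int(fork["pid"])] = fork["name"]
            continue
        wait = WAIT3_RE.search(line)
        if wait and int(wait["code"]) != 0:
            pid = int(wait["child"])
            failures.append((names.get(pid, "unknown"), pid, int(wait["code"])))
    return failures


# A few lines copied from the trace above, used as sample input.
SAMPLE = """\
04/07/2010 21:44:03 [0:32247]: parent: forked "pe_start" with pid 32248
04/07/2010 21:44:03 [0:32247]: wait3 returned 32248 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
04/07/2010 21:44:03 [0:32247]: parent: forked "job" with pid 32249
04/07/2010 21:44:03 [0:32247]: wait3 returned 32249 (status: 2816; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
"""

if __name__ == "__main__":
    for name, pid, code in failed_children(SAMPLE):
        print(f"{name} (pid {pid}) exited with status {code}")
```

Run against the full trace, this should single out the "job" step, which matches what the "Shepherd error" section reports.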

The PE is for Open MPI and is set up according to the instructions in the Open MPI FAQ.
