[GE users] wait3 returned -1 / now sending signal CONT to pid -3600

Sebastian Stark stark at tuebingen.mpg.de
Wed Nov 30 15:02:40 GMT 2005


I'm getting lots of job failure mails like the one below. What could be the cause? It only happens with some jobs.

Thanks for any hints.


-Sebastian

-------------------------------------------------------------------------------
To: stark at tuebingen.mpg.de
Subject: SGE 6.0u4: Job 649640 failed
From: <>

Job 649640 caused action: none
 User        = schweike
 Queue       = all.q at node112
 Host        = node112
 Start Time  = <unknown>
 End Time    = <unknown>
failed before writing exit_status:shepherd exited with exit status 19
Shepherd trace:
11/26/2005 15:10:29 [4399:3599]: shepherd called with uid = 0, euid = 4399
11/26/2005 15:10:30 [4399:3599]: setpgid(3599, 3599) returned 0
11/26/2005 15:10:30 [4399:3599]: no prolog script to start
11/26/2005 15:10:30 [4399:3600]: pid=3600 pgrp=3600 sid=3600 old pgrp=3599 getlogin()=<no login set>
11/26/2005 15:10:30 [4399:3599]: forked "job" with pid 3600
11/26/2005 15:10:30 [4399:3599]: child: job - pid: 3600
11/26/2005 15:10:30 [4399:3600]: setosjobid: uid = 0, euid = 4399
11/26/2005 15:10:30 [4399:3600]: RLIMIT_CPU setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
11/26/2005 15:10:30 [4399:3600]: RLIMIT_FSIZE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
11/26/2005 15:10:30 [4399:3600]: RLIMIT_DATA setting: (soft 6291456000 hard 6291456000) resulting: (soft 6291456000 hard 6291456000)
11/26/2005 15:10:30 [4399:3600]: RLIMIT_STACK setting: (soft 8388608 hard 8388608) resulting: (soft 8388608 hard 8388608)
11/26/2005 15:10:30 [4399:3600]: RLIMIT_CORE setting: (soft 0 hard 0) resulting: (soft 0 hard 0)
11/26/2005 15:10:30 [4399:3600]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 6291456000 hard 6291456000) resulting: (soft 6291456000 hard 6291456000)
11/26/2005 15:10:30 [4399:3600]: RLIMIT_RSS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
11/26/2005 15:10:30 [1840:3600]: closing all filedescriptors
11/26/2005 15:10:30 [1840:3600]: further messages are in "error" and "trace"
11/26/2005 15:10:30 [1840:3600]: using stdout as stderr
11/26/2005 15:10:30 [1840:3600]: execvp(/usr/local/sge/default/spool/node112/job_scripts/649640, "/usr/local/sge/default/spool/node112/job_scripts/649640")
11/26/2005 15:11:27 [4399:3599]: wait3 returned -1
11/26/2005 15:11:27 [4399:3599]: queued signal CONT
11/26/2005 15:11:27 [4399:3599]: kill(-3600, CONT)
11/26/2005 15:11:27 [4399:3599]: now sending signal CONT to pid -3600
11/26/2005 15:12:27 [4399:3599]: wait3 returned -1
11/26/2005 15:12:27 [4399:3599]: queued signal CONT
11/26/2005 15:12:27 [4399:3599]: kill(-3600, CONT)
11/26/2005 15:12:27 [4399:3599]: now sending signal CONT to pid -3600
11/26/2005 15:13:27 [4399:3599]: wait3 returned -1
11/26/2005 15:13:27 [4399:3599]: queued signal CONT
11/26/2005 15:13:27 [4399:3599]: kill(-3600, CONT)
11/26/2005 15:13:27 [4399:3599]: now sending signal CONT to pid -3600
[...same again every minute for hours and hours...]

-------------------------------------------------------------------------------
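
For reference, the two system calls that repeat in the trace can be reproduced outside of Grid Engine. The sketch below is generic POSIX behaviour shown from Python, not the shepherd's actual code. The trace does not record errno, so whether the shepherd's wait3 failed with ECHILD is an assumption that cannot be confirmed from the mail. The function name `run_demo` is made up for illustration. Note that Python raises an exception where the C call would return -1, and that a negative pid passed to kill addresses a whole process group, which is why the trace shows "pid -3600".

```python
import errno
import os
import signal


def run_demo():
    """Reproduce a failing wait3() and a kill() on an empty process group."""
    events = []

    pid = os.fork()
    if pid == 0:
        # Child: become leader of its own process group, then exit at once.
        os.setpgid(0, 0)
        os._exit(0)

    # Parent: the first wait3() reaps the only child.
    reaped, _status, _rusage = os.wait3(0)
    events.append(("reaped", reaped == pid))

    # With no children left, wait3() fails. In C this is a return value
    # of -1 with errno set; Python raises ChildProcessError (ECHILD).
    try:
        os.wait3(0)
    except ChildProcessError as err:
        events.append(("wait3", errno.errorcode[err.errno]))

    # kill(-pgid, sig) signals every process in group pgid. The group
    # is empty now, so the call fails with ESRCH.
    try:
        os.kill(-pid, signal.SIGCONT)
    except ProcessLookupError as err:
        events.append(("kill", errno.errorcode[err.errno]))

    return events


if __name__ == "__main__":
    for name, result in run_demo():
        print(name, result)
```

If the shepherd's wait3 is failing for a different reason (for example, an interrupted call), the same approach of inspecting errno right after the -1 return would tell the cases apart.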

-- 
Sebastian Stark -- http://www.kyb.tuebingen.mpg.de/~stark
Max Planck Institute for Biological Cybernetics




