No subject


Wed Jan 12 20:38:46 GMT 2011


job gets checkpointed (and thus receives SIGUSR1) at a very early 
stage (before "setting limits"). A contributing factor might also be
that we use NIS+.

Does anybody have an idea what might trigger this? A few jobs end up
in the error state every day, and clearing those error states manually
is becoming annoying. I'd appreciate any suggestions on what to look
for or what to change in our setup.

One representative job trace:

Job 1924219 caused action: Job 1924219 set to ERROR
 User        = xxx
 Queue       = t at netra04
 Host        = netra04
 Start Time  = <unknown>
 End Time    = <unknown>
failed opening input/output file:12/05/2007 09:35:05 [2530:20240]: can't stat() "/home/users/xxx/SGEtestjobs" as stdout_path: Perm
Shepherd trace:
12/05/2007 09:34:35 [11368:20239]: shepherd called with uid = 0, euid = 11368
12/05/2007 09:34:35 [11368:20239]: starting up 6.0u6
12/05/2007 09:34:35 [11368:20239]: setpgid(20239, 20239) returned 0
12/05/2007 09:34:35 [11368:20239]: no prolog script to start
12/05/2007 09:34:35 [11368:20239]: forked "job" with pid 20240
12/05/2007 09:34:35 [11368:20240]: pid=20240 pgrp=20240 sid=20240 old pgrp=20239 getlogin()=<no login set>
12/05/2007 09:34:35 [11368:20240]: reading passwd information for user 'xxx'
12/05/2007 09:34:35 [11368:20239]: kill -USR1 -$job_pid
12/05/2007 09:34:35 [11368:20240]: setosjobid: uid = 0, euid = 11368
12/05/2007 09:34:35 [11368:20239]: kill -USR1 -20240
12/05/2007 09:34:35 [11368:20239]: child: job - pid: 20240 - ckpt_pid: 20240 - ckpt_interval: 0 - ckpt_signal 0
12/05/2007 09:34:35 [11368:20240]: setting limits
12/05/2007 09:34:35 [11368:20240]: RLIMIT_CPU setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
12/05/2007 09:34:35 [11368:20240]: RLIMIT_FSIZE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
12/05/2007 09:34:35 [11368:20240]: RLIMIT_DATA setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
12/05/2007 09:34:35 [11368:20240]: RLIMIT_STACK setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
12/05/2007 09:34:35 [11368:20240]: RLIMIT_CORE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
12/05/2007 09:34:35 [11368:20240]: RLIMIT_VMEM setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
12/05/2007 09:34:35 [11368:20240]: setting environment
12/05/2007 09:34:35 [11368:20240]: Initializing error file
12/05/2007 09:34:35 [11368:20240]: now doing chown(xxx) of trace and error files
12/05/2007 09:34:35 [11368:20240]: switching to intermediate/target user
12/05/2007 09:34:35 [2530:20240]: now running with uid=2530, euid=2530
12/05/2007 09:34:35 [2530:20240]: closing all filedescriptors
12/05/2007 09:34:35 [2530:20240]: further messages are in "error" and "trace"
12/05/2007 09:34:51 [11368:20239]: wait3 returned -1
12/05/2007 09:34:51 [11368:20239]: mapped signal TTIN to signal unknown signal
12/05/2007 09:34:51 [11368:20239]: queued signal unknown signal
12/05/2007 09:35:05 [11368:20239]: wait3 returned 20240 (status: 6656; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 26)
12/05/2007 09:35:05 [11368:20239]: job exited with exit status 26
12/05/2007 09:35:05 [11368:20239]: reaped "job" with pid 20240
12/05/2007 09:35:05 [11368:20239]: job exited not due to signal
12/05/2007 09:35:05 [11368:20239]: checkpointing job exited normally
12/05/2007 09:35:05 [11368:20239]: starting ckpt clean command
12/05/2007 09:35:05 [11368:20239]: starting ckpt clean command
12/05/2007 09:35:05 [11368:20239]: no checkpointing clean command to start
12/05/2007 09:35:05 [11368:20239]: job exited with status 26
12/05/2007 09:35:05 [11368:20239]: now sending signal KILL to pid -20240
12/05/2007 09:35:05 [11368:20239]: no tasker to notify
12/05/2007 09:35:05 [11368:20239]: failed starting job
12/05/2007 09:35:05 [11368:20239]: no epilog script to start
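
For anyone who wants to check other traces for the same pattern, here is a minimal sketch in Python. It assumes the trace line layout shown above (timestamp, [uid:pid], message); the function names are my own and not part of Grid Engine. The first function reports whether a "kill -USR1" entry appears before "setting limits". The second splits a raw wait status such as 6656 into exit code and signal, assuming the traditional 16-bit encoding used on common Unix systems (os.WEXITSTATUS is the portable way to do this).

```python
import re

# Matches lines such as: 12/05/2007 09:34:35 [11368:20239]: message
TRACE_LINE = re.compile(
    r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} \[\d+:\d+\]: (.*)$"
)


def usr1_before_limits(trace_text):
    """Return True if 'kill -USR1' is logged before 'setting limits'."""
    usr1_index = None
    limits_index = None
    for index, line in enumerate(trace_text.splitlines()):
        match = TRACE_LINE.match(line.strip())
        if not match:
            continue  # skip wrapped continuation lines and headers
        message = match.group(1)
        if usr1_index is None and message.startswith("kill -USR1 "):
            usr1_index = index
        elif limits_index is None and message == "setting limits":
            limits_index = index
    if usr1_index is None:
        return False
    # USR1 with no later "setting limits" entry also counts as "before".
    return limits_index is None or usr1_index < limits_index


def decode_wait_status(status):
    """Split a wait status into (exit_code, signal_number).

    Assumes the traditional encoding: exit code in bits 8-15,
    terminating signal in the low 7 bits.
    """
    return (status >> 8) & 0xFF, status & 0x7F
```

Applied to the trace above, the status 6656 decodes to exit code 26 with no terminating signal, which matches the WEXITSTATUS the shepherd reports.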

Shepherd error:
12/05/2007 09:35:05 [2530:20240]: can't stat() "/home/users/xxx/SGEtestjobs" as stdout_path: Permission denied
KRB5CCNAME=none uid=2530 gid=1500 1500 60003 20029 
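
One way to narrow down a "Permission denied" from stat() is to walk the path one component at a time and see which prefix fails first. The following is only a sketch under assumptions: it has to run as the job's user on the execution host to be meaningful, the function name is hypothetical, and an intermittent problem may well not reproduce when run by hand.

```python
import os


def first_inaccessible_component(path):
    """Return (prefix, error_text) for the first prefix of `path`
    that cannot be stat()ed, or None if every prefix is accessible."""
    path = os.path.abspath(path)
    current = os.sep
    for part in path.strip(os.sep).split(os.sep):
        current = os.path.join(current, part)
        try:
            os.stat(current)
        except OSError as exc:
            # strerror is e.g. "Permission denied" or
            # "No such file or directory"
            return current, exc.strerror
    return None


if __name__ == "__main__":
    # Path taken from the shepherd error above.
    print(first_inaccessible_component("/home/users/xxx/SGEtestjobs"))
```

If this returns None when run interactively while jobs still fail occasionally, that would point toward something transient at job start rather than a static permission problem.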

Regards
D.Ruppert
----------------------------------
ePS & RTS Automation Software GmbH
Benzstr. 1
D-71272 Renningen
Geschäftsführer: Gernot Kral, Frank Lubnau, Dieter Schneider
Sitz der Gesellschaft: Renningen
Registergericht: Leonberg HRB 3220
