[GE users] Spurious errors with checkpointing

Reuti reuti at staff.uni-marburg.de
Sun Dec 9 18:07:04 GMT 2007


Hi,

On 05.12.2007 at 11:55, Ruppert wrote:

> we use Gridengine 6.0u6 on Solaris10/Sparc and are occasionally
> getting errors "can't stat ... as stdout_path: Permission denied"
> from jobs which are being suspended in subordinate queues.
>
> The setup is the following: we have two queues per host, one for
> potentially long running jobs (t) and one for "immediate" jobs (b),
> with t being subordinate to b. Thus, when a job in b starts, a job
> in t is migrated to another, usually less powerful machine by a
> checkpointing environment attached to t which does a kill -USR1 to
> the job's process group.
>
> All this usually works as it should, but from time to time a job in t
> goes into an error state with the above message. The stdout_path
> is, of course, owned and writable by the user of the job.
>
> From the job trace (included below) I have the impression that the
> job gets checkpointed (and thus receives SIGUSR1) at a very early
> stage (before "setting limits"). A contributing factor might also be
> that we use NIS+.
>
> Does anybody have an idea what may trigger this? We get a few of these
> jobs in error states every day, and it is becoming annoying to clear
> these error states manually. I'd appreciate any suggestion on what to
> look for or what to change in our setup.
>
> One representative job trace:
>
> Job 1924219 caused action: Job 1924219 set to ERROR
>  User        = xxx
>  Queue       = t at netra04
>  Host        = netra04
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed opening input/output file:12/05/2007 09:35:05 [2530:20240]: can't stat() "/home/users/xxx/SGEtestjobs" as stdout_path: Perm
> Shepherd trace:
> 12/05/2007 09:34:35 [11368:20239]: shepherd called with uid = 0, euid = 11368
> 12/05/2007 09:34:35 [11368:20239]: starting up 6.0u6
> 12/05/2007 09:34:35 [11368:20239]: setpgid(20239, 20239) returned 0
> 12/05/2007 09:34:35 [11368:20239]: no prolog script to start
> 12/05/2007 09:34:35 [11368:20239]: forked "job" with pid 20240
> 12/05/2007 09:34:35 [11368:20240]: pid=20240 pgrp=20240 sid=20240 old pgrp=20239 getlogin()=<no login set>
> 12/05/2007 09:34:35 [11368:20240]: reading passwd information for user 'xxx'
> 12/05/2007 09:34:35 [11368:20239]: kill -USR1 -$job_pid
> 12/05/2007 09:34:35 [11368:20240]: setosjobid: uid = 0, euid = 11368
> 12/05/2007 09:34:35 [11368:20239]: kill -USR1 -20240
> 12/05/2007 09:34:35 [11368:20239]: child: job - pid: 20240 - ckpt_pid: 20240 - ckpt_interval: 0 - ckpt_signal 0
> 12/05/2007 09:34:35 [11368:20240]: setting limits
> 12/05/2007 09:34:35 [11368:20240]: RLIMIT_CPU setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 12/05/2007 09:34:35 [11368:20240]: RLIMIT_FSIZE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 12/05/2007 09:34:35 [11368:20240]: RLIMIT_DATA setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 12/05/2007 09:34:35 [11368:20240]: RLIMIT_STACK setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 12/05/2007 09:34:35 [11368:20240]: RLIMIT_CORE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 12/05/2007 09:34:35 [11368:20240]: RLIMIT_VMEM setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 12/05/2007 09:34:35 [11368:20240]: setting environment
> 12/05/2007 09:34:35 [11368:20240]: Initializing error file
> 12/05/2007 09:34:35 [11368:20240]: now doing chown(xxx) of trace and error files
> 12/05/2007 09:34:35 [11368:20240]: switching to intermediate/target user
> 12/05/2007 09:34:35 [2530:20240]: now running with uid=2530, euid=2530
> 12/05/2007 09:34:35 [2530:20240]: closing all filedescriptors
> 12/05/2007 09:34:35 [2530:20240]: further messages are in "error" and "trace"
> 12/05/2007 09:34:51 [11368:20239]: wait3 returned -1
> 12/05/2007 09:34:51 [11368:20239]: mapped signal TTIN to signal unknown signal
> 12/05/2007 09:34:51 [11368:20239]: queued signal unknown signal
> 12/05/2007 09:35:05 [11368:20239]: wait3 returned 20240 (status: 6656; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 26)
> 12/05/2007 09:35:05 [11368:20239]: job exited with exit status 26
> 12/05/2007 09:35:05 [11368:20239]: reaped "job" with pid 20240
> 12/05/2007 09:35:05 [11368:20239]: job exited not due to signal
> 12/05/2007 09:35:05 [11368:20239]: checkpointing job exited normally
> 12/05/2007 09:35:05 [11368:20239]: starting ckpt clean command
> 12/05/2007 09:35:05 [11368:20239]: starting ckpt clean command
> 12/05/2007 09:35:05 [11368:20239]: no checkpointing clean command to start
> 12/05/2007 09:35:05 [11368:20239]: job exited with status 26
> 12/05/2007 09:35:05 [11368:20239]: now sending signal KILL to pid -20240
> 12/05/2007 09:35:05 [11368:20239]: no tasker to notify
> 12/05/2007 09:35:05 [11368:20239]: failed starting job
> 12/05/2007 09:35:05 [11368:20239]: no epilog script to start
>
> Shepherd error:
> 12/05/2007 09:35:05 [2530:20240]: can't stat() "/home/users/xxx/SGEtestjobs" as stdout_path: Permission denied
> KRB5CCNAME=none uid=2530 gid=1500 1500 60003 20029

What checkpointing environment did you set up? Can you try using
another signal, one that SGE does not already use for other tasks?
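
A minimal sketch of that idea, assuming the job itself can be changed to
react to a different signal. Python is used only for illustration;
CKPT_SIGNAL, on_checkpoint and install_handler are hypothetical names, and
SIGHUP is a placeholder rather than a recommendation. Which signal is
actually free depends on the configuration: with qsub -notify, for
instance, SGE itself sends SIGUSR1 before suspending a job and SIGUSR2
before killing it.

```python
import os
import signal

# Assumption: SIGHUP only stands in for "a signal SGE does not already
# send to the job"; check your own setup before settling on one.
CKPT_SIGNAL = signal.SIGHUP

checkpoint_requested = False


def on_checkpoint(signum, frame):
    """Only record the request; the job's main loop acts on it later."""
    global checkpoint_requested
    checkpoint_requested = True


def install_handler(sig=CKPT_SIGNAL):
    """Install the handler; meant to be the first thing the job does."""
    return signal.signal(sig, on_checkpoint)


install_handler()

# Stand-in for the migration command. It targets this single process
# (not a process group) so that the sketch is safe to run on its own.
os.kill(os.getpid(), CKPT_SIGNAL)
print("checkpoint requested:", checkpoint_requested)
```

The checkpointing environment would then have to send the matching
signal, for example kill -HUP -$job_pid in place of kill -USR1 -$job_pid.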

-- Reuti


> Regards
> D.Ruppert
> ----------------------------------
> ePS & RTS Automation Software GmbH
> Benzstr. 1
> D-71272 Renningen
> Geschäftsführer: Gernot Kral, Frank Lubnau, Dieter Schneider
> Sitz der Gesellschaft: Renningen
> Registergericht: Leonberg HRB 3220
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
