[GE users] switched to 6.1 and many problems

Harald Pollinger Harald.Pollinger at Sun.COM
Thu Aug 9 12:40:34 BST 2007


Does anybody know if there is a way to force qsub to submit qrsh jobs?
I don't know of any.

On the submit host, you could run

# . $SGE_ROOT/util/dl.sh
# dl 1
# qsub

then enter "id" and press Ctrl-D. Send us the whole output.

Then start
# qrsh id

and send the whole output, too.
This will tell us what really happens on the submit host.
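
If it is easier, the two steps can be wrapped so that the output lands in
files you can attach. This is only a sketch: the function name and the output
file names are made up for this example, and it assumes SGE_ROOT is set and
that dl.sh defines "dl" as used above. The subshell keeps the debug settings
out of your login shell.

```shell
#!/bin/sh
# Sketch: capture the debug output of one qsub and one qrsh submission.
# capture_submit_debug and the *.out file names are hypothetical.
capture_submit_debug() {
    outdir=${1:-.}
    (
        # Load the debug helper and switch on debug level 1, as above.
        . "$SGE_ROOT/util/dl.sh"
        dl 1
        # Piping "id" into qsub replaces typing it and pressing Ctrl-D.
        echo id | qsub > "$outdir/qsub_debug.out" 2>&1
        qrsh id > "$outdir/qrsh_debug.out" 2>&1
    )
}
```

Calling "capture_submit_debug /tmp" on the submit host would leave
/tmp/qsub_debug.out and /tmp/qrsh_debug.out to send to the list.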


Regards,
Harald


Yifan Zhang wrote:
> Hi,
> 
> Thanks for confirming that it is using qrsh to do qsub. I checked qsub; 
> it is not linked or aliased to qrsh.
> Here is what is in my sge_request:
> -cwd -e . -o . -V -S /bin/sh
> 
> Is there any way, from inside the script, to check whether my job was
> submitted with qsub or with qrsh?
> 
> Thanks
> 
> Harald Pollinger wrote:
> 
>> Hi,
>>
>> this is the trace file of a qrsh job. Are you sure there is no 
>> link/alias/wrapper and the "qrsh" binary was not renamed or copied to 
>> "qsub"?
>> "qsub -help" prints what it really is.
>>
>>
>> > 08/08/2007 16:45:40 [0:15106]: error connecting stream socket: No
>> > route to host
>>
>> This is the reason why the job fails; all later error messages are 
>> only follow-on errors. "No route to host" is a system error returned 
>> by the connect(3SOCKET) call (EHOSTUNREACH; errno 148 on Solaris, 113 
>> on Linux): the execution host really can't connect to 
>> "munge.softsound.com".
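
To reproduce this outside Grid Engine, a TCP connect can be tried by hand
from the execution host. The following is a rough sketch, not part of SGE:
check_tcp is a made-up name, and it relies on bash's /dev/tcp redirection
and on the coreutils "timeout" command being available.

```shell
#!/bin/bash
# Sketch: report whether this host can open a TCP connection to host:port.
# check_tcp is a hypothetical helper; bash prints the system error
# (for example "No route to host") on stderr when the connect fails.
check_tcp() {
    local host=$1 port=$2
    # The inner bash receives host as $0 and port as $1.
    if timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port"; then
        echo "connect to $host:$port ok"
    else
        echo "connect to $host:$port FAILED"
        return 1
    fi
}
```

For example, "check_tcp munge.softsound.com 22" run on hal00. The port that
qrsh listens on changes with every job, so pick a port where something is
known to listen. "Connection refused" would at least show that the host is
reachable; "No route to host" points to routing, or to a packet filter on
the target that rejects the connection.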
>>
>> Regards,
>> Harald
>>
>>
>> Yifan Zhang wrote:
>>
>>> Hi,
>>>
>>> Thanks for helping. ignore_fqdn was set to true. The funny thing 
>>> is that the failure "can't open file /tmp/13851.1.real.q/pid: No such 
>>> file" does not really stop the job from finishing. I don't really 
>>> know the details of GE: does every qsub start as a qlogin to execute jobs?
>>> Also, we have Linux systems in different flavours; could that cause 
>>> a problem? Say the submit host is FC and the execution host is SuSE.
>>>
>>> Thanks
>>>
>>> Chansup Byun wrote:
>>>
>>>> Reuti wrote:
>>>>
>>>>> Am 08.08.2007 um 17:59 schrieb Yifan Zhang:
>>>>>
>>>>>> We were using SGE 5.3, and it has been doing its job for years. Now 
>>>>>> we decided to switch to GE 6.1. At the beginning it worked, and then, 
>>>>>> for no apparent reason, every host gets set to error automatically. 
>>>>>> The job finishes fine. It is a qsub job, not qrsh. Please help if you 
>>>>>> know what is causing this. A typical error message is as 
>>>>>> follows:
>>>>>>
>>>>>> Job 13851 caused action: Queue "real.q@hal00.softsound.com" set to 
>>>>>> ERROR
>>>>>> User        = ajr
>>>>>> Queue       = real.q@hal00.softsound.com
>>>>>> Host        = hal00.softsound.com
>>>>>> Start Time  = <unknown>
>>>>>> End Time    = <unknown>
>>>>>> failed before job:08/08/2007 16:45:40 [0:15106]: can't open file 
>>>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>>>> Shepherd trace:
>>>>>> 08/08/2007 16:45:37 [16384:15105]: shepherd called with uid = 0, 
>>>>>> euid = 16384
>>>>>> 08/08/2007 16:45:37 [16384:15105]: setpgid(15105, 15105) returned 0
>>>>>> 08/08/2007 16:45:37 [16384:15105]: no prolog script to start
>>>>>> 08/08/2007 16:45:37 [16384:15105]: forked "job" with pid 15106
>>>>>> 08/08/2007 16:45:37 [16384:15106]: processing qlogin job
>>>>>> 08/08/2007 16:45:37 [16384:15106]: pid=15106 pgrp=15106 sid=15106 
>>>>>> old pgrp=15105 getlogin()=<no login set>
>>>>>> 08/08/2007 16:45:37 [16384:15105]: child: job - pid: 15106
>>>>>> 08/08/2007 16:45:37 [16384:15106]: reading passwd information for 
>>>>>> user 'root'
>>>>>> 08/08/2007 16:45:37 [16384:15106]: setting limits
>>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_CPU setting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_FSIZE setting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_DATA setting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_STACK setting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_CORE setting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_VMEM/RLIMIT_AS setting: 
>>>>>> (soft 18446744073709551615 hard 18446744073709551615) resulting: 
>>>>>> (soft 18446744073709551615 hard 18446744073709551615)
>>>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_RSS setting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>>>> 18446744073709551615 hard 18446744073709551615)
>>>>>> 08/08/2007 16:45:37 [16384:15106]: setting environment
>>>>>> 08/08/2007 16:45:37 [16384:15106]: Initializing error file
>>>>>> 08/08/2007 16:45:37 [16384:15106]: switching to 
>>>>>> intermediate/target user
>>>>>> 08/08/2007 16:45:37 [16519:15106]: closing all filedescriptors
>>>>>> 08/08/2007 16:45:37 [16519:15106]: further messages are in "error" 
>>>>>> and "trace"
>>>>>> 08/08/2007 16:45:37 [0:15106]: now running with uid=0, euid=0
>>>>>> 08/08/2007 16:45:37 [0:15106]: start qlogin
>>>>>> 08/08/2007 16:45:37 [0:15106]: calling 
>>>>>> qlogin_starter(/home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1, 
>>>>>> /usr/local/softsound/GridEngine/utilbin/lx24-amd64/rshd -l);
>>>>>> 08/08/2007 16:45:37 [0:15106]: uid = 0, euid = 0, gid = 0, egid = 0
>>>>>> 08/08/2007 16:45:37 [0:15106]: using sfd 1
>>>>>> 08/08/2007 16:45:37 [0:15106]: bound to port 52903
>>>>>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - data = 
>>>>>> 0:52903:/usr/local/softsound/GridEngine/utilbin/lx24-amd64:/home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1:hal00.softsound.com 
>>>>>>
>>>>>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - address = 
>>>>>> munge.softsound.com:33046
>>>>>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - host = 
>>>>>> munge.softsound.com, port = 33046
>>>>>
>>>>>
>>>>>
>>>>> Your qmaster is hal00 and you want to access munge.softsound.com? 
>>>>> Is there something special about munge? Does it show up in qhost? 
>>>>> Does this node have more than one name and/or network card?
>>>>
>>>>
>>>>
>>>> hal00 should be an execution host.
>>>> munge.softsound.com should be a submit host.
>>>>
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> 08/08/2007 16:45:40 [0:15106]: error connecting stream socket: No 
>>>>>> route to host
>>>>>> 08/08/2007 16:45:40 [0:15106]: communication with qrsh failed
>>>>>> 08/08/2007 16:45:40 [0:15106]: forked "job" with pid 0
>>>>>> 08/08/2007 16:45:40 [0:15106]: child: job - pid: 0
>>>>>> 08/08/2007 16:45:40 [0:15106]: wait3 returned -1
>>>>>> 08/08/2007 16:45:40 [0:15106]: can't open file 
>>>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>>>> 08/08/2007 16:45:40 [0:15106]: write_to_qrsh - data = 1:can't open 
>>>>>> file /tmp/13851.1.real.q/pid: No such file or directory
>>>>>> 08/08/2007 16:45:40 [0:15106]: write_to_qrsh - address = 
>>>>>> munge.softsound.com
>>>>>> 08/08/2007 16:45:40 [0:15106]: illegal value for 
>>>>>> qrsh_control_port: "munge.softsound.com". Should be host:port
>>>>>
>>>>>
>>>>
>>>> How is ignore_fqdn set?
>>>> Could the FQDN hostname have confused the SGE shepherd?
>>>>
>>>> Regards,
>>>>
>>>> - Chansup
>>>>
>>>>
>>>>>> 08/08/2007 16:45:40 [16384:15105]: wait3 returned 15106 (status: 
>>>>>> 2816; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
>>>>>> 08/08/2007 16:45:40 [16384:15105]: job exited with exit status 11
>>>>>> 08/08/2007 16:45:40 [16384:15105]: reaped "job" with pid 15106
>>>>>> 08/08/2007 16:45:40 [16384:15105]: job exited not due to signal
>>>>>> 08/08/2007 16:45:40 [16384:15105]: job exited with status 11
>>>>>> 08/08/2007 16:45:40 [0:15105]: can't open file 
>>>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>>>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - data = 1:can't open 
>>>>>> file /tmp/13851.1.real.q/pid: No such file or directory
>>>>>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - address = 
>>>>>> munge.softsound.com:33046
>>>>>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - host = 
>>>>>> munge.softsound.com, port = 33046
>>>>>> 08/08/2007 16:45:43 [0:15105]: error connecting stream socket: No 
>>>>>> route to host
>>>>>>
>>>>>> Shepherd error:
>>>>>> 08/08/2007 16:45:40 [0:15106]: can't open file 
>>>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>>>> 08/08/2007 16:45:40 [0:15105]: can't open file 
>>>>>> /tmp/13851.1.real.q/pid: No such file or directory
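
A side note on the "wait3 returned 15106 (status: 2816; ... WEXITSTATUS: 11)"
line in the trace: the raw status and the exit status are the same
information. On Linux the exit code sits in the second byte of the status
word, so 2816 = 11 * 256. A small sketch of that arithmetic follows;
decode_wait_status is a name made up here, and it ignores the
stopped/continued cases.

```shell
#!/bin/sh
# Sketch: decode a raw wait status like WIFEXITED/WEXITSTATUS do on Linux.
# decode_wait_status is a hypothetical helper for illustration only.
decode_wait_status() {
    status=$1
    sig=$(( status & 127 ))          # low 7 bits: terminating signal, 0 if none
    code=$(( (status >> 8) & 255 ))  # next 8 bits: exit status
    if [ "$sig" -eq 0 ]; then
        echo "exited $code"
    else
        echo "signaled $sig"
    fi
}

decode_wait_status 2816   # prints "exited 11"
```

This matches the "job exited with exit status 11" and "job exited not due to
signal" lines that follow in the trace.
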
>>>>>>
>>>>>>
>>>>>> Thank you very much!
>>>>>>
>>>>>> -- 
>>>>>> /*
>>>>>> * Yifan Zhang
>>>>>> *
>>>>>> * Softsound
>>>>>> * +44 (0)1223 448 097
>>>>>> */
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>>
>>
>>
> 
> 


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         N1 Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1,
D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering
