
Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request


From: Stephen Wells
Subject: Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
Date: Tue, 31 Mar 2015 23:16:14 +0100

Hi Tim,

Sorry for the ambiguity. To be more specific, the file name is fine: in the
shell script the file name $*.mp3 expands correctly to e.g. мазать.mp3 .
The audio within the file consists of the Google robot voice reading the
string of percent-escaped characters literally, not reading the Russian
word.

I will try Random Coder's suggestion of a more complete user agent string;
apparently http://whatsmyuseragent.com/ is a handy way to find out what
your browser claims to be :)

On Tue, Mar 31, 2015 at 9:50 PM, Tim Rühsen <address@hidden> wrote:

> Hi Steven,
>
> Am Dienstag, 31. März 2015, 18:11:58 schrieb Stephen Wells:
> > Dear all - I am currently trying to use wget to obtain mp3 files from the
> > Google Translate TTS system. In principle this can be done using:
> >
> > wget -U Mozilla -O "${string}.mp3" "http://translate.google.com/translate_tts?tl=TL&q=${string}";
> >
> > where TL is a two-letter language code (en, fr, de and so on).
> >
> > However, I hit a serious problem when I try to send Russian strings
> > (tl=ru) in Cyrillic characters. I'm working in a UTF-8 environment (under
> > Cygwin) and the file system displays the Cyrillic strings without any problem.
> > If I give wget a URL like this:
> >
> > http://translate.google.com/translate_tts?tl=ru&q=мазать
> >
> > wget incorrectly processes the Cyrillic characters _before_ sending the
> > http request, so what it actually requests is:
> >
> >
> > http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C
>
> This seems to be the correct behavior of a web client.
> The URL in the GET request is transmitted UTF-8 encoded, and percent-escaping
> is performed for chars >127 (not mentioning control chars here).
>
> > This of course produces a string of gibberish in the resulting mp3 file!
>
> This is something different. If you are talking about the file name,
> there is --restrict-file-names=nocontrol. Did you give it a try?
>
> > Is there any way to make wget actually send the string it is given, instead
> > of mangling it on the way out? This is really blocking me.
>
> From what you write, I am unsure if you are talking about the resulting file
> name or about HTTP URL encoding in a GET request.
>
> Regards, Tim
>
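
For reference, the percent-escaping discussed in the quoted reply can be
reproduced by hand. The following is only a minimal sketch, not what wget
does internally: percent_encode is a hypothetical helper name, it assumes a
UTF-8 shell environment like the one described above, and it escapes every
byte (including plain ASCII), which is more than a web client would escape
but still decodes to the same string.

```shell
# Hypothetical helper: percent-encode every byte of the first argument.
# od dumps the bytes as hex, tr strips the whitespace od adds,
# sed puts a '%' in front of each hex pair, and the last tr uppercases.
percent_encode() {
    printf '%s' "$1" | od -An -v -t x1 | tr -d ' \t\n' | sed 's/../%&/g' | tr 'a-f' 'A-F'
}

# Build the query string for the Russian word from the thread.
encoded=$(percent_encode 'мазать')
printf '%s\n' "http://translate.google.com/translate_tts?tl=ru&q=${encoded}"
# prints http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C
```

The result matches the request quoted above: each Cyrillic letter is two
UTF-8 bytes, and each byte becomes one %XX escape.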

