Re: [Bug-gnupod] Encoding of non-ascii characters in GNUtunesDB.xml
From: H. Langos
Subject: Re: [Bug-gnupod] Encoding of non-ascii characters in GNUtunesDB.xml
Date: Tue, 15 Apr 2008 01:26:18 +0200
User-agent: Mutt/1.5.13 (2006-08-11)

Patch of the patch ... performance is better but could still be improved,
I guess.
Instead of making 4 utf8 conversions and 3 substring operations on each
character we are down to one ord() and one substr() per character. Still
bad but way better than before.
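
For anyone reading along without the attachment, here is a minimal sketch of
what "one ord() and one substr() per character" can look like. It is not the
attached patch: the function name escape_non_ascii is hypothetical, and it
uses the core Encode module instead of Unicode::String so that it runs on a
stock perl. The input is assumed to be a UTF-8 encoded byte string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the loop in xescaped(); not the attached patch.
sub escape_non_ascii {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # decode the UTF-8 bytes once
    my $out   = '';
    for my $i (0 .. length($chars) - 1) {
        my $c    = substr($chars, $i, 1);  # one substr() per character
        my $code = ord($c);                # one ord() per character
        # Non-ascii characters become decimal character references.
        $out .= $code > 127 ? "&#$code;" : $c;
    }
    return $out;
}

print escape_non_ascii("caf\xC3\xA9"), "\n";   # prints caf&#233;
```

The point is only that the string is decoded once, outside the loop, instead
of once per lookup.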
-henrik
PS: Anybody interested in getting complete usable files instead of
patches?
On Mon, Apr 14, 2008 at 08:12:30PM +0200, H. Langos wrote:
>
> Ok, here's the patch ...
>
> Took longer than I thought because UTF8 in perl is a major pain.
>
> cheers
> -henrik
>
> PS: The line "$xutf =~ tr/\000-\037//d;" is not without problems. It
> will reduce all control characters to nothing including TAB, LF,
> and CR even though they are valid XML characters.
>
> Could somebody check out how iTunes handles those? Does it also remove
> those characters or does it convert them into &#9; and so on?
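
If it turns out that those three should be kept, the tr range could be split
so that TAB (\011), LF (\012) and CR (\015) survive. A minimal sketch, with
the hypothetical helper name strip_invalid_controls; whether iTunes behaves
this way is still the open question above.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the control characters 0x00-0x1f except
# TAB (\011), LF (\012) and CR (\015), which XML 1.0 permits.
sub strip_invalid_controls {
    my ($s) = @_;
    $s =~ tr/\000-\010\013\014\016-\037//d;
    return $s;
}

my $cleaned = strip_invalid_controls("a\tb\x01c\n");
print $cleaned eq "a\tbc\n" ? "kept TAB and LF\n" : "unexpected result\n";
```

Note that an XML parser normalizes literal TAB, LF and CR inside attribute
values to spaces, so keeping them exactly would mean writing them as
character references.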
>
>
> On Mon, Apr 14, 2008 at 02:14:18PM +0200, H. Langos wrote:
> > Hi there,
> >
> > I wonder if anybody else has the occasional problem with editing her/his
> > GNUtunesDB.xml.
> >
> > Since it is XML and the encoding is UTF-8 you don't have any problem as
> > long as your system is completely UTF-8 compliant. I, however, have a
> > mixed iso-8859-1, iso-8859-15 and UTF-8 mess and some of the editors
> > that I like to use are not very smart about handling the character
> > encoding.
> >
> > It would be very easy to convert everything outside the ascii range to
> > the XML escaped version. So say, instead of some garbage you'd see
> > "&#347;" where a "Latin Small Letter s with Acute" is.
> >
> > Pro: GNUtunesDB.xml becomes a pure ascii file. No more editor/viewer
> > issues.
> >
> > Contra: The GNUtunesDB.xml becomes slightly bigger and for people with a
> > clean UTF-8 toolchain it becomes a little less readable. (Note: You can
> > still edit the file and insert native UTF-8 as you please.)
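
To illustrate the proposal, a minimal sketch of the round trip: escaping
gives pure ascii, and decoding the decimal references gives the original
text back. The helper names to_char_refs and from_char_refs are hypothetical
and not part of GNUpod; the input is assumed to be an already decoded
character string, and only decimal references are handled.

```perl
use strict;
use warnings;

# Hypothetical helpers illustrating the proposal; not GNUpod code.
sub to_char_refs {
    my ($text) = @_;                   # a decoded (character) string
    # Replace every character outside the ascii range with &#NNN;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

sub from_char_refs {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;    # decimal references only
    return $text;
}

my $title   = "Mi\x{15B}";   # ends in U+015B, Latin small letter s with acute
my $escaped = to_char_refs($title);
print "$escaped\n";          # prints Mi&#347;
print from_char_refs($escaped) eq $title ? "round trip ok\n" : "mismatch\n";
```

Any XML parser resolves such references itself, so a reader of the file sees
the same text whether a character is stored as UTF-8 or as a reference.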
> >
> > Any thoughts?
> >
> > cheers
> > -henrik
> >
> >
> >
> > _______________________________________________
> > Bug-gnupod mailing list
> > address@hidden
> > http://lists.nongnu.org/mailman/listinfo/bug-gnupod
> commit 5ce6a9e9173dce95287ff4b15deda67b569dd365
> Author: Heinrich Langos <address@hidden>
> Date: Mon Apr 14 19:49:54 2008 +0200
>
> Changed encoding of unicode characters outside of ascii range to XML
> notation.
>
> This change will make your GNUtunesDB.xml into a pure ascii file, making it
> easier to view and manipulate on non-utf8 capable systems.
>
> Note: "xescaped()" is not only called for attribute values but also for
> element names and attribute names. So if sombody comes up with non-ascii
> element names or attribute names we would have to treat those differently.
>
> diff --git a/src/ext/XMLhelper.pm b/src/ext/XMLhelper.pm
> index 5eaeb48..2a230a3 100755
> --- a/src/ext/XMLhelper.pm
> +++ b/src/ext/XMLhelper.pm
> @@ -124,8 +124,15 @@ sub xescaped {
> my $xutf = Unicode::String::utf8($ret)->utf8;
> #Remove 0x00 - 0x1f chars (we don't need them)
> $xutf =~ tr/\000-\037//d;
> -
> - return $xutf;
> + my $out = Unicode::String::utf8("")->utf8;
> + for (my $i = 0 ; $i < Unicode::String::utf8($xutf)->length ; $i++) {
> + if (Unicode::String::utf8($xutf)->substr($i,1)->ord > 127) {
> + $out .= '&#' . Unicode::String::utf8($xutf)->substr($i,1)->ord . ';';
> + } else {
> + $out .= Unicode::String::utf8($xutf)->substr($i,1) ;
> + }
> + }
> + return $out;
> }
>
>
feat_1ace2709_improved_performance_of_utf8_to_ascii_encoding.patch
Description: Text Data