Re: [bug-diffutils] Bug#680990: Diff does not show BOM difference. (fwd)
From: jeanmichel . 123
Subject: Re: [bug-diffutils] Bug#680990: Diff does not show BOM difference. (fwd)
Date: Tue, 17 Jul 2012 02:23:59 +0200 (CEST)
I have not integrated it into diff, nor tested it,
because I do not know the diff code and its build process well enough.
But since there appears to be no Unicode support in the diff tool, code for an
IGNORE_ALL_UNICODE_SPACE mode, complementing IGNORE_ALL_SPACE and covering
for instance:
toUTF8_1 ( 0x0009 ),
toUTF8_1 ( 0x000A ),
toUTF8_1 ( 0x000B ),
toUTF8_1 ( 0x000C ),
toUTF8_1 ( 0x000D ),
toUTF8_1 ( 0x0020 ),
// c2
toUTF8_2 ( 0x0085 ),
toUTF8_2 ( 0x00A0 ),
// e1
toUTF8_3 ( 0x1680 ),
toUTF8_3 ( 0x180E ),
// e2 80
toUTF8_3 ( 0x2000 ),
toUTF8_3 ( 0x2001 ),
toUTF8_3 ( 0x2002 ),
toUTF8_3 ( 0x2003 ),
toUTF8_3 ( 0x2004 ),
toUTF8_3 ( 0x2005 ),
toUTF8_3 ( 0x2006 ),
toUTF8_3 ( 0x2007 ),
toUTF8_3 ( 0x2008 ),
toUTF8_3 ( 0x2009 ),
toUTF8_3 ( 0x200A ),
toUTF8_3 ( 0x2028 ),
toUTF8_3 ( 0x2029 ),
toUTF8_3 ( 0x202F ),
// e2 81
toUTF8_3 ( 0x205F ),
// e3
toUTF8_3 ( 0x3000 ),
// ef
toUTF8_3 ( 0xfeff ),
(taken from http://en.wikipedia.org/wiki/Whitespace_character, with the BOM added)
might look like:
//////////////////////////////////////////////////////
#include <stdio.h>
#include <string.h>

typedef unsigned char byte;

/** Check whether this byte sequence starts with a UTF-8 space character.
 *  Returns the number of bytes in the matching space character, or 0 if
 *  there is none.  The input must be NUL-terminated. */
int isUtf8Space(const byte *input)
{
    switch (input[0])
    {
    case 0x09:
    case 0x0a:
    case 0x0b:
    case 0x0c:
    case 0x0d:
    case 0x20:
        return 1;
    case 0xc2: /* U+0085, U+00A0 */
        if ((input[1] == 0xa0) || (input[1] == 0x85))
            return 2;
        break;
    case 0xe1: /* U+1680, U+180E */
        if ((input[1] == 0x9a) && (input[2] == 0x80))
            return 3;
        if ((input[1] == 0xa0) && (input[2] == 0x8e))
            return 3;
        break;
    case 0xe2:
        switch (input[1])
        {
        case 0x80: /* U+2000..U+200A, U+2028, U+2029, U+202F */
            if ((input[2] >= 0x80) && (input[2] <= 0x8a))
                return 3;
            if ((input[2] == 0xa8) || (input[2] == 0xa9) || (input[2] == 0xaf))
                return 3;
            break;
        case 0x81: /* U+205F */
            if (input[2] == 0x9f)
                return 3;
            break;
        }
        break;
    case 0xe3: /* U+3000 */
        if ((input[1] == 0x80) && (input[2] == 0x80))
            return 3;
        break;
    case 0xef: /* U+FEFF (BOM) */
        if ((input[1] == 0xbb) && (input[2] == 0xbf))
            return 3;
        break;
    }
    return 0;
}
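/** Test program: compare two arguments after skipping leading Unicode white space. */
int main(int argc, char **argv)
{
    char *t1, *t2;
    int n;
    char c1, c2;

    if (argc < 3)
    {
        fprintf(stderr, "usage: %s string1 string2\n", argv[0]);
        return 2;
    }
    t1 = argv[1];
    t2 = argv[2];
    //case IGNORE_ALL_UNICODE_SPACE:
    /* For -w, just skip past any white space. */
    while ((n = isUtf8Space((byte *) t1)) && *t1 != '\n') t1 += n;
    while ((n = isUtf8Space((byte *) t2)) && *t2 != '\n') t2 += n;
    c1 = *t1;
    c2 = *t2;
    //break;
    printf("<%s\n", t1);
    printf(">%s\n", t2);
    /* Exit status 0 if the remaining text is identical, 1 otherwise. */
    return strcmp(t1, t2) != 0;
}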
//////////////////////////////////////////////////////
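The toUTF8_1, toUTF8_2 and toUTF8_3 helpers used in the table are not defined in this message. As a rough sketch, assuming they stand for the 1-, 2- and 3-byte UTF-8 encodings of a code point, a small encoder can be used to double-check the byte values hard-coded in isUtf8Space. The name utf8_encode_bmp is made up for this illustration and is not part of diff:

```c
#include <stddef.h>

/* Hypothetical helper, not part of diff: encode a code point from the
   Basic Multilingual Plane (U+0000..U+FFFF) as UTF-8.
   Writes 1 to 3 bytes into out and returns the byte count,
   or 0 if cp is outside the BMP.
   Surrogates (U+D800..U+DFFF) are not rejected in this sketch. */
size_t utf8_encode_bmp(unsigned int cp, unsigned char out[3])
{
    if (cp > 0xFFFF)
        return 0;
    if (cp < 0x80) {
        /* 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char) (0xE0 | (cp >> 12));
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (cp & 0x3F));
    return 3;
}
```

For example, U+205F encodes to e2 81 9f and U+FEFF to ef bb bf, which matches the corresponding cases in the switch.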
Unfortunately, isUtf8Space only handles UTF-8, not UTF-16.
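For UTF-16 the same table could be checked one 16-bit code unit at a time, since every code point listed above lies in the Basic Multilingual Plane and therefore takes a single code unit. A minimal sketch, assuming the code unit has already been read in host byte order; isUtf16Space is a hypothetical name chosen to mirror isUtf8Space and does not exist in diff:

```c
/* Hypothetical UTF-16 counterpart of isUtf8Space, not part of diff.
   Takes one code unit in host byte order; returns 1 if it is one of the
   white space characters (or the BOM) from the table, otherwise 0. */
int isUtf16Space(unsigned short u)
{
    if (u >= 0x0009 && u <= 0x000D)
        return 1;
    if (u >= 0x2000 && u <= 0x200A)
        return 1;
    switch (u)
    {
    case 0x0020: case 0x0085: case 0x00A0:
    case 0x1680: case 0x180E:
    case 0x2028: case 0x2029: case 0x202F:
    case 0x205F: case 0x3000:
    case 0xFEFF:
        return 1;
    }
    return 0;
}
```

A BOM read with the wrong byte order shows up as 0xFFFE, which this sketch does not handle; a real implementation would have to detect the byte order first.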
----- Original message -----
It'd be reasonable to have 'diff' ignore byte-order-marks,
just as it already ignores things like white space,
when given a new option to do that.
Someone would have to write the code and documentation,
though.
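As an illustration of the smallest piece such an option would need for UTF-8 input, here is a sketch that detects a leading byte-order mark in a buffer so the caller can skip it. The function name skip_utf8_bom is hypothetical; nothing in this thread indicates that diff has such a function or option:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of diff: return the number of bytes to
   skip at the start of buf if it begins with the UTF-8 encoded BOM
   (EF BB BF), otherwise 0. */
size_t skip_utf8_bom(const char *buf, size_t len)
{
    static const char bom[3] = { '\xEF', '\xBB', '\xBF' };

    if (len >= sizeof bom && memcmp(buf, bom, sizeof bom) == 0)
        return sizeof bom;
    return 0;
}
```

A caller would advance its read pointer by the returned count on both files before comparing them, so that a file with a BOM and one without compare as equal.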