UTF-8 bytes encoded as XML entities in description


#1

I found some garbled characters in one feed. I investigated it and found the following:

  • The Atom feed contains the following: <description>it&#xE2;&#x80;&#x99;s</description>.
  • These bytes E2 80 99 are the UTF-8 representation of ’ (U+2019, the right single quotation mark).
  • Doing a select * from ttrss_entries from the command line shows these characters as â followed by two boxes. â is the ISO-8859-1 character for E2, and the other two bytes don’t exist in ISO-8859-1.
  • The ttrss web UI and app show them as â€™. That’s these bytes interpreted as Windows-1252.
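
The three readings in the list above can be reproduced with a short Python sketch. Python is used here only for illustration and is not part of tt-rss; the variable names are made up for this example. It decodes the same three bytes under each encoding mentioned:

```python
# The three bytes found (one per character reference) in the feed.
raw = bytes([0xE2, 0x80, 0x99])

# Read as UTF-8: one character, U+2019 (right single quotation mark).
as_utf8 = raw.decode("utf-8")

# Read as ISO-8859-1: a-circumflex plus two non-printable C1 control characters.
as_latin1 = raw.decode("latin-1")

# Read as Windows-1252: a-circumflex, euro sign, trade mark sign.
as_cp1252 = raw.decode("cp1252")

for label, text in [("utf-8", as_utf8), ("latin-1", as_latin1), ("cp1252", as_cp1252)]:
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

The ISO-8859-1 reading matches the "â followed by two boxes" seen on the command line, and the Windows-1252 reading matches what the web UI shows.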

I’m guessing the issue is from the feed itself, but I’m not familiar with the Atom standard and I don’t know what to tell the site operators. Should they change these characters into &#x2019;? (\u2019 is ’ in Unicode). Or should they add an encoding field somewhere? (The feed has <?xml version='1.0' encoding='UTF-8'?> at the top, but I think that’s irrelevant here.)

Alternatively, is ttrss handling this correctly or could it be fixed somehow?

Using Tiny Tiny RSS v17.12 (2c51fac), CentOS 7.4, nginx 1.12.2, PHP 5.4.16, PostgreSQL 9.2.23.


#2

encoding in the xml preamble should be enough to handle the document properly, no specific field should be needed. my guess would be that the field's encoding is incorrect, since that’s more likely than libxml parsing it incorrectly. i don’t really know enough about this stuff to tell you how and why though.


#3

The site is generating the XML file incorrectly.

Let’s start with the encoding declaration. That just means the XML file is to be interpreted as UTF-8. That means the & is a UTF-8-encoded character, as is the #, the x, the E, etc… It has nothing to do with how the character reference is expanded.

For that, we go to the XML specification. Specifically:

If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character’s code point in ISO/IEC 10646.

Crucially, each reference (e.g. &#xE2;) should refer to a single complete character; a single complete code point. This has nothing to do with Windows-1252; code point 0xE2 represents â in Unicode, regardless of which encoding the file itself is stored in.

What they’ve done instead is encode each byte individually, rather than the entire code point for the character. So when tt-rss decodes the XML file correctly according to spec, it interprets each reference (which is incorrectly only a single byte) as a separate code point, resulting in what looks to be incorrect output.
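
To make the difference concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree parser as a stand-in for the libxml parser that tt-rss uses (an assumption for illustration; both should follow the same XML rule for character references). It parses the feed's per-byte references and the single-reference form side by side:

```python
import xml.etree.ElementTree as ET

# What the feed sends: one character reference per UTF-8 byte.
bytewise = ET.fromstring("<description>it&#xE2;&#x80;&#x99;s</description>").text

# What it should send: one reference for the whole code point U+2019.
per_code_point = ET.fromstring("<description>it&#x2019;s</description>").text

# Each reference becomes exactly one code point, as the spec requires.
print([f"U+{ord(ch):04X}" for ch in bytewise])
print([f"U+{ord(ch):04X}" for ch in per_code_point])
```

The first string comes out with three separate code points (U+00E2, U+0080, U+0099) between "it" and "s", while the second has the single intended character.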


This should be fixed by the site operator. But if you really want to fix it on tt-rss’s side, you can create a site-specific plugin that goes and interprets sequences of character references and transforms them into the correct single reference. Of course, this is rather fragile.
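
The repair logic such a plugin would need can be sketched as follows. This is Python rather than the PHP a real tt-rss plugin would be written in, and the function name repair_bytewise_references is hypothetical. It works on the already-parsed text: it looks for a character in the range a UTF-8 lead byte would occupy, followed by characters in the continuation-byte range, and re-decodes that run as UTF-8 if it is valid:

```python
import re

# A possible UTF-8 lead byte (0xC2..0xF4) followed by one or more possible
# continuation bytes (0x80..0xBF), each promoted to its own code point.
_SUSPECT_RUN = re.compile("[\u00c2-\u00f4][\u0080-\u00bf]+")


def repair_bytewise_references(text: str) -> str:
    """Re-decode runs that look like UTF-8 bytes expanded one per code point."""

    def fix(match: re.Match) -> str:
        candidate = match.group(0)
        try:
            # latin-1 maps U+0000..U+00FF back to the identical byte values.
            return candidate.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            # Not a valid UTF-8 sequence after all: leave the text alone.
            return candidate

    return _SUSPECT_RUN.sub(fix, text)


print(repair_bytewise_references("it\u00e2\u0080\u0099s"))
```

The fragility is visible in the pattern itself: legitimate text that happens to contain such a run (for example Â immediately followed by ©) would be rewritten too, so this is only reasonable for a feed known to have the bug.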


#4

Ah, that makes sense. Thanks. I’ll send an email to the site operators.

(Although I still don’t understand how € and ™ came to be. I could only explain them by Windows-1252)


#5

Ah. Hm. Yeah, code points 0x80 and 0x99 are (by themselves) just C1 control characters, not printable ones, so it’s likely your browser is falling back to Windows-1252 for them. Sorry, I didn’t even realise; I was testing with a site that did a similar incorrect conversion.
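
One way to see this kind of fallback in action: HTML parsers remap numeric character references in the 0x80 to 0x9F range to their Windows-1252 counterparts, and Python's html.unescape follows the same HTML5 rules. The sketch below only illustrates that rule; whether tt-rss's output actually reaches the browser as references like these is an assumption that has not been checked here.

```python
import html

# HTML rules (unlike XML rules) remap references 0x80..0x9F via Windows-1252.
shown = html.unescape("it&#xE2;&#x80;&#x99;s")

print(shown)
print([f"U+{ord(ch):04X}" for ch in shown])
```

Under HTML rules the middle two references turn into the euro sign and the trade mark sign, which is the â€™ sequence reported earlier in the thread.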

The rest should hold though: the XML file should be sending entire code points, not single bytes.

Edit: here’s an example of a similar bug, http://www.perlmonks.org/?node_id=708751. Generally it’s because they forgot to decode whatever they’re reading the UTF-8 data from and end up sending the (incorrect) raw bytes into the XML encoder.
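
A minimal sketch of how that kind of bug produces exactly this feed's output, written in Python rather than the Perl of the linked post. The helper to_char_refs is hypothetical and stands in for whatever XML encoder the site uses; the buggy path is simulated by treating each input byte as its own character (which is what decoding as latin-1 does):

```python
def to_char_refs(text: str) -> str:
    """Escape every non-ASCII character as a hexadecimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


# The UTF-8 bytes as read from a file or database.
raw = "it\u2019s".encode("utf-8")

# Bug: bytes are never decoded as UTF-8, so each byte is escaped on its own.
buggy = to_char_refs(raw.decode("latin-1"))

# Fix: decode the input as UTF-8 first, then escape whole code points.
correct = to_char_refs(raw.decode("utf-8"))

print(buggy)
print(correct)
```

The buggy path reproduces the it&#xE2;&#x80;&#x99;s seen in the feed, and the fixed path yields a single reference for U+2019.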