Tiny Tiny RSS: Community

Handling of Character not in repertoire: 7 ERROR: invalid byte sequence for encoding "UTF8"

#1

I’ve got an XML feed which is technically not valid (auto-generated by a tool I do not control) which is accessible here
Putting it through https://fakecake.org/myfeedsucks/ does not indicate any problems (I guess thanks to the feed parser’s forgiving error handling for such feeds).
Unfortunately, this error handling is too forgiving when it comes to storing the content in PostgreSQL. The insert fails with this error:

SQLSTATE[22021]: Character not in repertoire: 7 ERROR: invalid byte sequence for encoding "UTF8": 0x93
#0 /var/www/tt-rss/classes/rssutils.php(852): PDOStatement->execute(Array)
#1 /var/www/tt-rss/classes/rssutils.php(148): RSSUtils::update_rss_feed(210, true, false)
#2 /var/www/tt-rss/update.php(205): RSSUtils::update_daemon_common(50)
#3 {main}

Full log at https://csachweh.de/share/debug_feed.txt

I would expect that if the feed parser forgives XML errors, the escaping should be able to handle this too. Otherwise the whole feed should be marked as defective, because right now you only notice this update failure by looking into the logs.
Nevertheless, I will try to reach out to the developer of the tool I am using.

tt-rss version (including git commit id):

current git version cb0c81729d30daae48f15c4d9dbc026aea506710
on
Debian Stretch with PHP 7.0.33-0+deb9u3 and PostgreSQL 9.6

#2

that’s strange. if libxml actually parsed the document and considers it unicode, one would assume it shouldn’t return data which isn’t unicode-clean.

on my postgres tt-rss instance i’m not seeing any database errors with this feed but it does show broken unicode characters, i.e.:

Leser 141/1: Zunächst vielen Dank für Ihren Karriere-Newsletter, der stets interessante und hilfreiche Informationen enthält. Insbesondere die Checkliste in Nr. 137 hat mir sehr gefallen und wird in der nächsten Bewerbungsphase zum Einsatz kommen.

:thinking:

this encoding-related stuff is always a colossal pain in the ass for some reason.

e: there’s this thing at feedparser.php:65, maybe this particular character range (?) should be added there too?

i’m sort of coming to the conclusion here that maybe removing the libxml-related hacks would be for the best. we go through entirely too much effort to parse broken content, which can (and will) produce unpredictably broken results.

#3

RIP charset hacks: https://git.tt-rss.org/fox/tt-rss/commit/1a484ec3f58ff5e7a5745f231b68cb64ba65929d

let’s see how this goes. it’s going to break some amount of feeds but ultimately it’s for the best. trying to fix XML with regular expressions is never a good idea. what was i thinking, etc.

#4

Thank you, now it is displayed as defective. At least for my system this is the only feed which fails, happy me!
Would it be possible to activate those xml hacks as a plugin for specific feeds?

#5

certainly, using HOOK_FEED_FETCHED.

#6

Encoding stuff is weird.
I said the feed was failing and I was satisfied with that, but then the author fixed his XML generation, so now it is valid again.
Unfortunately the failure message stays:
PHP Fatal error: Uncaught PDOException: SQLSTATE[22021]: Character not in repertoire: 7 ERROR: invalid byte sequence for encoding "UTF8": 0x93 in /var/www/tt-rss/classes/rssutils.php:852

This means the feed is failing again but is not marked as invalid, because it only fails when it is stored in Postgres (add debug url for reproducing).

I think my database was created with the right encoding?

$ psql ttrss -c 'SHOW SERVER_ENCODING'
 server_encoding 
-----------------
 UTF8
(1 row)

The byte 0x93 (“ in Windows-1252) seems to be the problem?
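
To illustrate why a byte like 0x93 is rejected, here is a minimal Python sketch (Python rather than tt-rss’s PHP, purely for illustration). The sample bytes and the helper name try_decode are made up and not taken from the feed. A lone 0x93 is a continuation byte in UTF-8 and cannot start a character, while in Windows-1252 it is the left curly quote:

```python
from typing import Optional

# Hypothetical sample: curly quotes as a Windows-1252 tool would write them.
raw = b"\x93Checkliste\x94"


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")      # None: 0x93 cannot start a UTF-8 sequence
as_cp1252 = try_decode(raw, "cp1252")   # decodes to text wrapped in curly quotes
```

If bytes like these reach a UTF8 database unconverted, an error of the kind quoted above would be the expected result.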

My server is returning these headers:

Date: Thu, 04 Apr 2019 11:11:02 GMT
Server: Apache/2.4.25 (Debian)
Upgrade: h2,h2c
Connection: Upgrade
Last-Modified: Thu, 04 Apr 2019 11:00:56 GMT
ETag: "47d1e-585b248dd9d15"
Accept-Ranges: bytes
Content-Length: 294174
Vary: Accept-Encoding
Content-Type: application/xml

But tt-rss behaves the same with the original host, which responds with:

HTTP/2 200 
accept-ranges: bytes
content-type: application/xml
etag: "ppaqp36b1b"
last-modified: Mon, 01 Apr 2019 19:12:39 GMT
server: Caddy
content-length: 294239
date: Thu, 04 Apr 2019 11:11:28 GMT

After reading a mailing list thread, https://www.postgresql.org/message-id/[email protected]ing.com, this seems to be related to the client_encoding?
A comment in the PHP documentation uses the function with additional options. As I interpret the code, tt-rss calls it with the defaults, $source_encoding = mb_detect_encoding($text);, which maybe returns the “wrong” result for this specific case?

TL;DR:
XML hacks can come back, they weren’t the problem. The problem seems to be how the encoding of the response is determined.

e: seems like tt-rss does not take any response headers into account? charset=Windows-1252
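
For reference, this is roughly what reading a charset out of a Content-Type header looks like, sketched in Python with the standard library. The function name charset_from_content_type is hypothetical, and nothing here reflects how tt-rss actually fetches feeds; it only shows what information the header can carry:

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


with_charset = charset_from_content_type("application/xml; charset=Windows-1252")
without_charset = charset_from_content_type("application/xml")
```

Note that the two header dumps above only say application/xml with no charset parameter, in which case there is nothing to read.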

#7

we can’t really feed any server headers to DOMDocument (libxml), it parses the data with the information that is there (your previous feed dump had a charset specified in the xml preamble)

if charset is set to utf-8 in the preamble, server charset shouldn’t matter. if xml specifies utf8 but is actually cp-whatever, then the feed is broken and libxml shouldn’t parse it in the first place.
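
the distinction above can be sketched as a check of the bytes against the declared encoding. this is an illustration in Python, not what libxml or tt-rss does: the function names are hypothetical, the regular expression only reads the declaration (it does not try to fix anything), BOMs are ignored, and a missing declaration is assumed to mean UTF-8.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
_XML_DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(data: bytes) -> str:
    """Return the encoding named in the XML declaration, defaulting to utf-8."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else "utf-8"


def matches_declaration(data: bytes) -> bool:
    """True if the bytes decode cleanly in the encoding the document claims."""
    try:
        data.decode(declared_encoding(data))
    except (UnicodeDecodeError, LookupError):
        # Invalid bytes for that encoding, or an encoding name Python does not know.
        return False
    return True


# Hypothetical documents: one claims UTF-8 but carries a cp1252 byte, one is honest.
mislabelled = b'<?xml version="1.0" encoding="UTF-8"?><t>\x93</t>'
honest = b'<?xml version="1.0" encoding="windows-1252"?><t>\x93</t>'
```

here matches_declaration(mislabelled) is False (the "feed is broken" case), while matches_declaration(honest) is True.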

maybe the problem is within your postgres setup because, as i posted above, i didn’t get any postgres errors with that feed you posted. unicode characters were broken but inserted just fine.

#8

I am clueless, what can I search for other than the default encoding of my database? I only installed Postgres from the repositories and the only database is ttrss.

#9

honestly no idea. i always just create them with unicode encoding and it seems to work. fwiw tt-rss also specifies unicode when connecting.

i don’t think i’ll be able to help here, i’m afraid.

#10

i’ve fixed a possibly related issue today entirely by accident: DOMDocument apparently returns actual document data if it fails to parse something. in case of non-unicode documents, this data is not unicode clean.

this possibility never occurred to me, so the SQL logger didn’t specifically ensure that error messages are valid unicode, which caused another (always fatal) error when trying to log the previous, possibly non-fatal, condition.

what i’m saying here is this specific error message might have been a misleading secondary one.
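
a minimal sketch of that kind of sanitising, in Python for illustration and assuming the message arrives as raw bytes. the function name is hypothetical and this is not the actual tt-rss change; it only shows how invalid bytes can be swapped for U+FFFD so that logging cannot fail on them:

```python
def to_clean_utf8(message: bytes) -> str:
    """Decode a log message, replacing bytes that are not valid UTF-8 with U+FFFD."""
    return message.decode("utf-8", errors="replace")


# Hypothetical parser error that quotes a non-unicode fragment of the document.
dirty = b"parse error near: \x93Checkliste"
clean = to_clean_utf8(dirty)

# The cleaned text re-encodes as valid UTF-8, so a UTF8 database would accept it.
encoded = clean.encode("utf-8")
```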

e: also, i thought about passing charset header somehow in the feed parser but it seems like this would not be very productive - XML document should specify its encoding in the preamble anyway and libxml usually deals with it properly, trying to override it somehow with a possibly misconfigured server header is just asking for more hard to diagnose errors.

#11

Thank you for remembering. Unfortunately it did not remove the warning. If I am really bored at some point, I will first find out how to reproduce this problem and then write again.