ParseException thrown in af_redditimgur


#1

Sorry fox, I’m not at home and can’t really dig into this as much as I usually would. Guessing this is some niche website outputting their response in a non-standard way because reasons.

Describe the problem you’re having:

[22:12:01/30699] article processed
[22:12:01/30699] guid 1,t3_91fa88 / SHA1:4f92c07a8a4ae127eb6896a81467b0e42d6de3d0
[22:12:01/30699] orig date: 1532417236
[22:12:01/30699] date 1532417236 [2018/07/24 07:27:16]
[22:12:01/30699] title elfbac - runtime intent-level ABI-granular memory protection for Linux
[22:12:01/30699] link https://www.reddit.com/r/netsec/comments/91fa88/elfbac_runtime_intentlevel_abigranular_memory/
[22:12:01/30699] author /u/wademealing
[22:12:01/30699] num_comments: 0
[22:12:01/30699] looking for tags…
[22:12:01/30699] tags found: netsec
[22:12:01/30699] done collecting data.
[22:12:01/30699] article hash: 43e8bc37359ace5179881c159041baccdd1a938b [stored=]
[22:12:01/30699] hash differs, applying plugin filters:
[22:12:01/30699] … Af_Comics
[22:12:01/30699] === 0.0000 (sec)
[22:12:01/30699] … Af_Fsckportal
[22:12:01/30699] === 0.0001 (sec)
[22:12:01/30699] … Af_RedditImgur
PHP Fatal error: Uncaught andreskrey\Readability\ParseException: Invalid or incomplete HTML. in /opt/tt-rss/vendor/andreskrey/Readability/Readability.php:142
Stack trace:
#0 /opt/tt-rss/plugins/af_redditimgur/init.php(527): andreskrey\Readability\Readability->parse(’\r\nreadability(Array, ‘http://elfbac.o…’, Object(DOMDocument), Object(DOMXPath))
#2 /opt/tt-rss/classes/rssutils.php(754): Af_RedditImgur->hook_article_filter(Array)
#3 /opt/tt-rss/update.php(415): RSSUtils::update_rss_feed(‘241’)
#4 {main}
thrown in /opt/tt-rss/vendor/andreskrey/Readability/Readability.php on line 142

If possible include steps to reproduce the problem:

The exception is thrown every time tt-rss tries to parse https://www.reddit.com/r/netsec/.rss at the moment, whether the update runs automatically or via the feed debugger.

tt-rss version (including git commit id):

Tiny Tiny RSS v17.12 (a2d1fa5)

Platform (i.e. Linux distro, PHP, PostgreSQL, etc) versions:

Ubuntu 18.04
lighttpd (unsure on version, pi-hole switched this on me and I haven’t gotten around to fixing it)
php 7.2.7-0ubuntu0.18.04.2
postgres 9.6.8 (thought I was on 10 something, but this was the select version() output)

Please provide any additional information below:

Obviously the feed could change before you can check it, so here’s a pastebin of the feed containing the troublesome entry, if you don’t get a crack at it: https://pastebin.com/0qp0fPdi

Seems like it may be generally breaking tt-rss’ ability to check feeds until the troublesome entry is cleared. Going to turn off af_redditimgur for now. I can see in journalctl that other sites/entries have done this, but it’s too far back for me to get many details about that.

Edit:
Meant to save you a click. The submission just points to “http://elfbac.org/”.

Edit 2:
Got a few minutes that I used to dig. Fetching elfbac.org with wget redirects to http://elfbac.org/SViZZ (probably changes per user), which then redirects BACK to elfbac.org, which is just a frame that embeds http://www.cs.dartmouth.edu/~sergey/elfbac/. Sorry, I’m an inferior user unfamiliar with curl from the command prompt. Not sure exactly why readability chokes on this nonsense, but I’m not particularly surprised that it does, either.


#2

well, regardless of the site behaving badly or not, tt-rss should handle that exception

i’ll take a look tomorrow, thanks for reporting


#3

i’m probably missing something obvious but ParseException is already handled in af_redditimgur:

the code is exactly the same as in the docs:

maybe it’s the wrong kind of ParseException (?) and it should be named including the namespace, idk. if this still happens you can try replacing that with generic catch (Exception $e) and see if that helps.
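
For reference, a minimal sketch of the namespace guess above, assuming the catch block names the class without importing it. Demo\Readability and its parse() are hypothetical stand-ins, not the real andreskrey\Readability code; only PHP's name resolution is being illustrated here.

```php
<?php
// Hypothetical stand-in for a namespaced library; NOT the real Readability code.
namespace Demo\Readability {
    class ParseException extends \Exception {}

    // Always throws, to mimic a parser rejecting its input.
    function parse(string $html): string {
        throw new ParseException('Invalid or incomplete HTML.');
    }
}

namespace {
    // There is no "use Demo\Readability\ParseException;" in this block, so the
    // bare name ParseException below resolves to \ParseException, which does
    // not exist. PHP does not complain; the catch simply never matches.
    function try_unqualified(): string {
        try {
            try {
                return \Demo\Readability\parse('<frameset>');
            } catch (ParseException $e) {
                return 'caught by unqualified name';
            }
        } catch (\Exception $e) {
            // Outer safety net so the demo does not end in a fatal error.
            return 'escaped the unqualified catch';
        }
    }

    // Naming the class with its full namespace matches the thrown exception.
    function try_qualified(): string {
        try {
            return \Demo\Readability\parse('<frameset>');
        } catch (\Demo\Readability\ParseException $e) {
            return 'caught by fully qualified name';
        }
    }

    echo try_unqualified(), "\n"; // escaped the unqualified catch
    echo try_qualified(), "\n";   // caught by fully qualified name
}
```

If the plugin file already has a matching use statement, the bare name resolves correctly and this is not the cause. Catching the fully qualified name, or a generic \Exception, sidesteps the question either way.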


Uncaught exception in readability


#4

Hi Fox,

I’m probably going to be really busy and away from the internet for the next week or two, but I do want to dig into this issue and try to figure out what’s going wrong. It has happened at least four times in July (more so with /r/netsec, but also with my front page feed), so it’s definitely not just one person’s weird website setup, and it’s happening just often enough to irritate me into action.

I will follow up once I’ve had time to dig into this; it just may be a while before things settle down enough. Thank you for taking a look and seeing if anything stuck out.


#5

fix is in trunk, i guess (if it works, i don’t have any feeds where readability does this)


#6

Got a chance to sit down and pull from git, then test using the pastebin I had created. It seems to parse that entry now without error, which you probably already knew.

I’ll try to keep an eye out for repeats, but it seemed to be sporadic. Thanks for taking care of this.