Af_readability: Relative URLs Not Rewritten Correctly

  • [ ] I’m using stock docker compose setup, unmodified.
  • [ ] I’m using docker compose setup, with modifications (modified .yml files, third party plugins/themes, etc.) - if so, describe your modifications in your post. Before reporting, see if your issue can be reproduced on the unmodified setup.
  • [x] I’m not using docker on my primary instance, but my issue can be reproduced on the aforementioned docker setup and/or official demo.

Description

The af_readabilty plugin does not handle relative URLs correctly. The HTML spec requires that the last element be removed from the base URL before appending the relative URL. af_readability, however, simply appends the relative URL to the article base.

(I know, relative URLs suck…)

Spec: URL Standard, relative state, item 5.4.2.
af_readability code: https://git.tt-rss.org/fox/tt-rss.git/tree/plugins/af_readability/init.php#n228

Example

The article https://apod.nasa.gov/apod/ap220315.html links to the image image/2203/Road2Stars_EsoHoralek_1080.jpg. The af_readability rewrites this to https://apod.nasa.gov/apod/ap220315.html/image/2203/Road2Stars_EsoHoralek_4000.jpg, i.e. it appends the relative URL without removing the last element ap220315.html.

Steps to reproduce

  • Add the APOD RSS feed https://apod.nasa.gov/apod.rss.
  • Enable af_readability plugin.
  • Enable Inline article content.
  • Fetch feed contents. The images are not displayed, and the links are broken.

Environment

  • Tiny Tiny RSS version (including git commit id): v22.03-1703850
  • Platform (i.e. Linux distro, Docker, PHP, PostgreSQL, etc) versions: Official demo
  1. relative urls have no place in rss feeds
  2. rewriting is there for security reasons, it only does the bare minimum to ensure you wouldn’t own yourself
  3. this is not going to change

Thanks for the fast reply!

Just to be clear: This is not about relative URLs in the feed, but on the website that af_readability fetches. And while I totally agree that feeds should always use absolute URLs, relative URLs are unfortunately still very common on regular websites.

If you do not want to change this behavior, then it would at least be good to document that af_readability does not support websites with relative links.

ah, right, readability. i don’t think readability canonicalizes URLs at all, so tt-rss processes resulting HTML data rewriting URLs as if they were in the feed.

readability is, by design, a hack. it works, kinda, until it doesn’t. that’s just how it is. i think there’s a README warning on the plugin to this effect although maybe i’m misremembering things.

we used to have a more complicated URL rewriting before which was more clever with relative URL parts but the contributed implementation was so opaque I had trouble understanding what it did so I just cut it away and replaced with something much simpler. dealing with relative URLs is of critical importance so it makes sense to have it as bulletproof as possible even if it breaks edge cases like readability.

I just checked, there is no README or similar in plugins/af_readability. Also, I did not find anything in the wiki or the FAQ.

I agree that the URL spec is too complicated and that it would be too complex to cover all cases. To fix the majority of cases, however, it would suffice to remove everything after the base URL’s last slash if the relative URL’s path is nonempty (see RFC 3986).

Again, if you do not want to make this change, then please consider documenting this behavior somewhere.

Oh, maybe even better: PHP Readability has an option to fix relative URLs: GitHub - fivefilters/readability.php: PHP port of Mozilla's Readability.js

that looks good, i’ll make a note to poke at this.

adjusting core rewriting behavior could also be useful, it’ll help to have a bunch of test urls i.e. original base A, relative part B, should rewrite to C.

as usual, no eta.

Thank you! And yes, this is not a critical issue, so take your time.

this should deal with readability-php configuration:

https://git-gitea.tt-rss.org/fox/tt-rss/commit/711662948768492e8d05b778a7d80eacaec368d2

i’ll take a look at core URL rewriting next. readability has a fairly simple implementation (at Readability.php:571) maybe I could lift something there.

I just tried your changes, and they fix the APOD images. Thank you!

https://git-gitea.tt-rss.org/fox/tt-rss/commit/1c4f7ab3b838b23afb2ee4dab14acbf75956e952

more examples of various relative-to-absolute URLs would be very helpful to add to the phpunit test i’ve added.

Mozilla has an extensive set of test cases:
https://searchfox.org/mozilla-central/source/netwerk/test/gtest/urltestdata.json
It would be total overkill in its entirety, but maybe you can find some inspiration there. Off the top of my head, I would add the following cases:

  • http://example.com/test/url + (empty path) → http://example.com/test/url (empty relative path does not remove last element)
  • http://example.com/test/url + /http://example.com/ (root path)
  • http://www.example2.com/test + http://www.example.com/test2http://www.example.com/test2 (URLs with scheme are interpreted as being absolute)

thanks!

that won’t work, second URL is not relative so it would just return “http://www.example.com/test”. i don’t see why it should too, that would be arbitrarily moving paths between domains.

Sorry, my mistake, I fixed my post above.

https://git-gitea.tt-rss.org/fox/tt-rss/commit/e35a4a1306d7fe0736d2f6ba3e0284308d29ebd0

Hi
Just updated to the latest and getting these 2 types of errors constantly

yeah, that should be already fixed in trunk.

Ah sorry, looks like I jumped the gun.

Edit:
I’m still seeing the same issue, have to see whether other people report the same thing

edit: latest update fixed this