[ ] I’m using stock docker compose setup, unmodified.
[ ] I’m using docker compose setup, with modifications (modified .yml files, third party plugins/themes, etc.) - if so, describe your modifications in your post. Before reporting, see if your issue can be reproduced on the unmodified setup.
[x] I’m not using docker on my primary instance, but my issue can be reproduced on the aforementioned docker setup and/or official demo.
Description
The af_readabilty plugin does not handle relative URLs correctly. The HTML spec requires that the last element be removed from the base URL before appending the relative URL. af_readability, however, simply appends the relative URL to the article base.
The article https://apod.nasa.gov/apod/ap220315.html links to the image image/2203/Road2Stars_EsoHoralek_1080.jpg. The af_readability rewrites this to https://apod.nasa.gov/apod/ap220315.html/image/2203/Road2Stars_EsoHoralek_4000.jpg, i.e. it appends the relative URL without removing the last element ap220315.html.
Steps to reproduce
Add the APOD RSS feed https://apod.nasa.gov/apod.rss.
Enable af_readability plugin.
Enable Inline article content.
Fetch feed contents. The images are not displayed, and the links are broken.
Environment
Tiny Tiny RSS version (including git commit id): v22.03-1703850
Platform (i.e. Linux distro, Docker, PHP, PostgreSQL, etc) versions: Official demo
Just to be clear: This is not about relative URLs in the feed, but on the website that af_readability fetches. And while I totally agree that feeds should always use absolute URLs, relative URLs are unfortunately still very common on regular websites.
If you do not want to change this behavior, then it would at least be good to document that af_readability does not support websites with relative links.
ah, right, readability. i don’t think readability canonicalizes URLs at all, so tt-rss processes resulting HTML data rewriting URLs as if they were in the feed.
readability is, by design, a hack. it works, kinda, until it doesn’t. that’s just how it is. i think there’s a README warning on the plugin to this effect although maybe i’m misremembering things.
we used to have a more complicated URL rewriting before which was more clever with relative URL parts but the contributed implementation was so opaque I had trouble understanding what it did so I just cut it away and replaced with something much simpler. dealing with relative URLs is of critical importance so it makes sense to have it as bulletproof as possible even if it breaks edge cases like readability.
I just checked, there is no README or similar in plugins/af_readability. Also, I did not find anything in the wiki or the FAQ.
I agree that the URL spec is too complicated and that it would be too complex to cover all cases. To fix the majority of cases, however, it would suffice to remove everything after the base URL’s last slash if the relative URL’s path is nonempty (see RFC 3986).
Again, if you do not want to make this change, then please consider documenting this behavior somewhere.
that looks good, i’ll make a note to poke at this.
adjusting core rewriting behavior could also be useful, it’ll help to have a bunch of test urls i.e. original base A, relative part B, should rewrite to C.
i’ll take a look at core URL rewriting next. readability has a fairly simple implementation (at Readability.php:571) maybe I could lift something there.
that won’t work, second URL is not relative so it would just return “http://www.example.com/test”. i don’t see why it should too, that would be arbitrarily moving paths between domains.