Tiny Tiny RSS: Community

Codepage not detected in af_readability

#1

tt-rss git (48c2db6ef1), Ubuntu 18.04, Mariadb 10.4 RC1, php 7.3, nginx/1.17.0.

Describe the problem you’re having:

Articles from koi8-r sites are displayed in the wrong encoding.
Screenshot: https://cdn1.savepice.ru/uploads/2019/6/6/e2d8e76e2a667f6fe4537218e2745348-full.png

If possible include steps to reproduce the problem:

  1. Enable af_readability.
  2. Subscribe to http://www.opennet.ru/opennews/opennews_all_noadv.rss
  3. Enable parsing feed with af_readability.
  4. Fetch this feed.

tt-rss version (including git commit id):

48c2db6ef1 (all versions in fact)

Platform (i.e. Linux distro, PHP, PostgreSQL, etc) versions:

Ubuntu 18.04, Mariadb 10.4 RC1, php 7.3, nginx/1.17.0.

Please provide any additional information below:

This seems to happen with every RSS feed whose linked pages use an encoding other than UTF-8.
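
To illustrate the symptom, here is a minimal Python sketch (not tt-rss code) of what happens when koi8-r bytes are decoded with the wrong charset. The sample word and the choice of latin-1 as the wrong charset are assumptions made only for this demonstration; which charset tt-rss actually falls back to has not been checked.

```python
# Sample Cyrillic text ("News"); an assumption chosen for the demo.
text = "Новости"

# What a koi8-r site would actually send over the wire.
raw = text.encode("koi8-r")

# Decoding with the wrong charset silently yields garbage,
# because every byte is also a valid latin-1 character.
wrong = raw.decode("latin-1")

# Decoding with the declared charset recovers the original text.
right = raw.decode("koi8-r")

print(wrong)
print(right)
```

The wrong decode raises no error, so nothing short of reading the declared charset tells the parser that something is off.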

#2

antediluvian markup of this site gives me strong nostalgia feels:

<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0 WIDTH="100%">

they couldn’t pay me enough to deal with this codebase. 🤮

anyway, actually af_readability has a hack to deal with non-unicode feeds of this nature.

unfortunately it didn’t work this time because i didn’t expect there were still sites out there not closing their <meta> tags in place:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=koi8-r">

i guess i should have expected this because obsolete markup and obsolete charsets go hand in hand.

luckily it’s a simple fix: https://git.tt-rss.org/fox/tt-rss/commit/967cccb7c58b28dccfa4e9599c4e282e418f8c67
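
for anyone curious, the general idea of picking up a declared charset from a `<meta>` tag, whether or not the tag is closed in place, can be sketched in Python roughly like this. this is not the code from the commit above (tt-rss is PHP). the names `detect_meta_charset` and `decode_html` are hypothetical, and so are the 4096-byte scan window and the utf-8 fallback.

```python
import re

# Match "charset=..." inside any <meta ...> tag, case-insensitively.
# The pattern stops at the charset value, so it does not care whether
# the tag ends with ">" or "/>".
META_CHARSET_RE = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def detect_meta_charset(html_bytes, default="utf-8"):
    """Return the charset declared in a <meta> tag, or `default`."""
    # Only scan the start of the document (assumed window size).
    match = META_CHARSET_RE.search(html_bytes[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


def decode_html(html_bytes):
    """Decode raw HTML bytes using the charset the page declares."""
    charset = detect_meta_charset(html_bytes)
    try:
        return html_bytes.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to UTF-8.
        return html_bytes.decode("utf-8", errors="replace")


page = (
    b'<HTML><HEAD><META HTTP-EQUIV="Content-Type" '
    b'CONTENT="text/html; charset=koi8-r"></HEAD><BODY>'
    + "Новости".encode("koi8-r")
    + b"</BODY></HTML>"
)
print(detect_meta_charset(page))
print(decode_html(page))
```

matching on the `charset=` value rather than on the whole tag is what makes the unclosed `<META ...>` form harmless here.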

#3

Thank you. It works now.