Af_psql_trgm and duplicate messages

Hello,

I have been using TTRSS for several years, but so far I have only been a reader on this forum. I recently subscribed to the following news feed: http://www.tt.com/rss/news.xml. Unfortunately, many posts in it are repeated several times. I wanted to filter these out with the “af_psql_trgm” plugin, but that does not work. Here is an excerpt from the log:

[12:54:14/3501020] guid 2,https://www.tt.com/go/17703074 (hash: {"ver":2,"uid":2,"hash":"SHA1:78497ff45b44fade36912ca477859053ab459da5"} compat: SHA1:615627c26315a48f42f2b7f7e62b781afffb0add)
[12:54:14/3501020] orig date: 1609618680
[12:54:14/3501020] title Schwerer Anschlag im Niger: Mehr als 50 Tote
[12:54:14/3501020] link https://www.tt.com/go/17703074
[12:54:14/3501020] language de
[12:54:14/3501020] author 
[12:54:14/3501020] looking for tags...
[12:54:14/3501020] tags found: __special, newsticker
[12:54:14/3501020] done collecting data.
[12:54:14/3501020] looking for enclosures...
[12:54:14/3501020] article hash: a1bfd5d807542e4cd84b8cd6b7d1d4bbd68d76b5 [stored=a1bfd5d807542e4cd84b8cd6b7d1d4bbd68d76b5]
[12:54:14/3501020] hash differs, applying plugin filters:
[12:54:14/3501020] ... Af_Fsckportal
[12:54:14/3501020] === 0.0001 (sec)
[12:54:14/3501020] ... Af_Psql_Trgm
[12:54:14/3501020] === 0.0003 (sec)
[12:54:14/3501020] ... Af_Readability
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 95 16777216
[12:54:14/3501020] [curl progressfunction] 95 16777216
[12:54:14/3501020] [curl progressfunction] 95 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 3630 16777216
[12:54:14/3501020] [curl progressfunction] 3630 16777216
[12:54:14/3501020] [curl progressfunction] 3630 16777216
[12:54:14/3501020] [curl progressfunction] 3630 16777216
[12:54:14/3501020] [curl progressfunction] 7215 16777216
[12:54:14/3501020] [curl progressfunction] 7215 16777216
[12:54:14/3501020] [curl progressfunction] 7215 16777216
[12:54:14/3501020] [curl progressfunction] 7215 16777216
[12:54:14/3501020] [curl progressfunction] 7215 16777216
[12:54:14/3501020] [curl progressfunction] 7215 16777216
[12:54:14/3501020] [curl progressfunction] 7215 16777216
[12:54:14/3501020] [curl progressfunction] 7215 16777216
[12:54:14/3501020] [curl progressfunction] 14781 16777216
[12:54:14/3501020] [curl progressfunction] 14781 16777216
[12:54:14/3501020] [curl progressfunction] 14781 16777216
[12:54:14/3501020] [curl progressfunction] 14781 16777216
[12:54:14/3501020] [curl progressfunction] 18702 16777216
[12:54:14/3501020] [curl progressfunction] 18702 16777216
[12:54:14/3501020] [curl progressfunction] 18702 16777216
[12:54:14/3501020] [curl progressfunction] 18702 16777216
[12:54:14/3501020] [curl progressfunction] 18702 16777216
[12:54:14/3501020] [curl progressfunction] 18702 16777216
[12:54:14/3501020] [curl progressfunction] 18702 16777216
[12:54:14/3501020] [curl progressfunction] 18702 16777216
[12:54:14/3501020] [curl progressfunction] 18702 16777216
[12:54:14/3501020] [curl progressfunction] 18702 16777216
[12:54:14/3501020] [curl progressfunction] 27645 16777216
[12:54:14/3501020] [curl progressfunction] 27645 16777216
[12:54:14/3501020] [curl progressfunction] 32520 16777216
[12:54:14/3501020] [curl progressfunction] 32520 16777216
[12:54:14/3501020] [curl progressfunction] 35009 16777216
[12:54:14/3501020] [curl progressfunction] 35009 16777216
[12:54:14/3501020] [curl progressfunction] 35009 16777216
[12:54:14/3501020] [curl progressfunction] 35009 16777216
[12:54:14/3501020] === 0.5873 (sec)
[12:54:14/3501020] ... Feediron
[12:54:14/3501020] === 0.0000 (sec)
[12:54:14/3501020] plugin data: af_fsckportal,af_psql_trgm,af_readability,feediron,
[12:54:14/3501020] matched filters: 
[12:54:14/3501020] matched filter rules: 
[12:54:14/3501020] filter actions: 
[12:54:14/3501020] date 1609618680 [2021/01/02 20:18:00]
[12:54:14/3501020] num_comments: 0
[12:54:14/3501020] article labels:
[12:54:14/3501020] force catchup: 
[12:54:14/3501020] base guid found, checking for user record
[12:54:14/3501020] initial score: 0 [including plugin modifier: 0]
[12:54:14/3501020] user record FOUND: RID: 266468, IID: 235967
[12:54:14/3501020] resulting RID: 266468, IID: 235967
[12:54:14/3501020] article updated, but we're forbidden to mark it unread.
[12:54:14/3501020] assigning labels [other]...
[12:54:14/3501020] assigning labels [filters]...
[12:54:14/3501020] article enclosures:
Array
(
)
[12:54:14/3501020] filtered tags: __special, newsticker
[12:54:14/3501020] article processed
[12:54:14/3501020] guid 2,https://www.tt.com/go/17703052 (hash: {"ver":2,"uid":2,"hash":"SHA1:4f69e88eaadc89d08d91c762b6236e625b11123c"} compat: SHA1:ee06737c8b61b7acbf72eedbcf264d6b514daeed)
[12:54:14/3501020] orig date: 1609617600
[12:54:14/3501020] title Schwerer Anschlag im Niger: Mehr als 50 Tote
[12:54:14/3501020] link https://www.tt.com/go/17703052
[12:54:14/3501020] language de
[12:54:14/3501020] author 
[12:54:14/3501020] looking for tags...
[12:54:14/3501020] tags found: __special, newsticker
[12:54:14/3501020] done collecting data.
[12:54:14/3501020] looking for enclosures...
[12:54:14/3501020] article hash: 79807604d4caf41beb4e192a9967609dafc96d4b [stored=79807604d4caf41beb4e192a9967609dafc96d4b]
[12:54:14/3501020] hash differs, applying plugin filters:
[12:54:14/3501020] ... Af_Fsckportal
[12:54:14/3501020] === 0.0001 (sec)
[12:54:14/3501020] ... Af_Psql_Trgm
[12:54:14/3501020] === 0.0004 (sec)
[12:54:14/3501020] ... Af_Readability
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:14/3501020] [curl progressfunction] 0 16777216
[12:54:15/3501020] [curl progressfunction] 95 16777216
[12:54:15/3501020] [curl progressfunction] 95 16777216
[12:54:15/3501020] [curl progressfunction] 95 16777216
[12:54:15/3501020] [curl progressfunction] 0 16777216
[12:54:15/3501020] [curl progressfunction] 0 16777216
[12:54:15/3501020] [curl progressfunction] 0 16777216
[12:54:15/3501020] [curl progressfunction] 0 16777216
[12:54:15/3501020] [curl progressfunction] 0 16777216
[12:54:15/3501020] [curl progressfunction] 0 16777216
[12:54:15/3501020] [curl progressfunction] 7363 16777216
[12:54:15/3501020] [curl progressfunction] 7363 16777216
[12:54:15/3501020] [curl progressfunction] 7363 16777216
[12:54:15/3501020] [curl progressfunction] 7363 16777216
[12:54:15/3501020] [curl progressfunction] 7363 16777216
[12:54:15/3501020] [curl progressfunction] 7363 16777216
[12:54:15/3501020] [curl progressfunction] 16409 16777216
[12:54:15/3501020] [curl progressfunction] 16409 16777216
[12:54:15/3501020] [curl progressfunction] 19541 16777216
[12:54:15/3501020] [curl progressfunction] 19541 16777216
[12:54:15/3501020] [curl progressfunction] 24882 16777216
[12:54:15/3501020] [curl progressfunction] 24882 16777216
[12:54:15/3501020] [curl progressfunction] 29146 16777216
[12:54:15/3501020] [curl progressfunction] 29146 16777216
[12:54:15/3501020] [curl progressfunction] 32822 16777216
[12:54:15/3501020] [curl progressfunction] 32822 16777216
[12:54:15/3501020] [curl progressfunction] 32822 16777216
[12:54:15/3501020] [curl progressfunction] 32822 16777216
[12:54:15/3501020] [curl progressfunction] 32822 16777216
[12:54:15/3501020] [curl progressfunction] 32822 16777216
[12:54:15/3501020] === 0.8777 (sec)
[12:54:15/3501020] ... Feediron
[12:54:15/3501020] === 0.0000 (sec)
[12:54:15/3501020] plugin data: af_fsckportal,af_psql_trgm,af_readability,feediron,
[12:54:15/3501020] matched filters: 
[12:54:15/3501020] matched filter rules: 
[12:54:15/3501020] filter actions: 
[12:54:15/3501020] date 1609617600 [2021/01/02 20:00:00]
[12:54:15/3501020] num_comments: 0
[12:54:15/3501020] article labels:
[12:54:15/3501020] force catchup: 
[12:54:15/3501020] base guid found, checking for user record
[12:54:15/3501020] initial score: 0 [including plugin modifier: 0]
[12:54:15/3501020] user record FOUND: RID: 266469, IID: 235968
[12:54:15/3501020] resulting RID: 266469, IID: 235968
[12:54:15/3501020] article updated, but we're forbidden to mark it unread.
[12:54:15/3501020] assigning labels [other]...
[12:54:15/3501020] assigning labels [filters]...
[12:54:15/3501020] article enclosures:
Array
(
)
[12:54:15/3501020] filtered tags: __special, newsticker
[12:54:15/3501020] article processed

I have the following settings active in the plugin:

Minimum similarity: 0.75
Minimum title length: 32

Tiny Tiny RSS v20.12-e86b2e60d © 2005-2021 Andrew Dolgov

:~$ cat /proc/version 
Linux version 5.8.0-0.bpo.2-amd64 ([email protected]) (gcc-8 (Debian 8.3.0-6) 8.3.0, GNU ld (GNU Binutils for Debian) 2.31.1) #1 SMP Debian 5.8.10-1~bpo10+1 (2020-09-26)
:~$ php -v
PHP 7.3.19-1~deb10u1 (cli) (built: Jul  5 2020 06:46:45) ( NTS )

For testing, I also installed the current version in Docker; unfortunately, it shows the same problem.

What else can I try?

Thanks, Ronny

How does duplicate message detection work? Is only the title checked, only the content, or both? Could it be that detection does not work for messages that have only a title and no text content?

duplicates are detected by GUID, title has mostly nothing to do with it (check the forums or source code for details, i remember posting about it before).
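To illustrate the point about GUIDs, here is a minimal Python sketch (not tt-rss code, which is PHP). The function name `split_new_and_known` and the item dictionaries are made up for illustration; the two GUID strings and the title are copied from the log excerpt in the first post. It only shows why two entries with the same title but different GUIDs both count as separate articles when the GUID is the sole key.

```python
def split_new_and_known(items, known_guids):
    """Partition feed items by GUID only; titles are never consulted.

    Hypothetical helper for illustration, not part of tt-rss.
    """
    seen = set(known_guids)
    new, known = [], []
    for item in items:
        if item["guid"] in seen:
            known.append(item)
        else:
            seen.add(item["guid"])
            new.append(item)
    return new, known


# Two entries from the log: identical title, different GUIDs.
items = [
    {"guid": "2,https://www.tt.com/go/17703074",
     "title": "Schwerer Anschlag im Niger: Mehr als 50 Tote"},
    {"guid": "2,https://www.tt.com/go/17703052",
     "title": "Schwerer Anschlag im Niger: Mehr als 50 Tote"},
]

new, known = split_new_and_known(items, known_guids=[])
print(len(new), len(known))  # both items are treated as new articles
```

With GUID-only matching, neither entry is recognised as a repeat of the other, which matches what the feed in the first post shows.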

also i vaguely remember a gotcha with this plugin and exact matches being ignored because of feeds that have tons of articles with one exact title. i don’t have the source in front of me right now though so this might be false.

But in the “Af_Psql_Trgm” plugin I can set “Minimum similarity” and “Minimum title length”, so I assumed that a text comparison is made here?

In the news feed mentioned in my first post, many articles have no “description”. In addition, many titles are repeated very often, but the GUID is always different. What does the plugin actually compare, especially when there is no content other than the title?

Ronny

it compares the title, that’s it. it would be largely pointless to compare the summaries, the wording is usually different enough to make trigram comparisons useless.
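For readers unfamiliar with trigram comparison, here is a rough Python sketch of the idea. It approximates how PostgreSQL's pg_trgm scores two strings (lowercase, split into alphanumeric words, pad each word, compare the sets of three-character pieces); it is not the plugin's code, and the real comparison runs inside the database. The helper `looks_like_duplicate` and the way it applies the two settings from the first post (0.75 and 32) are assumptions made for illustration only.

```python
import re


def trigrams(text):
    """Approximate pg_trgm extraction: lowercase, split on non-alphanumerics,
    pad each word with two leading spaces and one trailing space."""
    grams = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def looks_like_duplicate(title, recent_titles,
                         min_similarity=0.75, min_title_length=32):
    """Hypothetical gate: skip short titles, then compare against recent ones."""
    if len(title) < min_title_length:
        return False
    return any(similarity(title, other) >= min_similarity
               for other in recent_titles)


title = "Schwerer Anschlag im Niger: Mehr als 50 Tote"
print(similarity(title, title))  # identical titles score 1.0
print(looks_like_duplicate(title, [title]))
```

Under these assumptions, the repeated title from the log is long enough to pass the length check and scores well above 0.75 against itself, so a title-based comparison should be able to catch it even though the GUIDs differ.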

as to your other questions i refer you to the source code.

e: it would be easier to debug things if you posted the feed url, also again i’m not sure on the exact comparison thing, but i don’t really want to dig into this on a weekend, so it’ll have to wait until next week.

That’s fine, just don’t stress!

Here is the URL again:

http://www.tt.com/rss/news.xml

Ronny

alright, i went through af_psql_trgm, cleaned it up a bit and fixed some bugs, it seems to work now on your feed:

[image]

(greyed out stuff was automatically marked as read)

also, i was misremembering about equal titles: it used to require special handling but this seems to be no longer the case.

changeset - https://git.tt-rss.org/fox/tt-rss/commit/6d4005f984ca472b0c4e3ef87176a1cd222cf66c

e: also, there’s some more debugging at LOG_EXTENDED level in feed debugger.

Super! Thank you! The update is installed. Hopefully I won’t have so many new messages tomorrow.

Ronny