Logic of dupe elimination


#1

I’ve recently installed the trigram extension to benefit from the feature ‘Mark similar articles as read’. Useful stuff. One more reason to prefer PostgreSQL to MySQL.

Now I’m trying to understand how the feature works. Does it compare articles within one feed or across feeds? If it finds a duplicate pair, which article gets marked read?


#2

as far as i remember it works on the entire article database but only acts on selected feeds.

the idea here being that if you’re subscribed to three feeds about duck-related news and feed 1 is an authoritative source you’re most interested in, you keep the plugin enabled for the other two feeds only, so that potential reposts sort themselves out and you get your major duck news source uninterrupted.

however, if none of the feeds is primary, so to speak, you can just enable it for all of them, and then whichever article arrived first is going to be the reference one. it also marks duplicates within the currently processed feed itself.
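For anyone who prefers code to prose, the behaviour described above can be sketched roughly like this. It is only an illustration of the description in this post, not the plugin's actual code: `Article`, `similar`, `process`, the feed names and the 0.75 threshold are all made up, and `difflib` stands in for the database's trigram comparison.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Article:
    feed: str
    title: str
    read: bool = False


def similar(a: str, b: str, threshold: float = 0.75) -> bool:
    # Stand-in for the trigram comparison; the threshold is an assumption.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def process(incoming, database, enabled_feeds):
    """Handle articles in arrival order, marking likely reposts as read."""
    for article in incoming:
        # Only feeds with the plugin enabled are ever acted on...
        if article.feed in enabled_feeds:
            # ...but the comparison runs against every stored article,
            # including earlier ones from the same feed.
            if any(similar(article.title, old.title) for old in database):
                article.read = True
        database.append(article)


db = []
process(
    [
        Article("duck-times", "Ducks return to city pond"),      # primary feed
        Article("duck-echo", "Ducks return to the city pond"),   # repost
        Article("duck-echo", "Goose spotted downtown"),          # new story
    ],
    db,
    enabled_feeds={"duck-echo"},
)
print([(a.feed, a.read) for a in db])
```

Here only the second article ends up marked read: the primary feed is never touched, and the earlier article serves as the reference. With the plugin enabled on every feed, the same loop simply keeps whichever copy was stored first.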

tbh it doesn’t work all that well because news sources really like to change titles a lot while copypasting articles off each other and bare n-gram matching is a poor substitute for intelligent keyword extraction / actual text analysis of some kind.
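To make the title problem concrete, here is a minimal sketch of trigram similarity in Python. It approximates what PostgreSQL's trigram extension does (lowercase, split into words, pad each word, compare the sets of three-character chunks), but it is not the extension's code, and the function names and example titles are invented for illustration.

```python
import re


def trigrams(text: str) -> set[str]:
    """Collect the three-character chunks of every word in the text."""
    grams = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        # Each word is padded with two spaces in front and one behind.
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


original = "Ducks return to city pond"
repost = "Ducks return to the city pond"
reworded = "Waterfowl are back at the pond in town"

print(similarity(original, repost))
print(similarity(original, reworded))
```

The near-verbatim repost scores close to 0.9, while the reworded headline about the same event stays well under 0.3, so a purely character-based comparison would treat it as a different story.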

unfortunately making the latter is way above my pay grade as an uneducated hack of a programmer so :shrug: etc.

i also suggest you check out the perceptual hash plugin if you are subscribed to a lot of image-heavy feeds, that one actually works really well.


#3

Thanks, Andrew, for the detailed explanation. Much appreciated. To sum up:

  • The plugin analyses all subscribed feeds.
  • Leave it off for the best feed, turn it on for the second-best feeds that are likely to echo the best feed.

You say the plugin could be better. Well, it’s smart enough to detect similar titles. For example, today’s headlines are full of ‘Putin meet Trump in Hamburg’. Those get marked read for me.


#4

Just to illustrate fox’s words to anyone interested:

In my case, the original news item by Associated Press Sports is republished by seven newspapers. I have the plugin enabled for all feeds, and it decided to spare Daily Mail and kill USA Today, US News & World Report et al. I don’t mind it picking Daily Mail as ‘authoritative’. :smile: