Elimination of duplicates

eLeXeM · November 28, 2018, 2:43pm

After having read Filtering duplicates + Logic of dupe elimination - #2 by fox, I’m wondering whether duplicate elimination has progressed in any way that is workable for users not on PostgreSQL. Perhaps there is a plugin or something? I’m not thinking of anything super sophisticated, just a simple URL check: if a URL already exists in the DB, just add any found tags / categories to the URL already stored instead of creating the same item again with differing taxonomy.

(I’m sadly not proficient enough in programming, so I cannot really assess how hard it would be to go beyond that. From the utter user pov I’m thinking “the more characteristics match, the likelier an entry should be treated as a duplicate”? (url, guid, timestamp, title?))

If that’s already covered somehow somewhere, I’d be grateful for a related pointer;
much appreciated; cheers - LX

mamil · November 28, 2018, 8:40pm

But those URLs will be different for different feeds, won’t they?

From the utter user pov I’m thinking “the more characteristics match, the likelier an entry should be treated as a duplicate”? (url, guid, timestamp, title?)

No, no. The plugin compares only titles of articles. Beg fox to write a plugin around MySQL’s ngram parser.
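For readers wondering what comparing titles by n-grams means in practice, here is a rough Python sketch. It is not the plugin's actual code and does not use MySQL's ngram parser; the function names `trigrams` and `title_similarity` are made up for illustration, and Jaccard overlap is just one plausible way to score similarity.

```python
def trigrams(text):
    """Return the set of 3-character substrings of a normalized title."""
    # Lowercase and collapse whitespace so trivial differences do not matter.
    normalized = " ".join(text.lower().split())
    # Pad so that the start and end of the title also form trigrams.
    padded = f"  {normalized} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def title_similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Two outlets reporting the same event with similar wording.
    print(title_similarity("Fox releases new dedup plugin",
                           "Fox releases a new dedup plugin"))
    # Unrelated titles should score much lower.
    print(title_similarity("Fox releases new dedup plugin",
                           "Weather forecast for the weekend"))
```

Any cutoff above which two titles count as "the same article" would have to be picked by trial and error; nothing in this thread specifies one.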

JustAMacUser · November 28, 2018, 9:31pm

fox loves MySQL. True story.

fox · November 29, 2018, 6:53am

that works on any database out of the box, only replace URL with the article identifier (as specified by the feed: it could be the URL, it could be something else).

you really only need heuristic tools if the feed is broken (that is, same article could be published under different GUIDs) or you’re dealing with multiple feeds reporting on the same event with similar wording (that’s the n-gram plugin).
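To illustrate the identifier-based idea in isolation, here is a minimal Python sketch. It is not how tt-rss implements it; the dictionary layout (`guid`, `title`, `tags`) and the function name `merge_by_identifier` are assumptions made for this example only. Entries sharing the same feed-supplied identifier are collapsed into one, and their tags are combined.

```python
def merge_by_identifier(articles):
    """Keep one entry per identifier and merge the tags of any repeats.

    Each article is a dict with hypothetical keys "guid", "title", "tags".
    """
    merged = {}
    for article in articles:
        key = article["guid"]
        if key in merged:
            # Same identifier seen before: only add the new tags.
            merged[key]["tags"] |= set(article["tags"])
        else:
            # First time this identifier shows up: store a copy.
            merged[key] = {**article, "tags": set(article["tags"])}
    return list(merged.values())


if __name__ == "__main__":
    incoming = [
        {"guid": "https://example.com/a", "title": "A", "tags": ["news"]},
        {"guid": "https://example.com/b", "title": "B", "tags": ["tech"]},
        {"guid": "https://example.com/a", "title": "A", "tags": ["linux"]},
    ]
    for entry in merge_by_identifier(incoming):
        print(entry["guid"], sorted(entry["tags"]))
```

This only helps when the feed reports a stable identifier; the broken-feed and multiple-feed cases described above still need a heuristic such as title comparison.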

eLeXeM · November 29, 2018, 11:16am

thank you, guys.
Now, for me as an utter non-DB-pro:

is there any chance I could get that to work on my own? And if so, how might I do that? Cheers -