Associated Press (AP) News plugin

I don’t know about bypassing dupe checks. When a plugin generates a feed, one of the elements is a guid (globally unique ID), and tt-rss won’t store the same guid twice (they are unique, after all). So unless the guid a plugin generates is totally random, there should be no way (or reason) to bypass dupe checking.

What my plugin does (and probably the Apnews one too) is use the article link as the guid. So it should not create duplicate articles. But take this link for example:

  • https://apnews.com/article/hundreds-claim-abuse-by-youth-center-staff-c7d7e348269a1c80fa3c902818df0399

If that postfix number changes then the link would again appear to be unique. So when apnews updates an article to “keep it fresh”, does the link remain the same or change?
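To make the concern concrete, here is a minimal sketch of guid-keyed dedup, assuming the guid is simply the article link. The `store_article` helper and the second URL's suffix are made up for illustration; this is not tt-rss's actual code.

```python
# Hypothetical illustration: dedup keyed on guid, where guid == article link.
seen_guids = set()


def store_article(link):
    """Return True if stored as a new article, False if treated as a duplicate."""
    guid = link  # assumption: the plugin uses the link as the guid
    if guid in seen_guids:
        return False
    seen_guids.add(guid)
    return True


original = (
    "https://apnews.com/article/"
    "hundreds-claim-abuse-by-youth-center-staff-c7d7e348269a1c80fa3c902818df0399"
)
# Made-up suffix standing in for a "freshened" repost of the same story.
reposted = (
    "https://apnews.com/article/"
    "hundreds-claim-abuse-by-youth-center-staff-00000000000000000000000000000000"
)

first_seen = store_article(original)    # new guid, stored
seen_again = store_article(original)    # same guid, skipped
after_repost = store_article(reposted)  # different guid, stored as a new article
```

If the suffix stays the same across updates, the second case applies and no duplicate appears; if it changes, the third case applies.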

As for the ban warning, let’s say you are loading apnews into tt-rss for the first time; that will generate 1 curl fetch for the news page and 1 curl fetch for each article you want to “slurp” in. If there are 50 articles, that is at least 51 rapid-fire hits to the webserver (more if you cache media). That might be enough to get you noticed as a web scraper and banned. After that you should only be loading the deltas (new articles), and the traffic will be much lower. YMMV.
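The arithmetic, as a rough sketch. The `fetch_count` helper, the `media_per_article` parameter, and the figure of 3 new articles on a later poll are assumptions for illustration only.

```python
def fetch_count(new_articles, media_per_article=0):
    """Requests for one update: 1 for the news page, plus 1 per new article,
    plus any cached media per article (hypothetical parameter)."""
    return 1 + new_articles * (1 + media_per_article)


first_load = fetch_count(50)  # index page + 50 articles
later_poll = fetch_count(3)   # assumed: only 3 new articles since last poll
```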

This plugin does a single request per feed (e.g. World News: Top & Breaking World News Today | AP News) at the frequency determined by the feed’s update interval. There are no follow-up requests for individual articles, images, etc. AP News doesn’t appear to support If-Modified-Since or If-None-Match for this content.
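For reference, a conditional request looks roughly like the sketch below. The URL and validator values are placeholders, and `build_conditional_request` is a hypothetical helper. A server that supports these headers can answer 304 Not Modified with no body; one that ignores them just returns the full page every time, which is what AP News appears to do.

```python
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying validators saved from a previous response."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


# Placeholder URL and validators; nothing is sent until urlopen() is called.
req = build_conditional_request(
    "https://example.com/feed",
    etag='"abc123"',
    last_modified="Wed, 01 Jan 2020 00:00:00 GMT",
)
sent_headers = {name.lower(): value for name, value in req.header_items()}
```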

As far as article duplicate checking goes, this plugin does play nicely with tt-rss’s duplicate checking (debug a feed and look for “stored article seems up to date”). There’s not much that can be done if AP News reposts something as a new, slightly modified article, though.

the wiki page might be a bit too dramatic but i think something like this did happen, at least once.

normally a plugin can’t modify the GUID and won’t be invoked at all if the article checksum hasn’t changed, which enforces at least some kind of rate-limiting (both for hitting origin servers and generating cpu load on your server), but plugins that operate in the context of the entire feed (not individual articles) are above those restrictions. therefore, the warning. it doesn’t mean the plugin is necessarily bad or anything.
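A rough sketch of the checksum gate described above, for illustration only. The hash algorithm, the fields hashed, and the function names are assumptions, not what tt-rss actually does.

```python
import hashlib

stored_checksums = {}  # guid -> checksum from the last update (in-memory stand-in)


def article_checksum(title, content):
    """Assumed checksum: SHA-1 over title and content."""
    return hashlib.sha1((title + "\n" + content).encode("utf-8")).hexdigest()


def should_run_article_plugins(guid, title, content):
    """Skip per-article plugin work when the stored checksum is unchanged."""
    checksum = article_checksum(title, content)
    if stored_checksums.get(guid) == checksum:
        return False  # unchanged article: no plugin call, no extra fetches
    stored_checksums[guid] = checksum
    return True


new_article = should_run_article_plugins("guid-1", "Headline", "Body")
unchanged = should_run_article_plugins("guid-1", "Headline", "Body")
edited = should_run_article_plugins("guid-1", "Headline", "Body, updated")
```

A plugin that works on the whole feed sits outside a gate like this, so any limiting has to come from the plugin itself or from the feed's update interval.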

Thanks for your quick replies. For AP articles that are ‘freshened’, I can’t say if the postfix number changes without some tedious manual tracking. But even with true RSS feeds, articles sometimes appear in tt-rss after I’ve already seen them. I can live with that.

The rate-limiting discussion is interesting, but duplicate suppression is the main value that tt-rss provides, so at this point I’m more interested in understanding what Fox’s wiki page says: that plugins like apnews that modify a non-RSS feed for processing by tt-rss necessarily bypass the duplicate checking process.

Thanks fox, good info.

ginahoy, this brings up an interesting idea; I could add an option to use the article headline (anchor text) as the guid. As long as the article text remained the same, it would not matter what the “serial number” postfix is. That would be a quick patch to the xslt code. Not sure how quickly I can get to it though and you stated that you can’t install plugins. : /
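The idea, sketched in Python rather than XSLT just to show the logic. `guid_from_headline` and the sample headlines are hypothetical, and the normalization (collapsing whitespace, lowercasing) is an assumption on my part.

```python
import hashlib


def guid_from_headline(headline):
    """Derive a guid from the anchor text instead of the link."""
    normalized = " ".join(headline.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Same headline, so the guid matches regardless of the link's suffix.
guid_a = guid_from_headline("Hundreds claim abuse by youth center staff")
guid_b = guid_from_headline("Hundreds  claim abuse by Youth Center staff ")
# An edited headline produces a different guid.
guid_c = guid_from_headline("Hundreds more claim abuse by youth center staff")
```

The trade-off: an edited headline would show up as a new article, and two different articles with identical headlines would collapse into one.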