Importing entries from other RSS readers

zeed · April 8, 2019, 2:21pm

Inoreader supports importing and exporting entries (especially starred entries) in a JSON format.
Here is an example:

{
  "crawlTimeMsec":"1516226902000",
  "timestampUsec":"1516226902000000",
  "id":"tag:google.com,2005:reader\/item\/000000035d74960b",
  "categories":[
    "user\/1006616538\/state\/com.google\/reading-list",
    "user\/1006616538\/state\/com.google\/read",
    "user\/1006616538\/state\/com.google\/starred"
  ],
  "title":" THIS IS Titile  THIS IS Titile  THIS IS Titile  THIS IS Titile  THIS IS Titile  THIS IS Titile  THIS IS Titile ",
  "published":1516210059,
  "updated":1516230449,
  "starred":1516226902,
  "canonical":[
    {
      "href":"http::\/\/www.google.com"
    }
  ],
  "alternate":[
    {
      "href":"http::\/\/www.google.com",
      "type":"text\/html"
    }
  ],
  "summary":{
    "direction":"ltr",
    "content":"THIS IS CONTENT THIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENTTHIS IS CONTENT"
  },
  "author":"",
  "likingUsers":[
    
  ],
  "comments":[
    
  ],
  "commentsNum":-1,
  "annotations":[
    
  ],
  "origin":{
    "streamId":"feed\/http:\/\/www.google.com/feed.xml",
    "title":"SOMEONE's RSS FEED",
    "htmlUrl":"http:\/\/www.google.com"
  }
},

I think it would be a cool feature for a seamless migration from other readers to TT-RSS.

Currently TT-RSS can import the list of feeds (OPML) and fetch the entries (contents) from that list. Since the remote feed might not preserve the full history, it is hard to transfer the starred entries from the old reader to TT-RSS.

It would be awesome if TT-RSS could import a list of entries (like the Inoreader JSON format), sort them into the corresponding feeds according to the URL, and maybe also mark the stars automatically.
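
A minimal sketch of the grouping step described above, assuming the export is a JSON object with an "items" list shaped like the example in this post. The function name group_by_feed and the sample data are hypothetical, and this is not how TT-RSS itself handles imports.

```python
import json
from collections import defaultdict


def group_by_feed(items):
    """Group exported entries by feed URL and flag the starred ones."""
    feeds = defaultdict(list)
    for item in items:
        stream_id = item["origin"]["streamId"]
        # Assumption: streamId has the form "feed/<feed URL>", as in the example.
        feed_url = stream_id[len("feed/"):] if stream_id.startswith("feed/") else stream_id
        # Assumption: a starred entry carries a category ending in "/starred".
        starred = any(c.endswith("/starred") for c in item.get("categories", []))
        feeds[feed_url].append({"title": item["title"].strip(), "starred": starred})
    return dict(feeds)


# Hypothetical sample data, reduced to the fields used here.
sample = json.loads("""
{"items": [
  {"title": " First ", "categories": ["user/1/state/com.google/starred"],
   "origin": {"streamId": "feed/http://example.com/feed.xml"}},
  {"title": "Second", "categories": ["user/1/state/com.google/read"],
   "origin": {"streamId": "feed/http://example.com/feed.xml"}}
]}
""")

for feed_url, entries in group_by_feed(sample["items"]).items():
    print(feed_url, [(e["title"], e["starred"]) for e in entries])
```

Keying on the feed URL rather than the feed title avoids collisions between feeds that share a title.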

zeed · April 8, 2019, 4:14pm

Sorry, this is basically what the import/export plugin already does.

But somehow the download doesn’t start on my computer, and I have to ssh into the server to get the exported XML file from the cache folder.

JustAMacUser · April 8, 2019, 6:03pm

If it’s not downloading you can check the HTTP response code in the browser and the logs on the server.

Regarding the original post, OPML is a standardized format, so it’s a good choice for TT-RSS. Anything else can be achieved through plugins.

fox · April 9, 2019, 4:06am

looks like i partially broke that plugin, thanks for reporting

should be fixed by https://git.tt-rss.org/fox/tt-rss/commit/d7282ec292a79120709e93ba9b1c73d0077d871b

zeed · April 9, 2019, 4:50am

I will try writing a script that converts the Inoreader JSON to the TT-RSS XML format.

Just in case I reinvent the wheel, is there any similar converter that exists already?

zeed · April 9, 2019, 1:29pm

I wrote a script to convert the Inoreader JSON to the TT-RSS XML format.

import hashlib
import json
from datetime import datetime, timezone

from lxml import etree

filename = "starred"


def add_child(parent, tag, text=None, cdata=False):
    # Append a child element; wrap the text in CDATA when requested.
    child = etree.SubElement(parent, tag)
    if text is not None:
        child.text = etree.CDATA(text) if cdata else text
    return child


# The Inoreader export keeps the entries in a top-level "items" list.
with open(filename + ".json", encoding="utf-8") as json_file:
    data = json.load(json_file)["items"]

articles = etree.Element("articles", attrib={"schema-version": "137"})
for item in data:
    content = item["summary"]["content"]
    # "published" is a Unix timestamp in seconds; format it as a UTC date string.
    published = datetime.fromtimestamp(item["published"], tz=timezone.utc)
    updated = published.strftime("%Y-%m-%d %H:%M:%S")

    article = etree.SubElement(articles, "article")
    # The GUID is derived from a SHA-1 hash of the content.
    guid = "SHA1:" + hashlib.sha1(content.encode("utf-8")).hexdigest()
    add_child(article, "guid", guid, cdata=True)
    add_child(article, "title", item["title"], cdata=True)
    add_child(article, "content", content, cdata=True)
    add_child(article, "marked", "1")  # 1 = starred
    add_child(article, "published", "0")
    add_child(article, "score", "0")
    add_child(article, "note")  # left empty
    add_child(article, "link", item["canonical"][0]["href"], cdata=True)
    add_child(article, "tag_cache")  # left empty
    add_child(article, "feed_title", item["origin"]["title"], cdata=True)
    # streamId looks like "feed/<url>"; drop the 5-character "feed/" prefix.
    add_child(article, "feed_url", item["origin"]["streamId"][5:], cdata=True)
    add_child(article, "updated", updated, cdata=True)

outputxml = etree.tostring(articles, pretty_print=True, encoding="utf-8")
with open(filename + ".xml", "wb") as opt:
    opt.write(outputxml)
print(outputxml.decode("utf-8"))
print(len(data))  # number of converted entries

Then I tried to upload the XML file, but ran into several problems. First, it wouldn’t upload because of the max_header_size on the PHP and Nginx side. I increased both to 120MB, and then it worked for files under 2MB.

However, when I try uploading a 10MB XML file, it shows the following error (the message is Chinese for "method not found"):

{"error":{"code":13,"message":"\u627e\u4e0d\u5230\u65b9\u6cd5(Method)"}}

Is the file too large?

fox · April 9, 2019, 1:42pm

it probably is too large for DOMDocument to load (see memory_limit in php.ini)

i suggest generating several files of smaller size

e: btw, if you have import_export enabled in config.php you can import from command line using update.php

e: also, it should be post_max_size to allow larger files to upload
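
The suggestion above to generate several smaller files could be sketched roughly as follows. This is only an illustration under assumptions: the helper name chunked, the batch size, and the output file names are hypothetical, and the batches are split by entry count rather than by file size, so a suitable count has to be found by trial.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical usage: 7 placeholder entries split into batches of 3.
entries = [{"title": f"entry {n}"} for n in range(7)]
batches = list(chunked(entries, 3))
for number, batch in enumerate(batches, start=1):
    # Each batch would be converted and written to its own file, e.g. starred-1.xml.
    print(f"starred-{number}.xml: {len(batch)} entries")
```

Each batch can then be passed through the conversion script posted earlier in place of the full "items" list, producing one XML file per batch.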

zeed · April 9, 2019, 1:54pm

Great!
Now I have imported everything from my Inoreader.

But I also found a problem during the process. The imported articles are not searchable, and the Feed Debugger trick doesn’t seem to work for them, so I had to update the index manually with SQL:

update ttrss_entries set tsvector_combined = to_tsvector(content);

fox · April 9, 2019, 2:00pm

thanks for reporting this; tsvector_index has likely been added after this plugin was initially written, i’ll make a note to update it so that the index is generated properly

fox · April 10, 2019, 10:19am

https://git.tt-rss.org/fox/tt-rss/commit/d32e191ad7e844cac1943f5352f0c4828eb71525

i didn’t test it but it should work