Tiny Tiny RSS: Community

[SOLVED] Search plugin of tt-rss

#1

Describe the problem you’re having:
The built-in full-text search in PostgreSQL doesn’t offer ranking algorithms such as BM25, which Sphinx and Elasticsearch provide. The results are hard to browse when many articles match the query, so I am thinking about setting up Sphinx for tt-rss.

I wonder whether the Sphinx plugin builds the Sphinx index automatically, or should I import the data from PostgreSQL into Sphinx manually?

tt-rss version (including git commit id):

19.2

Platform (i.e. Linux distro, PHP, PostgreSQL, etc) versions:

PostgreSQL

#2

wiki -> sphinx search

#3

I tried Sphinx, but it seems to support only English and Russian, and there are few plugins for CJK languages.

However, there are many Lucene-based solutions around, such as Elasticsearch, that have great CJK language support. I wonder if it would be hard to make a new plugin for Elasticsearch.

Do I only need to fetch the ref_ids?

#4

Here is sample code for the plugin that I wrote.

<?php
class Search_External extends Plugin {
        function about() {
                return array(2.0,
                        "Delegate searching for articles to any external search program",
                        "zeed");
        }

        function init($host) {
                $host->add_hook($host::HOOK_SEARCH, $this);
        }

        function hook_search($search) {
                // Fetch $ids_search (article ids, best match first) from the
                // external search program, e.g. elasticsearch + logstash.
                // Placeholder: left empty here, so no article matches.
                $ids_search = array();
                // Cast to integers before interpolating the ids into SQL.
                $ids_search = array_values(array_map("intval", $ids_search));

                // Defaults when nothing matched: a condition that is never true.
                $override_order = false;
                $search_query_part = "ttrss_entries.id = -1";
                $query_join_score_part = "";

                if (count($ids_search) > 0) {
                        // Build (id, rank) pairs; rank is the position in the result list.
                        $values = array();
                        foreach ($ids_search as $rank => $id) {
                                $values[] = "(" . $id . "," . $rank . ")";
                        }
                        $query_join_score_part = " LEFT JOIN (VALUES " . join(",", $values) .
                                ") AS temp_table(id2, score2) ON (temp_table.id2 = ttrss_entries.id) ";
                        $override_order = " temp_table.score2 ";
                        $search_query_part = "ttrss_entries.id IN (" . join(",", $ids_search) . ")";
                }

                // Four elements, matching the modified list() call in feeds.php (see the EDIT in #12).
                return array($search_query_part, $query_join_score_part, $override_order, array($search));
        }

        function api_version() {
                return 2;
        }
}
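
The plugin leaves open how `$ids_search` gets filled. Below is a minimal sketch of one way to do it, assuming the Elasticsearch documents were indexed with `_id` set to the tt-rss article id. The helper name `extract_ids_from_es_response` is hypothetical and not part of tt-rss. It only parses the JSON body of a search response; sending the HTTP request is left out.

```php
<?php
// Hypothetical helper, not part of tt-rss or of the plugin above.
// Assumption: each Elasticsearch document's _id is the tt-rss article id.
function extract_ids_from_es_response(string $json): array {
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data["hits"]["hits"]) || !is_array($data["hits"]["hits"])) {
        return array();
    }

    $ids = array();
    foreach ($data["hits"]["hits"] as $hit) {
        // Keep only ids made of digits, so nothing else can reach the SQL text.
        if (isset($hit["_id"]) && is_scalar($hit["_id"])
                && preg_match('/^\d+\z/', (string) $hit["_id"]) === 1) {
            $ids[] = (int) $hit["_id"];
        }
    }
    // The hits keep the order in which the search engine returned them.
    return $ids;
}

// Example response body, trimmed to the fields used here.
$sample = '{"hits":{"hits":[{"_id":"42","_score":3.1},{"_id":"7","_score":1.4}]}}';
$ids_search = extract_ids_from_es_response($sample);
```

Elasticsearch sorts hits by `_score` in descending order by default, so the position in the returned array can serve as the rank.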

However, the nicely sorted results from Elasticsearch are sorted again by tt-rss using a non-BM25 method, since “ttrss_entries.id IN ($ids)” doesn’t carry the ordering information through to the later ORDER BY clause.

Is there any way to keep the order of items fetched from an external search program for the following SQL queries?

#5

Actually, Sphinx and Elasticsearch both support returning a score that represents how relevant a result is to the query string.

Is there any way to use this score in the existing feeds.php?
For example, can we add a return value to hook_search for injecting GROUP BY SCORE parts into the end of the SQL query in feeds.php?

Currently it only injects somewhere after WHERE.

#6

no, that’s definitely not going to happen
maybe there’s a better way although i’m not sure what it could be

#7

I also think letting plugins write SQL queries might cause a lot of problems in the future.

I saw there is an order by score, but it uses the “score” column, without considering the query string. Do you think it would be good to add another score, score_by_querystring, using an array of scores returned from the hook? I think Sphinx, Elasticsearch, and any other external search method could provide this score easily.

#8

It would also be great if tt-rss could stop updating the tsvector when an external search plugin is being used.
By setting tsvectors to null I can reduce my DB size by half.

Can we simply add this to every command that updates the tsvector?
if (strpos(PLUGINS, "search_") === FALSE) {
or maybe
if (count(PluginHost::getInstance()->get_hooks(PluginHost::HOOK_SEARCH)) == 0)

#9
  1. database size is a non-issue. storage is extremely cheap.
  2. it’s a very obtuse change with consequences which would not be immediately obvious:
    • what if plugin provides HOOK_SEARCH but still wants to base on tsvector?
    • it wouldn’t be obvious that search index would need to be regenerated when search plugin is disabled
    • i could probably invent more examples but it’s pointless because of 1.

i also can’t help but notice that you’re pushing for very specific changes that only benefit your, very particular, setup. it’s not necessarily a bad thing but let’s not forget about many other people who are also using tt-rss. i’m not going to implement questionable hacks at their expense.

#10

I only tried them on my side, and I posted my hacks here because I thought they might be useful for someone who is also using the Sphinx plugin. I didn’t realize there could be so many bad consequences, though.

#11

i dunno, is it possible to ORDER BY (array of ids)? i’m not sure how that should work tbh.
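
For reference, PostgreSQL 9.5 and later provide `array_position()`, which returns the 1-based index of a value in an array and so can be used to order rows by a list of ids. Here is a minimal sketch of building such an expression; the helper name `order_by_id_list` is hypothetical, not tt-rss code, and whether the expression fits into the query that feeds.php builds has not been checked.

```php
<?php
// Hypothetical helper; the name and default column are assumptions, not tt-rss code.
function order_by_id_list(array $ids, string $column = "ttrss_entries.id"): string {
    // Cast every id to an integer so only numbers reach the SQL text.
    $ids = array_values(array_map("intval", $ids));
    if (count($ids) == 0) {
        return "";
    }
    // array_position() gives the 1-based index of the column value in the
    // array, so ascending order reproduces the order of $ids.
    return "array_position(ARRAY[" . join(",", $ids) . "]::integer[], " . $column . ")";
}

echo order_by_id_list(array(42, 7, 19));
// prints: array_position(ARRAY[42,7,19]::integer[], ttrss_entries.id)
```

The returned string would go after ORDER BY. Rows whose id is not in the list get NULL, which PostgreSQL sorts last in ascending order.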

#12

I wrote a dirty hack that writes each article’s score into the DB on every search, using the score returned by the Sphinx query. Then the articles got ordered by “score DESC”. I didn’t share it because it is apparently a terrible solution.

I also found a way to set a temporary ordering using a join, but I haven’t tried it yet. The id and score can be fetched from the plugin and then supplied via the SQL query.
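
A minimal sketch of that join idea, assuming the plugin already has a map of article id to relevance score. The helper name `build_score_join` is hypothetical; the alias names `temp_table`, `id2` and `score2` follow the plugin code in #4.

```php
<?php
// Hypothetical helper; not tt-rss code.
// $scores maps article id => relevance score from the external search engine.
function build_score_join(array $scores): string {
    $values = array();
    foreach ($scores as $id => $score) {
        $score = (float) $score;
        if (!is_finite($score)) {
            continue; // skip INF/NAN, which are not valid numeric literals here
        }
        // intval() and %F (locale-independent float) keep the SQL text numeric.
        $values[] = "(" . intval($id) . "," . sprintf("%F", $score) . ")";
    }
    if (count($values) == 0) {
        return "";
    }
    return " LEFT JOIN (VALUES " . join(",", $values) .
        ") AS temp_table(id2, score2) ON (temp_table.id2 = ttrss_entries.id) ";
}

echo build_score_join(array(42 => 3.1, 7 => 1.4));
```

With real scores in `score2`, the ordering would be “temp_table.score2 DESC”, whereas the rank-based variant in #4 orders ascending.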

EDIT: I added

	$query_join_score_part = "";
	if ($search) {
		$search_query_part = "";

		foreach (PluginHost::getInstance()->get_hooks(PluginHost::HOOK_SEARCH) as $plugin) {
			list($search_query_part, $query_join_score_part, $override_order, $search_words) = $plugin->hook_search($search);
			break;
		}

and also changed the join to

		if (!$allow_archived) {
			$from_qpart = "${ext_tables_part}ttrss_feeds,ttrss_entries LEFT JOIN ttrss_user_entries ON (ref_id = ttrss_entries.id)";
			$feed_check_qpart = "ttrss_user_entries.feed_id = ttrss_feeds.id AND";

		} else {
			$from_qpart = "${ext_tables_part}ttrss_entries LEFT JOIN ttrss_user_entries ON (ref_id = ttrss_entries.id)
					LEFT JOIN ttrss_feeds ON (feed_id = ttrss_feeds.id)";
		}
		$from_qpart .= $query_join_score_part;

#13

yeah i thought about that too and yes it’s a pretty terrible solution :slight_smile:

#14

I tried the second way and it worked nicely. :slight_smile:

#15

catchup_feed seems to be using search_to_sql without the hook. Could that be a problem?

#16

yeah, it definitely should invoke the search hook. thanks for noticing.

#17

i guess this would work but it’s definitely too convoluted to be included in trunk. queryfeedheadlines() is already way too complicated for its own good, i would prefer to not add even more code i would struggle to understand a year later.

#18

I agree.
BTW, what is function “catchup_feed” for?

#19

it’s actions -> mark as read.

#20

i haven’t tested it but it should work.