[SOLVED] Search in Chinese

zeed · April 8, 2019, 5:18am

Describe the problem you’re having:

Search in Chinese returns nothing with simple or English mode

tt-rss version (including git commit id):

v19.2

Platform (i.e. Linux distro, PHP, PostgreSQL, etc) versions:

PostgreSQL

Please provide any additional information below:

Is it possible to search Chinese (or maybe other languages) with the Sphinx plugin?

fox · April 8, 2019, 5:59am

looks like postgresql doesn’t support full text search for Chinese or Japanese out of the box

https://pgroonga.github.io/ - can you try this?

zeed · April 8, 2019, 8:29am

I installed a Chinese plugin, and I tried

CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION chinese_zh (PARSER = zhparser); 
ALTER TEXT SEARCH CONFIGURATION chinese_zh ADD MAPPING FOR n,v,a,i,e,l,t WITH simple;

on PostgreSQL, but it doesn’t work.

fox · April 8, 2019, 8:56am

this isn’t really related to tt-rss, you probably would have better results asking people who developed this postgresql extension

does Chinese appear in the list of search languages? if it doesn’t maybe more configuration of postgresql is needed

zeed · April 8, 2019, 9:36am

Yes, Chinese_zh appeared in the list of search languages.

Actually it works for some words, so I assume it is related to the vocabulary size of the postgresql extension.

Thank you for the suggestion!

fox · April 8, 2019, 9:46am

great. if you manage to make it work properly, post here, so that other people might find it too

zeed · April 8, 2019, 11:45am

I found a problem when debugging with the extension.

I tried

select title from ttrss_entries where to_tsvector( content) @@ plainto_tsquery('simple','底层支撑');

directly on PostgreSQL, and nothing was found. This result is as expected, since “底层支撑” does not appear as a single word in the contents.

Then I tried

SET default_text_search_config = 'Chinese_zh';
select title from ttrss_entries where to_tsvector(content) @@ plainto_tsquery('底层支撑');

and there are some results. The results are as expected, since “底层支撑” is now split into “底层” and “支撑” by the extension, and there are documents that contain these two words.
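For anyone following along, the split can be inspected directly. This is a minimal sketch that assumes the chinese_zh configuration created earlier in this thread; the actual tokens depend on the zhparser dictionary, so no output is shown here.

```sql
-- Compare how each configuration tokenizes the phrase.
SELECT to_tsvector('simple', '底层支撑')     AS simple_tokens,
       to_tsvector('chinese_zh', '底层支撑') AS zh_tokens;

-- Compare the query that plainto_tsquery builds under each configuration.
SELECT plainto_tsquery('simple', '底层支撑')     AS simple_query,
       plainto_tsquery('chinese_zh', '底层支撑') AS zh_query;

-- Show each token with the token type assigned by the parser.
SELECT alias, token, lexemes
FROM ts_debug('chinese_zh', '底层支撑');
```

If the second column shows two separate lexemes while the first shows one, the parser is splitting the phrase as described above.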

However, when I try it with the search dialog on the tt-rss side, nothing is found. I also tried selecting “Chinese_zh” in the list of search languages, but it doesn’t make any difference.

I wonder what could be the cause of the problem. Can I change the default search language settings of tt-rss?

fox · April 8, 2019, 12:13pm

relevant code which transforms search query to sql is here: https://git.tt-rss.org/fox/tt-rss/src/master/include/functions.php#L1362

you can try adding some debugging or enabling query logging in postgresql to see what exactly tt-rss generates
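One possible way to turn on query logging, as a rough sketch: it assumes superuser access and PostgreSQL 9.4 or newer (for ALTER SYSTEM). Logging every statement is noisy, so it is reset at the end.

```sql
-- Log every statement the server receives (requires superuser).
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();

-- ... run the search from the tt-rss dialog, then read the server log ...

-- Restore the default once debugging is done.
ALTER SYSTEM RESET log_statement;
SELECT pg_reload_conf();
```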

i’d say it’s either case conversions or tsvector_combined not being filled correctly for this language (you might need to set per-feed language in feed editor AND run it through feed debugger with force rehash afterwards so that the index updates)

try running searches against tsvector_combined instead of to_tsvector(content) because that’s what tt-rss does for performance reasons.

zeed · April 8, 2019, 12:21pm

Actually I changed per-feed language in feed editor already.

I tried

select title from ttrss_entries where tsvector_combined @@ plainto_tsquery('Chinese_zh','底层支撑');

and it returns nothing.

I assume that tsvector_combined is not updated and it is still using SEARCH CONFIGURATION ‘simple’

Does it require re-indexing after changing the per-feed language in the feed editor?

fox · April 8, 2019, 12:22pm

yes, index is updated when articles are processed, so unless you run the feed through feed debugger it will only apply to articles which were added afterwards

you can easily rebuild the index through postgresql console btw

zeed · April 8, 2019, 12:24pm

How can I run the feed through feed debugger?

Or how can I rebuild the index? I tried REINDEX, but it doesn’t seem to work

fox · April 8, 2019, 12:27pm

hotkey f D on the feed

something like this:

update ttrss_entries set tsvector_combined = to_tsvector('Chinese_zh', content);

this will update everything for Chinese_zh, you’ll need to limit the query for specific feeds if you want to

i suggest going through feed debugger instead, because there’s a possibility of other minor issues if you try to create tsvector index from complete articles (it’s length-limited so you may have some errors, etc)
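if you do go the SQL route, a rough sketch of limiting the update to one feed and capping the input length could look like this. assumptions, not checked against the current schema: ttrss_user_entries links articles (ref_id) to feeds (feed_id), 123 is a placeholder feed id, and 900000 is an illustrative cap rather than a value taken from this thread.

```sql
-- Rebuild the search vector only for articles belonging to one feed.
-- left() truncates the text so very long articles are less likely to
-- hit the tsvector size limit.
UPDATE ttrss_entries
SET tsvector_combined = to_tsvector('chinese_zh', left(content, 900000))
WHERE id IN (
    SELECT ref_id
    FROM ttrss_user_entries
    WHERE feed_id = 123  -- placeholder: replace with the real feed id
);
```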

zeed · April 8, 2019, 12:37pm

I tried the first one, but there was no visual feedback.
So I tried updating the tsvector directly, and it worked.

Now the search results are correct for all the Chinese words that I am testing with.

Thank you very much!!

fox · April 8, 2019, 12:50pm

since we’re undergoing hotkey-related troubles, i’ll add feed debugger to the context menu so it would be easier to trigger:

zeed · April 8, 2019, 2:22pm

That would be great!

zeed · April 9, 2019, 1:59pm

Is it possible to add a configuration for the default searching language?

For now I have to hard-code the language so that I don’t have to select Chinese every time I search.

https://git.tt-rss.org/fox/tt-rss/src/master/include/functions.php#L1374

fox · April 9, 2019, 3:04pm

yeah this definitely needs a global option

fox · April 10, 2019, 10:10am

i’ve added an option to set default stemming language: https://git.tt-rss.org/fox/tt-rss/commit/6768b3a4a3261b32c552e1acf1c471cd39b04a8a and following changesets