Tsvector issue with strip_tags

After debugging some recent problems where I couldn't find matches via tiny's full-text to_tsvector field, I might have come across an issue. The article below was fetched with a January TT master version; I have updated since, but I believe the current code would produce the same result.

This is what readability fetched (though I guess it could be the same in a normal fetch):

<div id="ncontent" readability="37.093559617058">
            <img itemprop="image" src="http://www.jeuxactu.com/datas/evenements/g/a/gamescom-2018/vn/gamescom-2018-5b3c95a52a379.jpg" class="fleft visu_news_show autovn" alt="Gamescom 2018 : Bandai Namco dévoile tous ses jeux jouables sur le salon, la liste ici"/><div class="medium_txt" itemprop="articleBody" readability="22.537859007833">Comme chaque année, la Gamescom se tiendra à Cologne et sera le théâtre du mois d'août des jeux vidéo. Animations, annonces, jeux exclusifs : l'événement allemand tient toujours sa forte position dans le cadre vidéoludique et c'est aujourd'hui Bandai Namco qui fait une déclaration concernant l'ensemble de ses titres présents sur les stands. Ce n'est donc pas moins de sept produits qui feront leur apparition avec, notamment, certaines aventures très attendues annoncées à l'E3 2018 comme Jump Force, un jeu de combat mêlant tous les héros d'animes et mangas. On regarde ça tout de suite. <br/><ul><li>Jump Force </li><li>SoulCalibur VI</li><li>Dragon Ball FighterZ (Switch)</li><li>One Piece World Seeker</li><li>Code Vein</li><li>Ace Combat 7 : Skies Unknown</li><li>My Hero One's Justice</li><li>Naruto To Boruto : Shinobi Striker</li></ul>À savoir que nous avons justement pu récupérer <a href="http://www.jeuxactu.com/l-emission-jeuxactu-05-vacances-et-bouees-avec-jhon-rachid-114476.htm" class="news" target="_blank">la version jouable de SoulCalibur VI lors de notre dernière émission</a> : l'occasion idéale d'y assister longuement en avant-première ! Enfin, on rappelle que la Gamescom se tiendra du 21 au 25 août prochain et accueillera d'autres perles en exclusivité, dont Devil May Cry 5 qui y sera jouable pour la première fois.</div>          
          <br class="clear"/></div>

This produced a tsvector like this:

‘2018’:2,96 ‘21’:180 ‘25’:182 ‘5’:196 ‘7’:131 ‘accueillera’:186 ‘allemand’:42 ‘animations’:36 ‘animes’:109 ‘annonces’:37 ‘annoncées’:92 ‘année’:18 ‘août’:32,183 ‘apparition’:85 ‘assister’:165 ‘attendues’:91 ‘au’:181 ‘aujourd’:55 ‘autres’:188 ‘avant’:169 ‘avant-première’:168 ‘avec’:86 ‘aventures’:89 ‘avons’:145 ‘ball’:122 ‘bandai’:3,57 ‘boruto’:139 ‘c’:53 ‘cadre’:50 ‘ce’:73 ‘certaines’:88 ‘chaque’:17 ‘cologne’:24 ‘combat’:103,130 ‘comme’:16,97 ‘concernant’:63 ‘cry’:195 ‘d’:31,108,163,187 ‘dans’:48 ‘de’:66,79,102,116,152,156 ‘dernière’:158 ‘des’:33 ‘devil’:193 ‘donc’:76 ‘dont’:192 ‘du’:29,179 ‘déclaration’:62 ‘dévoile’:5 ‘e3’:95 ‘en’:167,190 ‘enfin’:171 ‘ensemble’:65 ‘est’:54,75 ‘et’:25,52,110,185 ‘exclusifs’:39 ‘exclusivité’:191 ‘fait’:60 ‘feront’:83 ‘fighterz’:123 ‘fois’:204 ‘force’:99,119 ‘forte’:46 ‘gamescom’:1,20,176 ‘hero’:134 ‘hui’:56 ‘héros’:107 ‘ici’:15 ‘idéale’:162 ‘jeu’:101 ‘jeux’:8,34,38 ‘jouable’:151,200 ‘jouables’:9 ‘jump’:98,118 ‘justement’:146 ‘justicenaruto’:137 ‘l’:40,64,94,160 ‘la’:13,19,149,175,202 ‘le’:11,27,49 ‘les’:71,106 ‘leur’:84 ‘liste’:14 ‘longuement’:166 ‘lors’:155 ‘mangas’:111 ‘may’:194 ‘moins’:78 ‘mois’:30 ‘mêlant’:104 ‘n’:74 ‘namco’:4,58 ‘notamment’:87 ‘notre’:157 ‘nous’:144 ‘occasion’:161 ‘on’:112,172 ‘one’:125,135 ‘pas’:77 ‘perles’:189 ‘piece’:126 ‘position’:47 ‘pour’:201 ‘première’:170,203 ‘prochain’:184 ‘produits’:81 ‘présents’:69 ‘pu’:147 ‘que’:143,174 ‘qui’:59,82,197 ‘rappelle’:173 ‘regarde’:113 ‘récupérer’:148 ‘s’:136 ‘sa’:45 ‘salon’:12 ‘savoir’:142 ‘se’:21,177 ‘seekercode’:128 ‘sept’:80 ‘sera’:26,199 ‘ses’:7,67 ‘shinobi’:140 ‘skies’:132 ‘soulcalibur’:120,153 ‘stands’:72 ‘strikerà’:141 ‘suite’:117 ‘sur’:10,70 ‘switch’:124 ‘théâtre’:28 ‘tiendra’:22,178 ‘tient’:43 ‘titres’:68 ‘to’:138 ‘toujours’:44 ‘tous’:6,105 ‘tout’:115 ‘très’:90 ‘un’:100 ‘une’:61 ‘unknownmy’:133 ‘veinace’:129 ‘version’:150 ‘vi’:154 ‘vidragon’:121 ‘vidéo’:35 ‘vidéoludique’:51 ‘world’:127 ‘y’:164,198 ‘à’:23,93 ‘ça’:114 ‘émission’:159 ‘événement’:41

When I searched for “code vein” I could not get a result, which seems to be because the <li> tags get stripped, thereby combining the last word of one <li> with the first word of the next bullet line, in this case creating:

‘seekercode’:128

Now, I search tiny from an external script, but at a glance this seems to be an issue with the way the field is generated, unless I’m mistaken, which happens quite a lot when it comes to coding :slight_smile:
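For illustration, a check along these lines shows why the search comes up empty (hypothetical DSN and credentials; the query just feeds the merged text to Postgres directly):

<?php
        // hypothetical connection details, purely for illustration
        $pdo = new PDO('pgsql:host=localhost;dbname=ttrss', 'ttrss', 'secret');

        // 'SeekerCode' is indexed as the single lexeme 'seekercode', so a
        // query for "code vein" ('code' & 'vein') can never match it
        $sql = "SELECT to_tsvector('World SeekerCode Vein') @@ plainto_tsquery('code vein')";
        var_dump($pdo->query($sql)->fetchColumn()); // false -- no match
?>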

One place in the code is rssutils.php (there’s likely one more place with similar code as well, I guess):
$params[":ts_content"] = mb_substr(strip_tags($entry_title . " " . $entry_content), 0, 900000);

Any thoughts?

http://php.net/manual/en/function.strip-tags.php

read the warnings in red

e: strip_tags() is secure in that it removes the tags but since it doesn’t use DOM it can mangle content, the alternative is parsing everything again and extracting text from DOM nodes which would be orders of magnitude slower, all for the sake of tsvector.

e2: to be fair aforementioned orders of magnitude would probably be literal milliseconds added for each article processed, still though is it really necessary :thinking:

I’m sorry, but I don’t see the relation here… unless you feel that this is bad RSS markup and therefore something that TT should not be required to correct - which I’d disagree with in this case (TT chooses to use strip_tags here, which I think was a regex in earlier versions, though likely producing a similar result).

This is valid markup, not partial or broken tags, unless I am missing something:

Turning valid RSS content with properly closed tags:
<ul><li>One Piece World Seeker</li><li>Code Vein</li></ul>

into this index can’t be right, no? “Seeker” and “Code” lack a space and so aren’t treated as separate words.
'one':1 'piece':2 'seekercode':4 'vein':5 'world':3

It’s clearly being broken by how it’s processed (not by partial or unclosed tags or anything like that… so creating the tsvectors would need an additional fix before the tags are stripped and the DB does its thing).

Update: If you feel it’s not on TT to fix, I’ll have to add something to my own code to ensure there’s a space there before TT creates the full-text index. Otherwise, with the current implementation, any RSS content with bullet points risks having the text of certain points not be full-text searchable afterwards, as far as I can understand at least.

:man_scientist:

naboo:~:$ cat ./test.php
<?php
        echo strip_tags('<ul><li>One Piece World Seeker</li><li>Code Vein</li></ul>') . "\n";
?>

naboo:~:$ php ./test.php
One Piece World SeekerCode Vein

tt-rss uses strip_tags() because it’s fast, not because it’s always correct. in fact it’s impossible for this function to be correct all the time.

like i said above, i’m not sure if going through html parsing again at that point is worth it to ensure correctness of the search index. i’ll think about it.

edit: in this particular case not even full HTML parsing via DOM would be any better

naboo:~:$ cat ./test.php
<?php
        $html = '<ul><li>One Piece World Seeker</li><li>Code Vein</li></ul>';

        $doc = new DOMDocument();
        $doc->loadHTML($html);

        print $doc->textContent . "\n";
?>
naboo:~:$ php ./test.php
One Piece World SeekerCode Vein

if only there was a space between those </li><li> tags

Thanks, yes there’s always a trade-off on speed I guess… I’m not sure whether this happens with the ending and starting words of all tags, or if <li> is a prime candidate for causing this. If one were to exclude <li> in strip_tags:

$params[":ts_content"] = mb_substr(strip_tags($entry_title . " " . $entry_content, '<li>'), 0, 900000);

would that cause any real issues or drawbacks? At least the tsvector result would be fixed, though from what I can see strip_tags in theory does the same for <p>, <span>, <div> etc., so that’s maybe far from a 100% solution, unless other line breaks or whitespace in the markup usually make those less prone to this.

'code':5 'one':1 'piece':2 'seeker':4 'vein':6 'world':3

I’m sure you’d hate it, but maybe a config option for an extra preg_replace fix that runs before the stripping, to ensure more consistency for those who actually need this (which might be rare). I tested with a preg_replace('/<[^>]*>/', ' ', …) myself, replacing every tag with a space, which resolved my test case, but at a cost, and it’s maybe not the best option. I’m not sure if I’m just small with my 2000+ feeds, or if I have too much server power, but generally whatever I’ve added filter-wise etc. does not seem to slow things down, for me at least.

i’m definitely not adding any options but allowing <li> shouldn’t cause any issues, other than maybe having “li” keyword in the search index.

strip_tags(str_replace('<', ' <', $bla));

^ actually is not a bad hack

Unless you can expect HTML such as:
hel<span class="mark">lo</span>

Every one of the suggested hacks (mine included) has a downside.

Regex replace with lookbehind/lookahead for word characters?

Jamie Zawinski just called…

It’s not that bad…

preg_replace( '/(\W)</', '$1 <', $s );

Replace a less-than with space-plus-less-than if the preceding character is a non-word character. I didn’t test this, but something similar should work.

e: or for the negative lookbehind: preg_replace( '/(?<!\w)\</', ' <', $s );
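A quick untested harness comparing the naive space-injection with the guarded variant, on both the list case and the mid-word case discussed above (nothing here is in tt-rss, it just exercises the two hacks):

<?php
        $cases = [
                '<ul><li>One Piece World Seeker</li><li>Code Vein</li></ul>',
                'hel<span class="mark">lo</span>',
        ];

        foreach ($cases as $html) {
                // naive: a space before every '<' (breaks the mid-word case)
                echo strip_tags(str_replace('<', ' <', $html)) . "\n";
                // guarded: a space only where '<' is not preceded by a word character
                echo strip_tags(preg_replace('/(?<!\w)</', ' <', $html)) . "\n";
        }
?>

Both should keep “Code Vein” searchable; only the guarded variant leaves “hello” intact.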

Of course a general replace on ‘<’ isn’t going to work, because that’s not the issue.

The issue is that semantically an LI element implies a line break when closed, and it is that break which normally separates the last word of the prior list item from the first word of the next one.

So just replace only the closing tag of an LI element with a newline (assuming that works for tsvector processing, else a space).
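Something like this, untested (the replacement has to happen before the tags are stripped; I haven’t checked whether tsvector processing prefers the newline or a plain space):

        // untested sketch: turn each closing </li> into a newline before
        // stripping, so list items can no longer run together
        $text = strip_tags(preg_replace('#</li\s*>#i', "\n", $entry_content));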

Or has anyone spotted the same issue with P elements and the like as well?

Edit: OK, so the preview lies about whether things are coming through literally or not…

It’s complicated because HTML can be embedded nearly anywhere. The middle of a word, the middle of a sentence, etc. And that’s assuming the source is standards-compliant. Non-standard stuff gets more tricky.

This is the kind of stuff a human can look at and just say, “Yeah, this goes here,” but a computer struggles with.

Personally, I’d just leave the code as is. But a workable solution is to find common tags that are used inline and adjust them accordingly.

You’ve precisely described the problem, and why none of these hacks covers 100% of the cases it intends to remedy: you cannot automatically infer whether an HTML element’s placement is mid-word or not, even if that happens relatively rarely. Even among inline elements you won’t know, which is why your last suggestion shares the same pitfall, regardless of proper semantic usage.

+1

It’s either that or trading the current flaw of this mechanism for another one (i.e. assuming no element ever sits mid-word).

… until fox adds machine-learning and a dictionary table to ttrss for just this particular problem. Which I’m sure will be any day now.

I rather think that’s throwing the baby out with the bath water.

The case of list elements is very clear cut. There’s an explicit break between last word of previous element and first word of next element, simply due to how any sane rendering of them works. Someone who’s bothered should test if things like paragraph elements suffer the same issue in this code path. Trying to apply it to span is likely misguided.
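For what it’s worth, a quick strip_tags()-level check suggests paragraphs merge the same way; whether that survives into the tsvector depends on the trailing punctuation:

<?php
        // same merge as with <li>: strip_tags() joins the paragraphs directly
        echo strip_tags('<p>First paragraph ends here</p><p>Next one starts</p>') . "\n";
        // -> First paragraph ends hereNext one starts
?>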

Agreed that trying to do this for the general case of just any (closing) tag is insane.

Sure, you could add exceptions for p and li elements. I have the same problem with hx elements, let’s add those too. My neighbor complained about div. Someone I just met on the street yelled “br!!”, he has a good point too. You know what, why not add exceptions for all block level elements?

What about span? Are you some kind of span-racist? It’s just a general purpose inline element, with no semantic value, that is assigned block-level visuals or non-standard positions more often than any other inline elements…

What I’m saying is, you can’t cover all cases going this route. That may be okay; you could still resolve a lot of cases by giving e.g. the most frequently occurring block-level elements special treatment. But in doing so, you’re giving birth to an ugly hack meant to resolve what is essentially a tiny issue.

Personally, I’d steer clear of adding an ugly, arbitrarily selective hack.

i’m on the edge of my seat waiting for the first request to add a plugin hook in there

Or, assuming it reliably does what it says on the tin, this could be used: https://github.com/mtibben/html2text (a PHP library to convert HTML to formatted plain text).

After all, that’s what the problem case is: reliably extracting the plain text from HTML, conserving all implied whitespace, but only that.
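Roughly like this, assuming the composer package and the Html2Text class its README advertises (untested on my end):

<?php
        // sketch: assumes mtibben/html2text is installed via composer
        require 'vendor/autoload.php';

        $html = '<ul><li>One Piece World Seeker</li><li>Code Vein</li></ul>';
        $converter = new \Html2Text\Html2Text($html);
        echo $converter->getText() . "\n"; // list items should land on separate lines
?>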

Cue fox objecting to adding another dependency ;).

indeed i’m definitely not adding special-case hacks, plugin hooks, AI, blockchain, more libraries, etc because of tsvector being not completely accurate sometimes