Regular expressions containing the symbol "<" do not work in filters

When I try to add a regular expression containing the symbol “<” (which is necessary for e.g. lookbehind assertions) as a filtering condition, this is the result that I get:

[screenshot: 13]

I am running the latest Git version of tt-rss on macOS 10.14.5, PHP 7.3.7, MySQL 8.0.16.

this is because ‘<’ opens html tags which are stripped out
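
A minimal sketch of that effect, assuming the filter editor runs the submitted text through something equivalent to PHP's strip_tags(). The helper name survives_strip_tags is hypothetical and not part of tt-rss.

```php
<?php
// Hypothetical helper: reports whether a pattern passes through
// strip_tags() unchanged.
function survives_strip_tags(string $reg_exp): bool {
    return strip_tags($reg_exp) === $reg_exp;
}

// A lookbehind contains "<", which strip_tags() treats as the start of a tag.
var_dump(survives_strip_tags('(?<!@)example\.com')); // bool(false)

// A lookahead has no "<", so it is left alone.
var_dump(survives_strip_tags('(?!foo)bar'));          // bool(true)
```

Because the "tag" opened by the lookbehind is never closed, more than just the "<" character can be lost.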

So how do I use lookbehind assertions in regular expressions? Is there a way to escape them?

i guess you’ll have to find another way to solve your actual problem, whatever it might be.

no.

https://discourse.tt-rss.org/t/html-in-filters-not-possible-any-more/766

html not being allowed in filters has been reported before, but there are issues with enabling this, so it’s unlikely to happen.

What is your actual filter?
Are you putting it inside ()?
I used to use positive and negative lookbehinds all the time and had no issues.
Fox fixed the issue I was having with HTML being stripped.

really? as far as i remember html is still being stripped from filters.

The regexp I’m trying to add is this: (?<!не )ищу отношени(я|й)

It’s been a few years, but I remember having an issue and you tweaked something, after which I no longer had any problems with my filters using lookbehinds and lookaheads.

Now I no longer use look (ahead|behind), so I wasn’t aware of any issues.

I was trying to find the issue I submitted, but it was before the switch to Discourse. I’ll try to find it. Maybe I kept something locally on my box.

I found it on the “OLD FORUM”. It was from 2015.

I was using filter: (?<!peter )(parker) and the < was causing an issue.

See exchange here:
Posts from OLD FORUM

yeah i’m afraid this will get filtered currently, it was changed sometime after the PDO overhaul i think.

btw as a terrible workaround you can add (or update) whatever regular expression directly in the database; stripping only happens in the actual editor UI. as long as you don’t edit the filter afterwards it’ll work.

Not the only place it gets stripped out it seems…

[Screenshot from 2019-07-15 11-29-28]

Feed:

      <item>
        <title>Regular expressions containing the symbol &quot;&lt;&quot; do not work in filters</title>
        <dc:creator><![CDATA[@Avoozl]]></dc:creator>
        <description><![CDATA[ <p>So how do I use lookbehind assertions in regular expressions? Is there a way to escape them?</p> ]]></description>
        <link>https://discourse.tt-rss.org/t/regular-expressions-containing-the-symbol-do-not-work-in-filters/2609/3</link>
        <pubDate>Sun, 14 Jul 2019 17:17:50 +0000</pubDate>
        <guid isPermaLink="false">discourse.tt-rss.org-post-9248</guid>
      </item>
      <item>
        <title>Regular expressions containing the symbol &quot;&lt;&quot; do not work in filters</title>
        <dc:creator><![CDATA[@fox]]></dc:creator>
        <description><![CDATA[ <p>this is because ‘&lt;’ opens html tags which are stripped out</p> ]]></description>
        <link>https://discourse.tt-rss.org/t/regular-expressions-containing-the-symbol-do-not-work-in-filters/2609/2</link>
        <pubDate>Sun, 14 Jul 2019 17:17:09 +0000</pubDate>
        <guid isPermaLink="false">discourse.tt-rss.org-post-9247</guid>
      </item>
      <item>
        <title>Regular expressions containing the symbol &quot;&lt;&quot; do not work in filters</title>

that’s strange, if source properly escapes it to &lt; then it shouldn’t get removed by tt-rss element filter.

in any case, in these kinds of situations i think it’s better to remove too much from time to time than let something through.

e: looks like this is exclusive to title, where tt-rss uses php native strip_tags() instead of DOM filter, i think, maybe it’s a bit too aggressive.
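
The title case can be reproduced in isolation. This is only a sketch: it assumes the feed parser hands over the already-decoded title and that strip_tags() is then applied, as guessed above. html_entity_decode() stands in for the XML parser, and title_after_strip_tags is a hypothetical name.

```php
<?php
// Hypothetical helper: mimics "parse XML text, then strip_tags()" for a title.
function title_after_strip_tags(string $xml_text): string {
    // An XML parser would return the decoded text; this call stands in for it.
    $decoded = html_entity_decode($xml_text, ENT_QUOTES | ENT_XML1, 'UTF-8');
    // The properly escaped "&lt;" is a literal "<" again by the time it is stripped.
    return strip_tags($decoded);
}

// Title text exactly as it appears in the feed XML quoted above.
$raw = 'Regular expressions containing the symbol &quot;&lt;&quot; do not work in filters';
$title = title_after_strip_tags($raw);

var_dump(strpos($title, '<') === false); // bool(true): the "<" is gone
```

So correct escaping in the source does not help here, because the entity is decoded before the stripping step runs.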

I just hit this issue with a negative lookbehind when attempting to filter topics discussing a domain name but without matching email addresses at that domain: (?<!@)example.com.

It looks like it’s getting stripped right away in newrule():

$rule = json_decode(clean($_REQUEST["rule"]), true);

if ($rule) {
	$reg_exp = htmlspecialchars($rule["reg_exp"]);
	$filter_type = $rule["filter_type"];
	$feed_id = $rule["feed_id"];
	$inverse_checked = isset($rule["inverse"]) ? "checked" : "";
} else {
	$reg_exp = "";
	$filter_type = 1;
	$feed_id = ["0"];
	$inverse_checked = "";
}

Absent a dedicated workflow or system for controlling the lifecycle of user input to protect against stored XSS, I wonder if this string could be treated as an HTML-encoded string in the DB, with htmlspecialchars_decode used on it only when building the regex. That way, if it leaks, it’ll leak encoded, but it’ll still be usable for filters that use lookbehinds or match characters that are otherwise filtered.
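
A rough sketch of that idea, not how tt-rss actually stores or matches rules. The function names encode_rule_for_storage and rule_matches are hypothetical, and the "/…/iu" pattern construction is an assumption made for illustration.

```php
<?php
// Hypothetical: encode the pattern once, before it is stored.
function encode_rule_for_storage(string $reg_exp): string {
    return htmlspecialchars($reg_exp, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Hypothetical: decode only at the moment the regex is built.
function rule_matches(string $stored, string $text): bool {
    $reg_exp = htmlspecialchars_decode($stored, ENT_QUOTES | ENT_HTML5);
    // Naive delimiter escaping; assumes the pattern has no already-escaped "/".
    $pattern = '/' . str_replace('/', '\/', $reg_exp) . '/iu';
    return preg_match($pattern, $text) === 1;
}

$stored = encode_rule_for_storage('(?<!@)example\.com');

var_dump($stored);                                        // string(21) "(?&lt;!@)example\.com"
var_dump(rule_matches($stored, 'see example.com'));       // bool(true)
var_dump(rule_matches($stored, 'mail admin@example.com')); // bool(false)
```

Since the stored form contains no literal "<", it also passes through strip_tags() unchanged, and anything that echoes it into a page without decoding shows the encoded text rather than markup.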

so, if you remove clean there, does it save properly? i remember poking at this when this was originally posted but for some reason decided against changing anything, don’t remember why though.

having markup there shouldn’t really be that big a deal, the worst you could do is somehow script inject yourself, i think.