Hi all! :)
I'm writing an anti-spam/bad-words filter and I need, if possible,
to match (detect) only words formed from mixed characters, like fr1&nd$ and not friends.
Is this possible with regex?
Best regards!
-
You could build a regular expression like the following:
\p{L}+[\d\p{S}]+\S*
This will match any sequence of one or more letters (\p{L}+, see Unicode character properties), followed by one or more digits or symbols ([\d\p{S}]+) and any following non-whitespace characters (\S*).
$str = 'fr1&nd$ and not friends';
preg_match('/\p{L}+[\d\p{S}]+\S*/', $str, $match);
var_dump($match);
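If you want every mixed word in a message rather than just the first, a minimal sketch could use preg_match_all with the same pattern. The helper name find_mixed_words() is made up for illustration, and the u modifier is an addition that assumes the input is UTF-8:

```php
<?php
// Sketch: collect every token that starts with letters and then contains
// a digit or symbol. find_mixed_words() is a hypothetical helper name.
function find_mixed_words(string $text): array
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \p{L} also covers non-ASCII letters.
    preg_match_all('/\p{L}+[\d\p{S}]+\S*/u', $text, $matches);
    return $matches[0]; // full matches only
}

var_dump(find_mixed_words('fr1&nd$ and not friends'));
```

One thing to keep in mind: & and @ fall under Unicode punctuation (\p{P}), not symbols (\p{S}), so fr&nd on its own would not be caught by this pattern. The sample word only matches because of the 1.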
-
It is possible. You will not have very pretty regex rules, but you can match basically any pattern that you can describe using regex. The tricky part is describing it.
I would guess that you would have a bunch of regex rules to detect bad words, one per word.
To detect fr1&nd$, friends, or fr*ends you can use a regex like:
/fr[1iI*][&eE]nd[s$Sz]/
Doing something like this for each rule will find all the variations of possible characters in the brackets. Pick up a regex guide for more info.
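As a rough sketch of what "one rule per word" could look like in PHP, here is a loop over a small rule table. The names $rules and contains_bad_word() are invented for this example, and the table holds only the single pattern from above:

```php
<?php
// Sketch: one hand-written pattern per bad word, checked in a loop.
// $rules and contains_bad_word() are hypothetical names for illustration.
$rules = [
    'friends' => '/fr[1iI*][&eE]nd[s$Sz]/',
];

function contains_bad_word(string $text, array $rules): ?string
{
    foreach ($rules as $word => $pattern) {
        if (preg_match($pattern, $text) === 1) {
            return $word; // name of the first rule that matched
        }
    }
    return null; // no rule matched
}

var_dump(contains_bad_word('my fr1&nd$ say hi', $rules));
var_dump(contains_bad_word('nothing here', $rules));
```

Each bracket matches exactly one character, so a variant that drops a letter entirely (fr*nd, for example) needs its own rule or an optional group.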
(I'm assuming that for a bad-words filter you would want friend as well as frie**; you may want to mask the bad word as well as all possible permutations.)
Chris Lutz : I got bored and did this once in Perl. The regexes do look pretty hideous, especially when you try to account for misspellings.
-
Of course it's possible with regex! You're not asking to match nested parentheses! :P
But yes, this is the kind of thing regular expressions were built for. An example:
/\S*[^\w\s]+\S*/
This will match all of the following:
@ss as$ a$s @$s a$$ @s$ @$$
It will not match this:
ass
Which I believe is what you want. How it works: \S* matches 0 or more non-space characters. [^\w\s]+ matches only the symbols (anything that isn't a word character or whitespace), and matches 1 or more of them (so a symbol character is required). Then the \S* again matches 0 or more non-space characters (symbols and letters).
If I may be allowed to suggest a better strategy: in Perl you can store a regex in a variable. I don't know if you can do this in PHP, but if you can, you can construct a list of variables like so:
$a = qr/[aA@]/; # regex that matches all a-like symbols
$b = qr/[bB]/;
$c = qr/[cC(]/;
# etc...
Or:
$regex = array(
    'a' => '[aA@]',
    'b' => '[bB]',
    'c' => '[cC(]',
    // ...
);
So that way, you can match "friend" in all its permutations with:
/$f$r$i$e$n$d/
Or:
"/{$regex['f']}{$regex['r']}{$regex['i']}{$regex['e']}{$regex['n']}{$regex['d']}/"
Granted, the second one looks unnecessarily verbose, but that's PHP for you. I think the second one is probably the best solution, since it stores them all in a hash, rather than all as separate variables, but I admit that the regex it produces is a bit ugly.
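The lookup-table idea can be pushed one step further: rather than writing out each word's pattern by hand, a small function could assemble it from the table. This is only a sketch; $lookalikes and word_to_regex() are hypothetical names, and the table entries are examples rather than a complete substitution list:

```php
<?php
// Sketch: build a pattern for a plain word from a per-letter lookup table.
// $lookalikes and word_to_regex() are hypothetical names; the table is
// deliberately tiny and would need an entry for every letter you care about.
$lookalikes = [
    'a' => '[aA@]',
    'e' => '[eE3&]',
    'i' => '[iI1!*]',
    's' => '[sS$5]',
];

function word_to_regex(string $word, array $lookalikes): string
{
    $parts = [];
    foreach (str_split(strtolower($word)) as $char) {
        // Characters without a table entry fall back to matching
        // themselves in either case.
        $parts[] = $lookalikes[$char]
            ?? '[' . preg_quote($char . strtoupper($char), '/') . ']';
    }
    return '/' . implode('', $parts) . '/';
}

$pattern = word_to_regex('friends', $lookalikes);
var_dump($pattern);
var_dump(preg_match($pattern, 'fr1&nd$'));
var_dump(preg_match($pattern, 'fr1&nd'));
```

Because each table value is just a regex fragment, an entry does not have to be a single character class; a multi-character look-alike could be written as an alternation group, provided any delimiter characters inside it are escaped.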
St. John Johnson : Awesome Regex + Explanation +1! Btw, Regex in PHP is stored in strings, so having variable permutations like you suggest is certainly possible.
St. John Johnson : Actually, it might be interesting to write that into a function. Pass in a normal word, and it would reply with the correct regex to detect that word. Only issue I could see is something like W = \/\/ or anything multi-character.
Chris Lutz : W = !(?:[wW]|\\/\\/)! (in my native Perl). It would be more difficult for things like W with multi-character matches, but certainly possible. A function could easily be written that goes through a string, character-by-character, and looks up a regex to match that character, and then assembles them all into one giant (horrible-looking) regex, which you can use to match that word. However, I don't use PHP often enough to do it. I might do it in Perl if the whim strikes me. Or whatever that expression is supposed to be.
-
Didn't test this thoroughly, but this should do it:
(\w+)*(?<=[^A-Za-z ])
Chris Lutz : This matches "a " (word followed by spaces).
dr Hannibal Lecter : My bad :) I've changed it, the extra space should do it.
Chris Lutz : I would go for tabs too, but this should work.