Friday, May 6, 2011

Regex - Match (only) words with mixed chars

Hi all! :)

I'm writing my anti-spam/bad-words filter and I need, if it is possible, to match (detect) only words formed by mixed characters, like fr1&nd$ and not friends.

Is this possible with regex?

best regards!

From stackoverflow
  • You could build some regular expressions like the following:

    \p{L}+[\d\p{S}]+\S*
    

    This will match any sequence of one or more letters (\p{L}+, see Unicode character properties), followed by one or more digits or symbols ([\d\p{S}]+), followed by any remaining non-whitespace characters (\S*).

    $str = 'fr1&nd$ and not friends';
    preg_match('/\p{L}+[\d\p{S}]+\S*/', $str, $match);
    var_dump($match);
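
    preg_match stops at the first hit. A minimal sketch for collecting every mixed word in a string, assuming UTF-8 input; the sample string and the /u modifier are additions for illustration and are not part of the answer above:

    ```php
    <?php
    // Illustrative sample string (an assumption, not from the original question).
    $str = 'fr1&nd$ and not friends, but h3llo too';

    // Same pattern as above; /u treats pattern and subject as UTF-8 so that
    // \p{L} also covers non-ASCII letters.
    preg_match_all('/\p{L}+[\d\p{S}]+\S*/u', $str, $matches);

    // $matches[0] holds the full matches in order of appearance.
    print_r($matches[0]);
    ```

    This should flag fr1&nd$ and h3llo. Note that the pattern needs a letter first and then a digit or a Unicode symbol. Characters such as &, @ and * are classed as punctuation (\p{P}) rather than symbols, so a word like a&s would slip through unless \p{P} is added to the bracket, at the cost of also flagging ordinary words followed by a comma or period.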
    
  • It is possible. The regex rules will not be very pretty, but you can match basically any pattern that you can describe using regex. The tricky part is describing it.

    I would guess that you would keep a bunch of regex rules, one per bad word.

    For example, to detect fr1&nd$, friends, or fr*ends you can use a regex like:

    /fr[1iI*][&eE]nd[s$Sz]/

    Writing a rule like this for each word will find every variation built from the characters listed in the brackets. Pick up a regex guide for more info.

    (I'm assuming that for a bad-words filter you would want to catch friend as well as frie**; you may want to mask the bad word along with all its possible permutations.)
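
    One way to organise such per-word rules is a small lookup table that is checked in a loop. This is only a sketch: the $rules array and the findBadWords function are hypothetical names introduced here, and the single rule reuses the pattern shown above.

    ```php
    <?php
    // Hypothetical rule table: filtered word => hand-written pattern.
    $rules = [
        'friends' => '/fr[1iI*][&eE]nd[s$Sz]/',
    ];

    // Return the names of all rules that match somewhere in $text.
    function findBadWords($text, array $rules)
    {
        $hits = [];
        foreach ($rules as $word => $pattern) {
            if (preg_match($pattern, $text) === 1) {
                $hits[] = $word;
            }
        }
        return $hits;
    }

    print_r(findBadWords('hello fr1&nd$', $rules));  // the "friends" rule fires
    print_r(findBadWords('hello friend', $rules));   // nothing fires: no trailing s
    ```

    As written, the rule is case-sensitive for the fixed letters and requires the final s-like character, so each rule has to be widened by hand if singular or upper-case forms should be caught too.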

    Chris Lutz : I got bored and did this once in Perl. The regexes do look pretty hideous, especially when you try to account for misspellings.
  • Of course it's possible with regex! You're not asking to match nested parentheses! :P

    But yes, this is the kind of thing regular expressions were built for. An example:

    /\S*[^\w\s]+\S*/
    

    This will match all of the following:

    @ss
    as$
    a$s
    @$s
    a$$
    @s$
    @$$
    

    It will not match this:

    ass
    

    Which I believe is what you want. How it works:

    \S* matches 0 or more non-space characters. [^\w\s]+ matches only the symbols (anything that isn't a word character or whitespace), and matches 1 or more of them (so at least one symbol character is required). Then the second \S* matches 0 or more non-space characters (symbols and letters).
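
    The behaviour described above can be checked quickly in PHP with preg_grep. A small sketch: the last two samples (fr1end and friends.) are additions made here to show two edge cases and are not from the answer.

    ```php
    <?php
    $pattern = '/\S*[^\w\s]+\S*/';

    // The samples listed above, plus two extra edge cases (assumptions).
    $samples = ['@ss', 'as$', 'a$s', '@$s', 'a$$', '@s$', '@$$', 'ass',
                'fr1end', 'friends.'];

    // preg_grep keeps the entries that match; array_values renumbers the keys.
    $flagged = array_values(preg_grep($pattern, $samples));
    print_r($flagged);
    ```

    The seven symbol-bearing samples are flagged and ass is not. Two things to be aware of: digits and underscores count as word characters, so fr1end is not flagged, while an ordinary word with trailing punctuation such as friends. is.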

    If I may be allowed to suggest a better strategy, in Perl you can store a regex in a variable. I don't know if you can do this in PHP, but if you can, you can construct a list of variables like such:

    $a = qr/[aA@]/;  # Perl: qr// stores a compiled regex matching all a-like symbols
    $b = qr/[bB]/;
    $c = qr/[cC(]/;
    # etc...
    

    Or:

    // PHP: patterns are plain strings, so store the fragments without delimiters
    $regex = array( 'a' => '[aA@]', 'b' => '[bB]', 'c' => '[cC(]', /* ... */ );
    

    So that way, you can match "friend" in all its permutations with:

    /$f$r$i$e$n$d/
    

    Or:

    "/{$regex['f']}{$regex['r']}{$regex['i']}{$regex['e']}{$regex['n']}{$regex['d']}/"
    

    Granted, the second one looks unnecessarily verbose, but that's PHP for you. I think the second one is probably the best solution, since it stores them all in a hash, rather than all as separate variables, but I admit that the regex it produces is a bit ugly.

    St. John Johnson : Awesome Regex + Explanation +1! Btw, Regex in PHP is stored in strings, so having variable permutations like you suggest is certainly possible.
    St. John Johnson : Actually, it might be interesting to write that into a function. Pass in a normal word, and it would reply with the correct regex to detect that word. Only issue I could see is something like W = \/\/ or anything multi-character.
    Chris Lutz : W = !(?:[wW]|\\/\\/)! (in my native Perl). It would be more difficult for things like W with multi-character matches, but certainly possible. A function could easily be written that goes through a string, character-by-character, and looks up a regex to match that character, and then assembles them all into one giant (horrible-looking) regex, which you can use to match that word. However, I don't use PHP often enough to do it. I might do it in Perl if the whim strikes me. Or whatever that expression is supposed to be.
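
    The function discussed in these comments could look roughly like the following in PHP. It is a sketch under assumptions: the $lookalikes table and the wordToRegex name are hypothetical, the table entries are illustrative rather than complete, and the code works byte by byte, so it assumes ASCII words. Storing literal alternatives and escaping them with preg_quote is a design choice made here so that multi-character look-alikes such as \/\/ need no hand-written escaping.

    ```php
    <?php
    // Hypothetical table of look-alike spellings per lowercase letter.
    // The entries are illustrative assumptions, not a complete list.
    $lookalikes = [
        'a' => ['a', '@', '4'],
        'e' => ['e', '3', '&'],
        'i' => ['i', '1', '!', '*'],
        's' => ['s', '$', 'z', '5'],
        'w' => ['w', '\/\/'],  // multi-character look-alike: \/\/
    ];

    // Build a case-insensitive pattern matching $word in any look-alike spelling.
    function wordToRegex($word, array $lookalikes)
    {
        $parts = [];
        foreach (str_split(strtolower($word)) as $char) {
            // Letters without a table entry only match themselves.
            $alternatives = isset($lookalikes[$char]) ? $lookalikes[$char] : [$char];
            // preg_quote escapes regex metacharacters and the "/" delimiter.
            $quoted = array_map(function ($alt) {
                return preg_quote($alt, '/');
            }, $alternatives);
            $parts[] = '(?:' . implode('|', $quoted) . ')';
        }
        return '/' . implode('', $parts) . '/i';
    }

    $pattern = wordToRegex('friends', $lookalikes);
    var_dump(preg_match($pattern, 'fr1&nd$'));  // 1 when the word is detected
    ```

    Because every letter becomes a non-capturing group of alternatives, the generated pattern is long and ugly, as predicted above, but it never has to be read by a human.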
  • Didn't test this thoroughly, but this should do it:

    (\w+)*(?<=[^A-Za-z ])
    
    Chris Lutz : This matches "a " (word followed by spaces).
    dr Hannibal Lecter : My bad :) I've changed it, the extra space should do it.
    Chris Lutz : I would go for tabs too, but this should work.
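
    Since the pattern above was posted without thorough testing, here is a different formulation of the same idea for comparison. It is not the answer's pattern but an alternative using two lookaheads, and it assumes ASCII letters and whitespace-delimited words.

    ```php
    <?php
    // Alternative pattern (an assumption, not the one from the answer above):
    // (?<!\S)               start of a whitespace-delimited word
    // (?=\S*[A-Za-z])       the word contains at least one ASCII letter
    // (?=\S*[^A-Za-z\s])    and at least one character that is not a letter
    // \S+                   consume the whole word
    $pattern = '/(?<!\S)(?=\S*[A-Za-z])(?=\S*[^A-Za-z\s])\S+/';

    preg_match_all($pattern, 'fr1&nd$ and not friends', $matches);
    print_r($matches[0]);
    ```

    Only fr1&nd$ is returned for this input. Words made purely of symbols or digits are skipped because the first lookahead demands a letter, and, like the other patterns in this thread, it will also flag a normal word that carries trailing punctuation.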
