Sunday, April 3, 2011

Regular expression help

I have the following method in some nemerle code:

private static getLinks(text : string) : array[string] {
    def linkrx = Regex(@"<a\shref=['|\"](.*?)['|\"].*?>");
    def m = linkrx.Matches(text);
    mutable txmatches : array[string];
    for (mutable i = 0; i < m.Count; ++i) {
        txmatches[i] = m[i].Value;
    }
    txmatches
}

The problem is that the compiler, for some reason, is trying to parse the brackets inside the regex string, and it's causing the program not to compile. If I remove the @ (which I was told to put there), I get an invalid escape character error on the "\s".

Here's the compiler output:

NCrawler.n:23:21:23:22: error: when parsing this `(' brace group
NCrawler.n:23:38:23:39: error: unexpected closing bracket `]'
NCrawler.n:22:57:22:58: error: when parsing this `{' brace group
NCrawler.n:23:38:23:39: error: unexpected closing bracket `]'
NCrawler.n:8:1:8:2: error: when parsing this `{' brace group
NCrawler.n:23:38:23:39: error: unexpected closing bracket `]'
NCrawler.n:23:38:23:39: error: unexpected closing bracket `]'

(line 23 is the line with the regex code on it)

What should I do?

From stackoverflow
  • I don't know Nemerle, but it seems like using @ disables all escapes, including the escape for the ".

    Try one of these:

    def linkrx = Regex("<a\\shref=['\"](.*?)['\"].*?>");
    
    def linkrx = Regex(@"<a\shref=['""](.*?)['""].*?>");
    
    def linkrx = Regex(@"<a\shref=['\x22](.*?)['\x22].*?>");
    
    CMS : Just for the record, that feature is called "verbatim string literals".
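
As a rough cross-check outside Nemerle, here is a minimal Python sketch. Python is an assumption here, chosen only because it is easy to run; its raw strings are similar to, but not the same as, verbatim string literals (there is no doubled-quote escape, so that variant is not shown). The sketch suggests that the backslash-escaped form and the `\x22` form describe the same regex. The sample input and the names `escaped`, `hex_quote`, and `sample` are made up for illustration.

```python
import re

# Regular string: the backslash and the double quote are escaped for Python.
escaped = "<a\\shref=['\"](.*?)['\"].*?>"

# Raw string: backslashes reach the regex engine untouched, and \x22 lets
# the regex engine (not the string literal) stand for the double quote.
hex_quote = r"<a\shref=['\x22](.*?)['\x22].*?>"

# Hypothetical sample input with one double-quoted and one single-quoted href.
sample = "<a href=\"http://example.com/\">example</a> <a href='/local'>local</a>"

# findall returns the contents of the single capture group for each match.
links_escaped = re.findall(escaped, sample)
links_hex = re.findall(hex_quote, sample)
print(links_escaped)
print(links_hex)
```

Both calls should yield the same list of link targets, which is the behaviour the Nemerle variants above are aiming for.
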
  • The problem is with the quotation marks, not the brackets. In Nemerle, as in C#, you escape a quotation mark with another quotation mark, not a backslash.

    @"<a\shref=['""](.*?)['""].*?>"
    

    EDIT: Note as well that you don't need the pipe inside the square brackets; the contents are treated as a set of characters (or ranges of characters), with the OR being implied.
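
The point about the pipe can be checked with a small sketch. This uses Python rather than Nemerle (an assumption made purely so the example is runnable), but character classes behave the same way in both regex flavours as far as this point goes. The names `with_pipe` and `without_pipe` are illustrative.

```python
import re

# Inside square brackets, | is just another member of the set.
with_pipe = re.compile("['|\"]")     # matches ', | or "
without_pipe = re.compile("['\"]")   # matches ' or "

# The class with the pipe accepts a literal | character...
pipe_is_matched = with_pipe.fullmatch("|") is not None
# ...while the class without it rejects | but still accepts both quotes.
pipe_is_rejected = without_pipe.fullmatch("|") is None
quotes_still_match = all(without_pipe.fullmatch(q) for q in ("'", '"'))

print(pipe_is_matched, pipe_is_rejected, quotes_still_match)
```

So leaving the pipe in does not break the match on quotes; it only widens the set to also accept `|` as a delimiter, which is probably not intended.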

  • I'm not a Nemerle programmer, but I know that you should ALWAYS use an XML parser for XML-based data, not regexps.

    I guess someone has created a DOM or XPath library for Nemerle, so you can access either

    //a[@href] via XPath or something like a.href.value via DOM.

    The current regexp doesn't match, for example,

    <a class="foo" href="something">bar</a>
    

    I didn't test this but it should be more like it

    /<a\s.+?href=['|\"]([^'\">]+)['|\"].+?>/i
    
    Alan Moore : Did the OP say he was parsing XML? All I see is that he's applying a regex to some strings that look like HTML anchor tags. As for the possible presence of other attributes before 'href', I would assume he knows that won't happen; it's his data, after all.
    The.Anti.9 : Well, he is incorrect about the XML part, but he is right about the regex: it does need to account for a class attribute there.
    Alan Moore : That's true in general, but we're talking about a specific situation. The more you generalize the regex, the more complicated it becomes. If you give someone a robust, general-purpose regex that's totally incomprehensible to them, are you really helping them?
    kiamlaluno : I think the answer should be specific to the question being asked; IMO, it would be fine to report that it would be better to use a DOM parsing library, after replying to the question asked from the OP. I agree that to generalize too much doesn't help who asked the question; if the OP wanted a more generic answer, then he would have asked a more generic question.
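
To make the trade-off discussed above concrete, here is a minimal sketch comparing the question's regex with a parser-based approach. It is written in Python with the standard library's `html.parser`, which is an assumption: the thread does not name a Nemerle or .NET parsing library, and `HrefCollector` is a hypothetical helper, not something from the original code.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; order of attributes is irrelevant.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# The anchor tag from the answer above, with an attribute before href.
sample = '<a class="foo" href="something">bar</a>'

# The question's pattern expects href immediately after "<a" and one whitespace character.
original = r"<a\shref=['\"](.*?)['\"].*?>"
regex_links = re.findall(original, sample)

collector = HrefCollector()
collector.feed(sample)
collector.close()
parser_links = collector.hrefs

print(regex_links, parser_links)
```

On this input the regex finds nothing, while the parser still reports the link target. Whether that extra robustness is worth it depends, as the comments say, on how well the input data is known.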
