
Regexp arcana



OK, this may be a little off-topic, but I'll tie it in.

(define (do-s s)
  (regexp-match "\"([^\"]*)\"(.*)" s))

The regular expression was taken from the implementation of imap in the net
collection. That's how it's tied in. Now for the arcana :-)

I'm trying to figure out what the dang thing does, so I'm running some
experiments on it.

First of all, from the docs:
"
...
Atom ::= (Regexp)		Match any sub-expression Regexp
...
"

Pardon me for being dense, but what is meant by "Match any sub-expression
Regexp"?

So without knowing what the roundy parens do in the regexp, let me attempt
to figure out what our expression above does:

[^\"] ---> Should "Match any character not in Range", so I suppose that
anything but a quote will do.

Now,

[^\"]* ---> the Kleene star I presume. Hence, anything but a quote, zero or
more times. Stuff like this:

		this is a sentence without a quote in it

I expect it will match the whole dang thing. Whereas with
		This has an " embedded quote
a test shows that I get everything up to the quote:
		This has an
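
For the record, a minimal sketch of that experiment. The name do-r is made
up for illustration; it applies the bare character-class pattern with
regexp-match, the same call used in do-s above.

```scheme
;; do-r is a hypothetical helper, named to parallel do-s above.
;; It matches zero or more non-quote characters, starting at the
;; leftmost position where a match is possible.
(define (do-r s)
  (regexp-match "[^\"]*" s))

(do-r "this is a sentence without a quote in it")
(do-r "This has an \" embedded quote")
```

The first call returns a one-element list holding the whole sentence; the
second returns a one-element list holding everything before the quote,
trailing space included.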

\"[^\"]*\" ---> like before, but the whole deal should have quotes around
it:
		"this is a sentence without a quote"

This next sentence is interesting:

		"this is " a sentence with three quotes"

It will grab everything up to the second quote, but not the whole thing
'cause quotes in the middle aren't allowed:

		"this is "

So what about this?:
"\"([^\"]*)\""

Let me try an English description: Any substring of the given string such
that the first and last characters of the substring are quotes, but no other
characters in the substring are quotes.

So I write a little routine and try it out:

(define (do-w w)
  (regexp-match "\"([^\"]*)\"" w))

> (do-w "this one has no quotes")
#f

This is what I expected. No quotes and hence no match.

> (do-w "\"quotes around the outside\"")
("\"quotes around the outside\"" "quotes around the outside")

Now I'm confused. The first match was expected, while the cadr of the list
has no quotes! Why do I get the second match?

> (do-w "some stuff and then \" something with quotes around it\" then more stuff")
("\" something with quotes around it\"" " something with quotes around it")

Once again, I'm hip to the first match but confused by the cadr.
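
One way to check where the cadr comes from is to run the same pattern with
and without the roundy parens. This is only a sketch: do-w-plain is a
made-up name for the paren-free variant, and do-w is repeated so the block
runs on its own.

```scheme
;; Same pattern as do-w, but without the round parens.
;; do-w-plain is a hypothetical name used only for this comparison.
(define (do-w-plain w)
  (regexp-match "\"[^\"]*\"" w))

;; do-w as defined earlier in this message.
(define (do-w w)
  (regexp-match "\"([^\"]*)\"" w))

(do-w-plain "\"abc\"")
(do-w "\"abc\"")
```

The paren-free version returns a one-element list (just the quoted text),
while do-w returns two elements. So the extra element appears to be
whatever the part inside the parens matched, which would explain why it has
no quotes.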

> (do-w "this one \" has three \" embedded quotes \" and more")
("\" has three \"" " has three ")

OK, I'll take the first match. But aren't there other substrings which
should match? For example, \" embedded quotes \" is also a substring that
fits the regexp. Why only the first one?
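
A sketch of one way to poke at this, assuming regexp-match-positions and
the optional start argument to regexp-match are available in your version.
The names pat, s, first-pos and second-try are made up for the experiment.

```scheme
;; Pattern and input from the last experiment above.
(define pat "\"([^\"]*)\"")
(define s "this one \" has three \" embedded quotes \" and more")

;; regexp-match-positions reports where the match sits instead of
;; returning the matched text; the car of the result is the
;; (start . end) pair for the whole match.
(define first-pos (car (regexp-match-positions pat s)))

;; Search again, starting one character past the first match's
;; opening quote, via regexp-match's optional start argument.
(define second-try (regexp-match pat s (+ 1 (car first-pos))))
```

Here second-try comes back with \" embedded quotes \" as its first element,
so that substring does fit the pattern. It looks as though regexp-match
simply reports the leftmost match and stops there.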

Thanks.