posted on Wednesday, September 27, 2006 4:02 PM by timbarcz

Lexical Scanning and Parsing Part 2

Yesterday in part one I discussed the honors project I did as a senior in college.  I left you with the statement that I had recently had the opportunity to use that "only academic" knowledge in the real world.

The company I work for builds a content management system.  I was recently asked to build a way to import content.  Now that's simple enough right? Right.  What's the difficulty and how does this fit in with Part 2 of Lexical Scanning and Parsing you ask?  The difficulty lies in being able to tell the CMS what the content is about.  Being that our company believes this idea of the World Wide Web we've built a way to interrelate pages based on what's in them.  If you have a page about a certain topic, then we think you might want to know about other articles that relate to the one you're looking at.  It's very similar Amazon's 'Customers who searched for "care bears" also expressed interest in: ' feature.  When I'm looking at buying the smash 1986 hit "Care Bears Movie II: New Generation" I want to know what other people like me also liked.

Yes, the difficulty, so how can you tell what a give blob of content is about?  You could read it and manually tell the system.  But when the pages start number over the 10's (try over 4,000) then it becomes a tedious, not-so-fun task.  So I set out to write a language by which we could figure out what to tell the system about the content blob.

First, we take a set of keywords about the content blob (luckily for me they're provided).  Currently though I'm reading some dissertation by some smart guy in Tokyo with some crazy algorithm to figure the keywords out by itself.  That's for version dos.  Back to version one.  I take these keywords and scan them using regular expressions.  I set up an expression that if evaluates to true, I tell our CMS system.

In plain English, I want to do the following:

If this article contains the words "care bears" then this article is about the Care Bears.

Using my knowledge of lexical scanning and parsing I set up a psuedo language that really evaluates boolean expressions.  It has support for AND, OR, XOR, NOT, and grouping parenthesis.  Precedence of the operators is also considered, just like C#.  Here is the Backus-Naur Form for how things are evaluated

 <b-expression> ::= <b-term> [<orop> <b-term>]*
 <b-term>       ::= <not-factor> [<andop> <not-factor>]*
 <not-factor>   ::= [<notop>] <b-factor>
 <b-factor>     ::= $(<b-expression>$) | <regular-expression>
 <orop>   ::= $| | $^
 <andop>  ::= $&
 <notop>  ::= $!

As you can hopefully see from the BNF, the following operators and grouping statements are defined as follows:

OR - $|

AND - $&

XOR - $^

NOT - $!

Begin Group - $(

End Group - $)

The reason operators are so cryptic is that they will separate regular expressions, so I needed combinations of characters that wouldn't be recognized as a regular expression.  I couldn't do '|' as OR because | is alternation in regular expressions.  '^' can be used to represent the start of the line or character negation.  And finally '(' and ')' are used  for grouping.  I added the '$' to the definition of each operator since $ represents the end of line in regular expression, $[plus any character] should never appear in a single line regular expression.  If I've lost you I'm sorry, trust me on this.

Here is an example of a real life expression one of our guys wrote:

$(Cat00138 $& $!speech$) $| $(Cat00192  $& $!eyes$) $| KidsHealthCat20030 $| $($(ear $| nose $| nasal $| throat$)  $& $!$(hearing $| speech $| piercing$)$)

Basically, it's says:

If this article is in (Category00138 AND NOT speech) OR (Category00192 AND NOT eyes) OR Category20030 OR ((ear OR nose OR nasal OR throat) AND NOT (hearing or speech OR piercing))

Though it's kind of complex, this system allows us flexibility like never before.  We can set up any number of regular expressions and join them together with other regular expressions using boolean operators and grouping.  If the whole expression evaluates to true for a given expression, we know quite a bit about an article.

The other positive is this system is fast.  I can scan all 4,000 documents, running about 100+ expressions on each in a few seconds.

If you've read this far you deserve a cookie.  If you've actually understood, you deserve two.  If you didn't follow, please let me know through the comments and I'll explain better.  I want people to understand, this is some cool stuff, and it points out that "worthless academic" knowledge isn't always worthless and academic.

 

Comments