OmegaSoft Homepage  |  Products & Services  |  About OmegaSoft


 

OmegaSoft Development Blog

CFG and DCG notation in Natural Language Processing

11 January 2007 - 04:19 PM

 

This really has very little to do with anything happening at OmegaSoft, but recently I've been doing my own research into a little natrual language processing rule.

 

Called CFG and DCG notation. Prolog fans will probably already know a little about this subject anyway.

 

Natual language is obviously a mixed bag when it comes to analysing, because it can change drastically.

 

One keen area of research is POS tagging, which stands for Part of Speech (POS). This is where you break a sentence or a collection of word-forms into its grammatical units.

 

Take for example:

 

"the cat saw the dog"

 

The POS for this sentence is:

 

[determiner, noun, verb, determiner, noun]

 

That is (in a simple overview);

the = determiner (det)

cat = noun (noun)

saw = verb (verb)

the = determiner (det)

dog = noun (noun)

 

Note the short had in brackets, as short hand is often used in this type of research.

 

But actually, the sentence can be simplified!

 

The sentence is actually a 'noun phrase' followed by a 'verb phrase'.

 

Noted by: S -> NP, VP

 

So, whats the noun phrase? It is: "the cat" (phrase describing something with a name)

 

The verb phrase is: "saw the dog" (phrase describing what the agent (the cat) is actually doing.

 

Note, S used above means start symbol - or [simply] sentence.

 

But also see, that the VP also has a noun phrase in it, 'the dog'.

 

So this can be broken down into both a verb 'saw' and a noun phrase 'the sog'

 

What does this look like in the form of something called a 'parse tree'.

 

 

This syntax three is half way to CFG notation. But obviously we need to describe the tree in word form, using a collection of simbols as computers don't really understand pretty pictures at this level.

 

So this is what the notation look like. Have a glance at the figure above and see how each symbol on the tree is described in the CFG as a set of 'rules'.

 

S -> NP VP

NP -> det n

VP -> v NP

det -> a

det -> the

n -> cat

n -> dog

v -> saw

 

(Note, that although the structure in the tree had two NPs, they both branched identically. So we don't really need to include a second identical rule. As one rule fits all (where the structure is the same).

 

DCG notation (this is the machine readable form for a prolog application) is fairly similar to CFG, but does differ slightly. See below.

 

DCG:

 

S --> np, vp

np --> det, n

vp --> v, np

det --> [a]

det --> [the]

n --> [dog]

n --> [cat]

v --> [saw]

 

Esentially, this breaking down of words helps a computer understand what the sentence actually means, by looking at the POS rather than what the words characteristically say.

 

POS tells quite a lot about a phrase, and what the phrase is trying to say.

 

Seb Harvey

 

Back to blog



© Copyright 2008  |  Terms of Use  |  OmegaSoft Homepage  |  Feedback