Level: Intermediate
Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought Inc.
06 Dec 2005
In previous installments of this column, Uche Ogbuji introduced the WordNet natural language database, and showed how to represent database nodes as XML and serve this XML through the Web. In this article, he shows how to convert this XML to an RDF representation, and how to use the WordNet XML server to enrich search engine technology.
In two previous installments -- "Querying WordNet as XML" and "Serving up WordNet as XML" -- I presented code for XML-based processing of the useful WordNet project. WordNet represents an important parallel to the central theme of this column: Thinking XML covers semantics in XML; WordNet provides a sketchy model of the semantics of natural language itself. The word sketchy here is not pejorative because millennia of philosophy of language have shown how hard (or even impossible) it is to come up with a truly rigorous model of natural language. The most widespread system in place for modeling the semantics of Web-based systems that include XML is RDF. This article thus closes the loop by presenting an RDF representation of the XML WordNet system I have presented so far. I also show how you can use the WordNet XML representation and server in search engine enhancement.
RDF from XML
To create an RDF/XML format, you can find all the necessary information content for WordNet in the XML representation discussed in the first article of this mini-series. I start in Listing 1 with the example of one of the synonym sets (synsets) for the word "selection".
Listing 1. First serialized synset associated with the word "selection"
<?xml version="1.0" encoding="UTF-8"?>
<noun xml:id="152253">
<gloss>the act of choosing or selecting; "your choice of colors was
unfortunate"; "you can take your pick"</gloss>
<word-form>choice</word-form>
<word-form>selection</word-form>
<word-form>option</word-form>
<word-form>pick</word-form>
<hypernym part-of-speech="noun" target="32816"/>
<frames part-of-speech="verb" target="653781"/>
<frames part-of-speech="verb" target="656613"/>
<frames part-of-speech="verb" target="652154"/>
<hyponym part-of-speech="noun" target="152613"/>
<hyponym part-of-speech="noun" target="152749"/>
<hyponym part-of-speech="noun" target="152898"/>
<hyponym part-of-speech="noun" target="153642"/>
<hyponym part-of-speech="noun" target="154057"/>
<hyponym part-of-speech="noun" target="170871"/>
<hyponym part-of-speech="noun" target="173378"/>
</noun>
Now comes the hard part: deciding which RDF schema to adopt. There has been a lot of activity around RDF representations of WordNet. The closest thing to an official effort is an early draft document "Wordnet in RDFS and OWL" published in mid-2004 by the W3C. However, this work remains very incomplete, and other people and organizations, including Chilean researcher Alvaro Graves, have worked to advance it. (See Resources for more on these initiatives.) I decided to create a lightweight RDF/XML representation that's designed for compatibility with the W3C effort, but without using the murkier corners of that effort. The equivalent of Listing 1 in this format is shown in Listing 2.
Listing 2. An RDF/XML representation of Listing 1
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:wn="http://uche.ogbuji.net/tech/rdf/wordnet/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<wn:SynSet rdf:about="noun/152253">
<wn:glossaryEntry>
the act of choosing or selecting; "your choice of colors was unfortunate";
"you can take your pick"
</wn:glossaryEntry>
<wn:wordForm>choice</wn:wordForm>
<wn:wordForm>selection</wn:wordForm>
<wn:wordForm>option</wn:wordForm>
<wn:wordForm>pick</wn:wordForm>
<wn:hypernym rdf:resource="noun/32816"/>
<wn:frames rdf:resource="verb/653781"/>
<wn:frames rdf:resource="verb/656613"/>
<wn:frames rdf:resource="verb/652154"/>
<wn:hyponym rdf:resource="noun/152613"/>
<wn:hyponym rdf:resource="noun/152749"/>
<wn:hyponym rdf:resource="noun/152898"/>
<wn:hyponym rdf:resource="noun/153642"/>
<wn:hyponym rdf:resource="noun/154057"/>
<wn:hyponym rdf:resource="noun/170871"/>
<wn:hyponym rdf:resource="noun/173378"/>
</wn:SynSet>
</rdf:RDF>
The W3C has not yet set up a namespace for WordNet in RDF, so I picked http://uche.ogbuji.net/tech/rdf/wordnet/ for now. I frame pointers between synsets using relationships with the same name as the WordNet pointers (hypernym, frames, etc.). The W3C interest group seems to lean towards specialized property names (such as hypernymOf), which I think is a bad idea because it requires a revision of the schema every time WordNet is updated with new pointer-based relationships.
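To make the shape of this vocabulary concrete, the following is a minimal sketch that reads an abbreviated copy of Listing 2 and lists the statements it contains. It assumes Python 3 and only the standard library's xml.etree.ElementTree; the function name synset_triples is hypothetical, and this is not a real RDF parser (it ignores the rdf:type statement implied by wn:SynSet and does not resolve relative URIs).

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
WN_NS = "http://uche.ogbuji.net/tech/rdf/wordnet/"

# Abbreviated copy of Listing 2 (two word forms and two pointers kept)
RDF_XML = """
<rdf:RDF xmlns:wn="http://uche.ogbuji.net/tech/rdf/wordnet/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <wn:SynSet rdf:about="noun/152253">
    <wn:wordForm>choice</wn:wordForm>
    <wn:wordForm>selection</wn:wordForm>
    <wn:hypernym rdf:resource="noun/32816"/>
    <wn:hyponym rdf:resource="noun/152613"/>
  </wn:SynSet>
</rdf:RDF>
"""

def synset_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for each wn:SynSet.

    Hypothetical helper: a simplification for illustration only.
    """
    root = ET.fromstring(rdf_xml.strip())
    triples = []
    for synset in root.findall("{%s}SynSet" % WN_NS):
        subject = synset.get("{%s}about" % RDF_NS)
        for prop in synset:
            # Drop the "{namespace}" part of the tag to get the local name
            predicate = prop.tag.split("}", 1)[1]
            resource = prop.get("{%s}resource" % RDF_NS)
            if resource is not None:
                # Pointer to another synset (hypernym, hyponym, frames, ...)
                triples.append((subject, predicate, resource))
            else:
                # Literal value such as a word form
                triples.append((subject, predicate, (prop.text or "").strip()))
    return triples

for triple in synset_triples(RDF_XML):
    print(triple)
```

Because the pointer properties reuse the WordNet pointer names, the loop needs no per-relationship cases: a pointer type it has never seen still comes through as an ordinary predicate.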
Making the transform
I have long been interested in determining how best to use XML itself as a source format for RDF models, rather than requiring everything to be in the very clumsy RDF/XML format. I touched on this in a previous installment (see Resources). The most practical approach I've found is to use XSLT to transform XML into RDF/XML, which is then imported into an RDF model. Ideally, you would do so using tools such as 4Suite's repository (see Resources) that don't make you worry about the intermediate RDF/XML at all. You've already seen (in Listing 2) the RDF/XML format to be used in WordNet, so all you need is the XSLT to go from the XML to the RDF/XML format. Listing 3 is just such a transform.
Listing 3. XSLT for converting from WordNet XML to RDF/XML format
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform version="1.0"
xmlns:xsl = "http://www.w3.org/1999/XSL/Transform"
xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:wn = "http://uche.ogbuji.net/tech/rdf/wordnet/"
xml:base = "http://uche.ogbuji.net/tech/rdf/wordnet/"
>
<xsl:output indent="yes"/>
<xsl:template match="/">
<rdf:RDF>
<xsl:apply-templates/>
</rdf:RDF>
</xsl:template>
<xsl:template match="noun|verb|adjective|adverb">
<wn:SynSet rdf:about="{name()}/{@xml:id}">
<xsl:apply-templates/>
</wn:SynSet>
</xsl:template>
<xsl:template match="gloss">
<wn:glossaryEntry><xsl:copy-of select="node()"/></wn:glossaryEntry>
</xsl:template>
<xsl:template match="word-form">
<wn:wordForm><xsl:value-of select="."/></wn:wordForm>
</xsl:template>
<xsl:template match="*">
<xsl:element namespace="http://uche.ogbuji.net/tech/rdf/wordnet/"
name="wn:{name()}">
<xsl:attribute namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
name="rdf:resource">
<xsl:value-of select="concat(@part-of-speech, '/', @target)"/>
</xsl:attribute>
</xsl:element>
</xsl:template>
</xsl:transform>
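For readers without an XSLT processor at hand, the same mapping can be approximated in plain code. The sketch below is a stand-in for Listing 3, not a replacement for it: it assumes Python 3 and the standard library's xml.etree.ElementTree, the function name synset_to_rdf is hypothetical, and the sample input is a shortened form of Listing 1.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
WN_NS = "http://uche.ogbuji.net/tech/rdf/wordnet/"
# ElementTree reports the xml:id attribute under the XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
SYNSET_TAGS = ("noun", "verb", "adjective", "adverb")

# Ask the serializer to use the same prefixes as Listing 2
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("wn", WN_NS)

def synset_to_rdf(synset_xml):
    """Convert WordNet synset XML to an RDF/XML string (hypothetical helper)."""
    source = ET.fromstring(synset_xml)
    rdf_root = ET.Element("{%s}RDF" % RDF_NS)
    # iter() includes the root itself, so a bare synset or a wrapper both work
    for synset in (e for e in source.iter() if e.tag in SYNSET_TAGS):
        about = "%s/%s" % (synset.tag, synset.get(XML_ID))
        node = ET.SubElement(rdf_root, "{%s}SynSet" % WN_NS,
                             {"{%s}about" % RDF_NS: about})
        for child in synset:
            if child.tag == "gloss":
                ET.SubElement(node, "{%s}glossaryEntry" % WN_NS).text = child.text
            elif child.tag == "word-form":
                ET.SubElement(node, "{%s}wordForm" % WN_NS).text = child.text
            else:
                # Any other child is treated as a pointer, as in the
                # catch-all template of the stylesheet
                target = "%s/%s" % (child.get("part-of-speech"),
                                    child.get("target"))
                ET.SubElement(node, "{%s}%s" % (WN_NS, child.tag),
                              {"{%s}resource" % RDF_NS: target})
    return ET.tostring(rdf_root, encoding="unicode")

# Shortened form of Listing 1
SAMPLE = (
    '<noun xml:id="152253">'
    '<gloss>the act of choosing or selecting</gloss>'
    '<word-form>choice</word-form>'
    '<hypernym part-of-speech="noun" target="32816"/>'
    '<hyponym part-of-speech="noun" target="152613"/>'
    '</noun>'
)

print(synset_to_rdf(SAMPLE))
```

As in the stylesheet, the final branch is generic, so a pointer element that the code has never seen still becomes a property of the same name.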
Giving search a jolt
In a much earlier article, I showed you how to use an RDF-based WordNet database to add some natural language power to application-specific search engines. I assembled an RDF representation such as that in Listing 2 from all the synsets in WordNet and performed a similar monolithic query of the resulting database. This time, however, I prefer to show a different approach and present code that recursively queries the XML from the WordNet server developed in the previous installment of this series.
The problem is this: given a search for a word, you want to extend that search to particular, closely related words. So if the user searches for the word "selection", the code should return results that contain words like "vote", "choice", and "ballot". This is a matter of querying the WordNet server starting from the search word, following pointers to related words, and then continuing recursively. For this demonstration, I restrict the pointers to hyponyms, which generally finds other words that represent more specific concepts related to the search word.
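The recursion itself can be sketched without any server. The following minimal Python 3 example walks hyponym pointers over a small in-memory table; the table contents, the synset keys such as noun/1, and the function name hyponym_closure are all hypothetical stand-ins, not real WordNet data.

```python
# Toy stand-in for WordNet: each key maps to the synset's word forms and
# its hyponym pointers. Keys and groupings are invented for illustration.
TOY_SYNSETS = {
    "noun/1": {"words": ["selection", "choice"], "hyponyms": ["noun/2", "noun/3"]},
    "noun/2": {"words": ["vote", "ballot"], "hyponyms": ["noun/4"]},
    "noun/3": {"words": ["casting"], "hyponyms": []},
    "noun/4": {"words": ["secret ballot"], "hyponyms": []},
}

def hyponym_closure(start, synsets):
    """Collect the word forms of a synset and of everything below it."""
    words = set()
    visited = set()

    def walk(key):
        # Skip synsets already seen, so shared or cyclic pointers
        # are followed only once
        if key in visited:
            return
        visited.add(key)
        entry = synsets[key]
        words.update(entry["words"])
        for pointer in entry["hyponyms"]:
            walk(pointer)

    walk(start)
    return words

print(sorted(hyponym_closure("noun/1", TOY_SYNSETS)))
```

The visited set is a design choice of this sketch: it keeps the walk from revisiting a synset that is reachable by more than one path.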
This problem requires one refinement over the code presented in my last installment -- specifically, I need the WordNet server to return raw XML representing synsets, and not just full word forms. Listing 4 is a specialization of the CherryPy Web server presented as Listing 1 of the last article. It allows you to get raw synset XML from URLs based on WordNet pointers, for example http://localhost:8080/raw/pointer/noun/5955443.
Listing 4. Specialization of the CherryPy Web server presented in previous column
import cherrypy
from picket import Picket, PicketFilter
from wnxmllib import *

class root:
    _cpFilterList = [ PicketFilter(defaultStylesheet="viewword.xslt") ]

class wordform_handler:
    def __init__(self, applyxslt=False):
        self.applyxslt = applyxslt
        return

    @cherrypy.expose
    def default(self, word):
        synsets = serialized_synsets_for_word(word)
        result = ''.join(synsets) #Concatenate strings in result list
        #Wrap up the XML fragments into a full document
        wordxml = '<word-senses text="'+word+'">'+result+'</word-senses>'
        if self.applyxslt:
            picket = Picket()
            picket.document = wordxml
            return picket #apply the XSLT and return the result
        return wordxml

class pointer_handler:
    def __init__(self, applyxslt=False):
        self.applyxslt = applyxslt
        return

    @cherrypy.expose
    def default(self, pos, target):
        synset = getSynset(pos, int(target))
        synsetxml = serialize_synset(synset)
        if self.applyxslt:
            picket = Picket()
            picket.document = synsetxml
            return picket #apply the XSLT and return the result
        return synsetxml

cherrypy.root = root()
cherrypy.root.view = wordform_handler(applyxslt=True)
cherrypy.root.raw = wordform_handler()
cherrypy.root.pointer = pointer_handler(applyxslt=True)
cherrypy.root.raw.pointer = pointer_handler()
#Disable debugging messages in Web responses
cherrypy.config.update({'logDebugInfoFilter.on': False})
cherrypy.server.start()
Listing 5 is client code that takes a word and returns the hyponym chain as a Python set. It requires the WordNet XML server running with the updated code in Listing 4, and it requires 4Suite XML version 1.0b3 (see Resources). You can use another XML/XPath processing library, if you update the imports and API accordingly.
Listing 5. Client code that takes a word and returns the hyponym chain as a Python set
import sets
import urllib
from Ft.Xml import Parse

BASEURI = 'http://localhost:8080/'

def get_hyponym_chain(word):
    '''
    returns a set with all the hyponyms of a word, the hyponyms
    of those hyponyms, and so on, recursively
    '''
    accumulator = [] #Storage list for the hyponym chain

    def process(xml):
        '''
        extract the hyponym chain from a DOM node. Common processing for
        word-form and synset XML
        '''
        hyponyms = xml.xpath(u'//hyponym')
        wforms = [e.xpath(u'string()')
                  for e in xml.xpath(u'//word-form')]
        accumulator.extend(wforms)
        for hyponym in hyponyms:
            pos = hyponym.xpath(u'string(@part-of-speech)')
            target = hyponym.xpath(u'string(@target)')
            expand_hyponyms(pos, target, accumulator)
        return

    def expand_hyponyms(pos, target, accumulator):
        '''
        follow a pointer and extract the hyponym chain from the resulting XML
        '''
        synsetxml = Parse(BASEURI + 'raw/pointer/' + pos + '/' + target)
        process(synsetxml)
        return

    #escape any spaces or other problem characters in the word
    word = urllib.quote(word)
    wordxml = Parse(BASEURI + 'raw/' + word)
    process(wordxml)
    return sets.Set(accumulator) #eliminate dupes

if __name__ == '__main__':
    #If invoked from the command line, get the hyponym chain from the
    #word given in the command-line parameters
    import sys
    print get_hyponym_chain(' '.join(sys.argv[1:]))
You may need to edit BASEURI according to how you deploy your running WordNet XML server. By invoking this program with "selection" as the command-line parameter, you get the following result -- a set of Unicode strings (edited for formatting):
Set([u'selection', u'move', u'co-option', u'cut', u'secret ballot',
u'manoeuvre', u'juke', u'casting', u'survival', u'Haftarah', u'choice',
u'cutting', u'safe harbor', u'suicide pill', u'shark repellent',
u'scorched-earth policy', u'designation', u'delegacy', u'casting vote',
u'press cutting', u'demarche', u'epigraph', u'security', u'balloting',
u'precaution', u'tactical maneuver', u'fast one', u'split ticket',
u'clipping', u'stratified sampling', u'artifice', u'excerpt', u'greenmail',
u'election', u'parking', u'naming', u'write-in', u'extract', u'recognition',
u'pocket veto', u'survival of the fittest', u'decision', u'track', u'quote',
u'Haphtarah', u'shtik', u'schtik', u'maneuver', u'analecta', u'veto',
u'citation', u'analects', u'step', u'representative sampling', u'favorite',
u'colouration', u'sortition', u'drawing lots', u'pick', u'schtick', u'gambit',
u'stratagem', u'misquotation', u'cumulative vote', u'sampling', u'guard',
u'laying on of hands', u'fake', u'vote', u'gimmick', u'countermine', u'ploy',
u'intention', u'willing', u'call', u'lucky dip', u'way', u'footwork',
u'option', u'Haphtorah', u'feint', u'determination', u'safeguard',
u'security measures', u'quotation', u'press clipping', u'ruse', u'trick',
u'Haftorah', u'porcupine provision', u'shtick', u'measure', u'straight ticket',
u'twist', u'mimesis', u'appointment', u'volition', u'random sampling',
u'ordinance', u'ballot', u'poison pill', u'pac-man strategy',
u'proportional sampling', u'conclusion', u'multiple voting',
u'golden parachute', u'favourite', u'assignment', u'block vote', u'device',
u'nomination', u'coloration', u'casting lots', u'newspaper clipping',
u'co-optation', u'ordination', u'natural selection', u'tactical manoeuvre',
u'pleasure', u'voting', u'resolution', u'countermeasure'])
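A set like this can then be used to widen a keyword search. The following is a minimal sketch of that last step, assuming Python 3 and a plain list of document strings; the sample documents and the function names matches and expanded_search are hypothetical, and a real search engine would apply the expanded terms to its own index instead.

```python
import re

def matches(document, terms):
    """True if any term occurs in the document as a whole word or phrase."""
    text = document.lower()
    for term in terms:
        # Look-arounds stop 'vote' from matching inside 'devote'
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            return True
    return False

def expanded_search(documents, terms):
    """Return the documents that mention at least one of the terms."""
    return [doc for doc in documents if matches(doc, terms)]

# Invented sample documents
DOCS = [
    "The committee held a secret ballot on Tuesday.",
    "Her selection of colors was bold.",
    "Devote more time to testing.",
    "The weather was fine all week.",
]

# A few of the words from the hyponym chain shown above
TERMS = {"selection", "vote", "ballot", "choice"}

for doc in expanded_search(DOCS, TERMS):
    print(doc)
```

Matching on whole words matters here because the hyponym chain contains short, common strings such as "cut" and "way", which would otherwise match inside many unrelated words.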
One important issue to note: When I tried the big-RDF-database approach to such a synonym-driven search four years ago, I had a lot of trouble with performance. It took over two minutes to complete the example search for "selection" and hyponyms. When I tried this again on my present work computer, that approach took about a minute. However, the new approach presented in this article, using the WordNet XML server, takes less than a second to execute on the same computer, despite the fact that it involves several local HTTP requests. This is because the WordNet XML server approach takes advantage of the specialized hashing and indexing used by the WordNet database, rather than shredding that data into a more general RDF database. It deals with the XML on a small scale through a divide-and-conquer approach, rather than a monolithic query. Again, RDF is sometimes a good model for processing without being the most practical syntax, or even storage form, for a universal DBMS.
Wrap up
This wraps up my three-part discussion of how to use the WordNet natural language semantic database in XML and RDF applications. I hope I provided some tools that you can build on in such applications. Princeton's English language WordNet project is a tremendous public service, and is progressing impressively. I was disappointed to see that similar efforts in other languages are slow to emerge. EuroWordNet (see Resources), which was to do the same for several European languages, has been stalled since 2001, and I can't find any equivalent for East Asian or other language groups. If you are aware of any such initiatives, or have any other thoughts on the topic, please do share them on the Thinking XML discussion forum.
Resources
Learn
- Review the precursor articles to this one, "Querying WordNet as XML" (developerWorks, January 2005) and "Serving up WordNet as XML" (developerWorks, August 2005). See the Resources in these articles for additional background. For a review of an earlier exploration into processing WordNet as a monolithic RDF database, see "Basic XML and RDF techniques for knowledge management, Part 3" (developerWorks, November 2001).
- Explore WordNet, the English lexical database project. Among the related projects are database interfaces for many languages and platforms, most of which require that you separately download and install the WordNet 2.0 database package. If you've already used WordNet in some way, take a look at the changes in version 2.0.
- Learn more about WordNet in RDF. Several people associated with the W3C have made a series of efforts to port the WordNet database to RDF, with the latest work embodied in the RDF scheme project "Wordnet in RDFS and OWL". This work is still incomplete, but there have already been derivative projects such as Alvaro Graves's "An RDF Representation of WordNet."
- See "Basic XML and RDF techniques for knowledge management, Part 7" for more thoughts on keeping the useful RDF model independent of its problematic XML representation (developerWorks, July 2002).
- Keep an eye on the promising but stalled EuroWordNet project, described as "[b]uilding a multilingual database with wordnets for several European languages."
- Find more XML resources on the developerWorks XML zone, including previous installments of the Thinking XML column, one of which is my earlier article on WordNet and RDF. If you have comments on this article, or any others in this column, please post them on the Thinking XML forum.
- Learn how you can become an IBM Certified Developer in XML and related technologies.
About the author
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia or contact him at uche@ogbuji.net.