Bring existing data to the Semantic Web

Expose LDAP directories to the Semantic Web with SquirrelRDF

Level: Intermediate

Wing Yung (wingyung@us.ibm.com), Software Engineer, IBM

01 May 2007

The Semantic Web promises a new era of easier data integration, but much existing data already lives in other formats. Converting all of it to RDF (the Semantic Web data format) would be a herculean undertaking, so exposing existing data as RDF is preferable. This article introduces core Semantic Web concepts and standards, and explains how to use the open source SquirrelRDF utility to expose an LDAP directory as a service that Semantic Web applications can consume.

The Semantic Web has promised a new age of data sharing and integration through use of a common flexible standard, RDF (Resource Description Framework). RDF's properties make it easy to merge data and query across data from different sources. A great amount of data exists in other formats, like XML, relational databases, and LDAP directories. RDF is flexible enough to represent any of these other formats. However, to convert existing data to RDF is a huge and expensive task, and is unnecessary in many cases. Several utilities are available to expose existing data as a Web endpoint queryable through SPARQL, the query language of the Semantic Web. One of these utilities is SquirrelRDF, an open source utility (see Resources for a link) that is part of the Jena Semantic Web framework.

The goal of this article is to explain the process of creating a SPARQL-queryable endpoint for an LDAP directory, introducing important Semantic Web concepts along the way. After setting up the endpoint, I will show how to use some of Jena's Java™ classes to make the endpoint more useful before you query it from a browser-based client using JavaScript.

What is the Semantic Web?

The Semantic Web is an emerging area of technology that is based on a set of standards for representing, querying, and applying rules to data. The core standards are RDF for representation, SPARQL for querying, RDFS for structuring, and OWL for structuring and reasoning. The Semantic Web opens the door to many compelling benefits, including easier data integration, more precise search, better knowledge management, and much more, which has led to the term Semantic Web becoming very heavily loaded (see Resources for more about Semantic Web standards).

RDF serves as the foundation upon which the Semantic Web is built: it is a standard to represent data as a directed, labeled graph. A resource is an entity that is labeled with a URI, which is globally unique and resolvable. The nodes of the graph are resources and literals, and they are connected by directed edges, which are labeled with predicates. The graph can be serialized, listing every edge in the graph. Each edge, called a statement, has a subject (where the edge points from), a predicate (the edge's label), and an object (where the edge points to). Because every statement has a subject, predicate, and object, it is also referred to as a triple. The subject of every statement must be a resource. Predicates are also resources. The object of a statement can be a resource or a literal.

There are several important differences between RDF and XML. First of all, RDF is graph-based, while XML is tree-based. RDF has no explicit ordering because the edges form a set, but XML elements do have ordering. Finally, RDF is a data model rather than a single serialization format: it can be serialized to several formats, including RDF/XML, N3, Terse RDF Triple Language (Turtle; see Resources), and others. The example in Listing 1, which encodes some contact information for two people, is in Turtle.


Listing 1. Example RDF
                
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://wingerz.com/who#wing> a foaf:Person ;
        foaf:name       "Wing C. Yung" ;
        foaf:mbox       <mailto:wing@example.com> ;
        foaf:phone      "1-555-555-5555" ;
        foaf:knows      <http://thefigtrees.net/lee/ldf-card#LDF> .

<http://thefigtrees.net/lee/ldf-card#LDF> a foaf:Person ;
        foaf:name       "Lee Feigenbaum" ;
        foaf:mbox       <mailto:lee@example.com> ;
        foaf:phone      "1-555-555-5556" .
	

The first line defines a prefix for the data, so that <http://xmlns.com/foaf/0.1/name> can be abbreviated as foaf:name. You can specify multiple prefixes. Turtle also uses the semicolon (;) to indicate that the predicate and object that follow share the subject of the preceding statement. The predicate a is an abbreviation for the RDF type predicate (<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>), which designates that a resource is of a certain type.
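To make the subject, predicate, and object structure concrete, here is a minimal Java sketch that models two of the statements from Listing 1 and writes them out one edge per line in an N-Triples-like form. The TripleSketch class and its methods are hypothetical names used for illustration; this is not Jena's API, and literal escaping is deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a tiny model of RDF statements, not Jena's API. */
class TripleSketch {

    /** One statement (edge): subject and predicate are URIs; the object is a URI or a literal. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;
        final boolean objectIsLiteral;

        Triple(String subject, String predicate, String object, boolean objectIsLiteral) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.objectIsLiteral = objectIsLiteral;
        }

        /** Writes the edge as one line; literals are quoted, resources are bracketed. */
        String toLine() {
            // Assumption: the literal contains no characters that need escaping.
            String obj = objectIsLiteral ? "\"" + object + "\"" : "<" + object + ">";
            return "<" + subject + "> <" + predicate + "> " + obj + " .";
        }
    }

    /** Two statements taken from Listing 1: one literal object, one resource object. */
    static List<Triple> sampleStatements() {
        String foaf = "http://xmlns.com/foaf/0.1/";
        String wing = "http://wingerz.com/who#wing";
        List<Triple> graph = new ArrayList<>();
        graph.add(new Triple(wing, foaf + "name", "Wing C. Yung", true));
        graph.add(new Triple(wing, foaf + "knows", "http://thefigtrees.net/lee/ldf-card#LDF", false));
        return graph;
    }

    /** Serializes a graph by listing every edge. */
    static String serialize(List<Triple> graph) {
        StringBuilder out = new StringBuilder();
        for (Triple t : graph) {
            out.append(t.toLine()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(serialize(sampleStatements()));
    }
}
```

The first statement has a literal object and the second a resource object, which matches the rule that only the object of a statement can be a literal.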

While RDF is quite freeform, you can use OWL (Web Ontology Language) to impose structure on the data by defining a predicate vocabulary for concepts and rules about the predicates (see Resources for more about OWL). One example of an OWL ontology is Friend of a Friend (FOAF), which is used to express RDF data about contact information and relationships with others (see Resources for more about FOAF). Ontologies can define data classes (like foaf:Person), restrict the types of the subject and object (for example, the subject and object of foaf:knows must both be of type foaf:Person), and impose cardinality restrictions on predicates. Use popular ontologies like FOAF whenever possible because common structures and vocabulary facilitate data integration.

Querying RDF with SPARQL

SPARQL is the standard query language for the Semantic Web. Its syntax is similar to that of SQL, and SPARQL queries consist of a series of triple patterns and modifiers. SPARQL was designed to be capable of querying multiple data sources distributed across the Web.

To try the queries I'll discuss, use the general purpose SPARQL processor (see Resources). Paste http://wingerz.com/dw/listing1.ttl into the Target graph URI field. Once you enter a query and click Get Results, the Turtle file will be fetched and used to query against.

SPARQL has four types of queries:

  • SELECT: Returns a set of variable bindings that satisfy the query (similar to SQL SELECT). Good to produce data for application consumption.
  • CONSTRUCT: Returns a graph (a set of RDF statements). Good to retrieve and transform RDF.
  • ASK: Returns a boolean value that indicates whether a solution to the query exists.
  • DESCRIBE: Implementation-specific. Takes a resource as input and returns a graph describing that resource.

Listing 2 illustrates a SELECT query.


Listing 2. SPARQL SELECT example
                
Query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>  
SELECT ?person ?phone 
WHERE {
	?person foaf:name "Wing C. Yung" . 
	?person foaf:phone ?phone
}

Result:
----------------------------------------------------
| person                        | phone            |
====================================================
| <http://wingerz.com/who#wing> | "1-555-555-5555" |
----------------------------------------------------

CONSTRUCT is quite important because it allows you to construct RDF graphs from the results of the SPARQL query. This can come in handy during data merging because the predicates used in the CONSTRUCTed graph are not required to be the same as the ones in the original graph. In Listing 3, you query for the foaf:phone, but a new graph is constructed with a different predicate in its place (wingerz:officephone).


Listing 3. SPARQL CONSTRUCT example
                
Query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>  
PREFIX wingerz: <http://wingerz.com/>  
CONSTRUCT {
	?person foaf:name "Wing C. Yung" . 
	?person wingerz:officephone ?phone
} WHERE {
	?person foaf:name "Wing C. Yung" . 
	?person foaf:phone ?phone
}

Result:
<?xml version="1.0"?>
<rdf:RDF
    xmlns:wingerz="http://wingerz.com/"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://wingerz.com/who#wing">
    <wingerz:officephone>1-555-555-5555</wingerz:officephone>
    <foaf:name>Wing C. Yung</foaf:name>
  </rdf:Description>
</rdf:RDF>

If you try the CONSTRUCT query, view the page source to see the full response. Unfortunately, the result is serialized as RDF/XML, an inelegant serialization of RDF.

Deriving benefits from storing data in RDF

At first glance, RDF data looks a little bit cumbersome and quite verbose. However, it does have several important benefits.

  • RDF data is self-describing. An RDF graph contains data and structure. Resources are often resolvable. If I were given the RDF from Listing 1 and didn't know how foaf:phone is defined, I could visit http://xmlns.com/foaf/0.1/phone in a Web browser.

  • The structure of the data has no restrictions. XML is hierarchical, and modeling (and querying) graph structures in relational databases is clunky.

  • SPARQL doesn't require you to design a data access interface. Search APIs (say, one for employees) tend to be quite limited or overly complicated.

  • Data merges easily. Merging data (different graphs) is a simple operation that involves constructing a set of triples that includes all of the triples from the graphs. Globally unique resources reduce ambiguity. Also, you can apply OWL rules to map resources with different URIs to the same URI, if necessary.

  • While structure can be added (with RDFS and OWL), the structure is not explicitly enforced. Adding unspecified properties to a resource does not invalidate the data. It also should not break existing code that interacts with the data.
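The merge described in the list above is just a set union over triples. The following Java sketch illustrates this under simplifying assumptions: each triple is a three-element list of strings, and the GraphMergeSketch class is a hypothetical name, not part of Jena. The wingerz:officephone statement reuses the predicate from Listing 3.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: merging two RDF graphs as a union of triple sets. */
class GraphMergeSketch {

    /** A triple is modeled as a fixed list: subject, predicate, object. */
    static List<String> triple(String subject, String predicate, String object) {
        return Arrays.asList(subject, predicate, object);
    }

    /** The merged graph contains every triple from both inputs, with duplicates collapsed. */
    static Set<List<String>> merge(Set<List<String>> first, Set<List<String>> second) {
        Set<List<String>> merged = new LinkedHashSet<>(first);
        merged.addAll(second);
        return merged;
    }

    /** Merges two small graphs that describe the same resource. */
    static Set<List<String>> demo() {
        String wing = "http://wingerz.com/who#wing";
        String foaf = "http://xmlns.com/foaf/0.1/";

        Set<List<String>> contacts = new LinkedHashSet<>();
        contacts.add(triple(wing, foaf + "name", "Wing C. Yung"));
        contacts.add(triple(wing, foaf + "phone", "1-555-555-5555"));

        Set<List<String>> office = new LinkedHashSet<>();
        // Same statement as in the first graph: it appears only once after the merge.
        office.add(triple(wing, foaf + "name", "Wing C. Yung"));
        office.add(triple(wing, "http://wingerz.com/officephone", "1-555-555-5555"));

        return merge(contacts, office);
    }

    public static void main(String[] args) {
        for (List<String> t : demo()) {
            System.out.println(t);
        }
    }
}
```

Because both graphs use the same URI for the person, the merged graph ends up with three statements about one resource, with no join keys or schema alignment needed.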

Getting started with SquirrelRDF

The Jena Semantic Web Framework includes components for RDF storage and query execution. SquirrelRDF is a tool that makes data in relational databases and LDAP directories queryable through SPARQL. This article focuses on the LDAP component of SquirrelRDF. (See Resources for more about these projects.)

The structure of an LDAP directory lends itself very nicely to transformation into RDF. Every LDAP object class has a set of properties. Some of these properties point to literal values (like names) while others point to other objects [like work locations specified by distinguished names (DN)].

Download the source code for SquirrelRDF. You will also need Jena (download the recommended version). The SquirrelRDF home page details everything you will need (see Resources).

After installation, the first step is to find the schema for your LDAP store. If you do not have an LDAP store to experiment with, you can install OpenLDAP and follow a tutorial on setting up a simple address book (see Resources), although this might not be worth the trouble because the goal of this article is to leverage existing data sources. The schema contains all of the different object classes and their properties. By examining the properties, you can determine the relationships between objects in the directory.

In this example, you'll use a very simple LDAP schema. You have a Person class that includes some basic attributes (like name and phone number) and also points to an OfficeLocation, which is also an object class. An OfficeLocation has a name, two street address fields, a city, and a state. Person also has a manager attribute that points to another Person.

Now, create a mapping from the LDAP schema to RDF. SquirrelRDF uses an RDF file to express this mapping. The lmap:server predicate allows you to specify the location of the LDAP store. The mapping allows the designation of two types of RDF predicates: those that have literal objects and those that have resource objects. Basic attributes like name and phone number will map to predicates with literal objects. For the mapping, you need the LDAP property name and the RDF predicate name. Once you have those, link them to a resource. Suppose you want to map LDAP's cn to RDF's foaf:name. In Listing 4, you create a resource to link it to (:namemapping), then link that to the configuration resource (<>).


Listing 4. SquirrelRDF literal mapping
                
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lmap: <http://jena.hpl.hp.com/schemas/ldapmap#> .
# Declare the empty prefix so that :namemapping resolves relative to this file.
@prefix : <#> .
<> lmap:mapsProp :namemapping .
:namemapping lmap:property foaf:name 
	; lmap:attribute "cn" .

Note that the URI of the resource linking everything together is not important; you can replace it with a blank node, which is a resource with no URI. This is different from a resource with an empty URI, which you also see here. An empty URI is a relative URI, so it will resolve to the location of the Turtle file.

In Listing 5, you replace the resource with a blank node and use square bracket notation to show that the two enclosed statement fragments have the same blank node subject. The square brackets appear after the lmap:mapsProp predicate, meaning that the blank node is the object of that statement.


Listing 5. SquirrelRDF literal mapping with blank node
                
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lmap: <http://jena.hpl.hp.com/schemas/ldapmap#> .
<> lmap:mapsProp 
	[ lmap:property foaf:name 
	; lmap:attribute "cn" ] .

By default, an LDAP property is mapped to a predicate that has a literal object. To map an LDAP property that points to a distinguished name, designate the blank node to be of the lmap:ObjectProperty type. This ensures that the assigned predicate points to a resource (and not a literal). Listing 6 shows the completed configuration file, including two lmap:ObjectProperty mappings. Note that it is good practice to use existing predicates whenever possible; in this case, you use some vocabulary from the FOAF ontology. There is a vocabulary for locations as well, but you do not use it here.


Listing 6. SquirrelRDF (dw.ttl) mapping
                
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lmap: <http://jena.hpl.hp.com/schemas/ldapmap#> .
@prefix people: <http://wingerz.com/people#> .
@prefix ol: <http://wingerz.com/officelocation/rdf#> .

<> a lmap:Map ;
	lmap:server <ldap://wingerz.com:389/ou=people,o=wingerz.com> ;

	# Person properties
	lmap:mapsProp 
		[ lmap:property foaf:name 
		; lmap:attribute "cn" ; ] ;
	lmap:mapsProp 
		[ lmap:property foaf:phone 
		; lmap:attribute "telephoneNumber" ; ] ;
	lmap:mapsProp 
		[ lmap:property people:ol 
		; lmap:attribute "officeLocation" 
		; a lmap:ObjectProperty ; ] ;
	lmap:mapsProp 
		[ lmap:property people:manager 
		; lmap:attribute "manager" 
		; a lmap:ObjectProperty ; ] ;
	
	# OfficeLocation properties
	lmap:mapsProp 
		[ lmap:property ol:address1 
		; lmap:attribute "address1" ; ] ;
	lmap:mapsProp 
		[ lmap:property ol:address2 
		; lmap:attribute "address2" ; ] ;
	lmap:mapsProp 
		[ lmap:property ol:city 
		; lmap:attribute "city" ; ] ;
	lmap:mapsProp 
		[ lmap:property ol:state 
		; lmap:attribute "state" ; ] ;
	lmap:mapsProp 
		[ lmap:property ol:postalCode 
		; lmap:attribute "postalCode" ; ] ;
.
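Conceptually, the mapping in Listing 6 pairs each LDAP attribute with an RDF predicate and marks some of them as pointing to resources. The Java sketch below shows that idea by turning one LDAP entry into statements. It is an illustration only: SquirrelRDF itself answers SPARQL queries by converting them to LDAP queries, and the LdapMappingSketch class, the sample distinguished names, and the ldap:/// subject URIs are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: applies an attribute-to-predicate mapping to one LDAP entry. */
class LdapMappingSketch {
    static final String FOAF = "http://xmlns.com/foaf/0.1/";
    static final String PEOPLE = "http://wingerz.com/people#";

    /** LDAP attribute name to RDF predicate, mirroring part of Listing 6. */
    static final Map<String, String> PREDICATES = new LinkedHashMap<>();
    /** Attributes whose values are distinguished names (lmap:ObjectProperty in Listing 6). */
    static final Set<String> OBJECT_PROPERTIES = new LinkedHashSet<>();

    static {
        PREDICATES.put("cn", FOAF + "name");
        PREDICATES.put("telephoneNumber", FOAF + "phone");
        PREDICATES.put("manager", PEOPLE + "manager");
        OBJECT_PROPERTIES.add("manager");
    }

    /** Hypothetical URI scheme for an entry; the real tool may name resources differently. */
    static String uriFor(String dn) {
        return "<ldap:///" + dn + ">";
    }

    /** Emits one statement per mapped attribute; unmapped attributes are skipped. */
    static List<String> toTriples(String dn, Map<String, String> attributes) {
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            String predicate = PREDICATES.get(attr.getKey());
            if (predicate == null) {
                continue;
            }
            // Object properties point to another entry; everything else is a literal.
            String object = OBJECT_PROPERTIES.contains(attr.getKey())
                    ? uriFor(attr.getValue())
                    : "\"" + attr.getValue() + "\"";
            triples.add(uriFor(dn) + " <" + predicate + "> " + object + " .");
        }
        return triples;
    }

    /** A made-up entry with three mapped attributes and one unmapped attribute. */
    static List<String> sampleTriples() {
        Map<String, String> entry = new LinkedHashMap<>();
        entry.put("cn", "Wing C Yung");
        entry.put("telephoneNumber", "1-555-555-5555");
        entry.put("manager", "uid=lee,ou=people,o=wingerz.com");
        entry.put("mail", "wing@example.com"); // no mapping, so no statement
        return toTriples("uid=wing,ou=people,o=wingerz.com", entry);
    }

    public static void main(String[] args) {
        for (String line : sampleTriples()) {
            System.out.println(line);
        }
    }
}
```

The manager attribute produces a statement whose object is another resource, which is what lets a SPARQL query follow the link from a person to that person's manager.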

Trying the mapping

SquirrelRDF provides a command-line utility for running SPARQL queries. This is a good way to test things before you go any further. Store the query in Listing 7 in a file.


Listing 7. Test query (test.rq)
                
PREFIX foaf: <http://xmlns.com/foaf/0.1/>  
SELECT ?person ?phone 
WHERE {
	?person foaf:name "Wing C Yung" . 
	?person foaf:phone ?phone
}

Run squirrelrdf.Query with dw.ttl (the name of the mapping file) and test.rq (the name of the query file) as arguments. You should get a URI for the person and a phone number.

For something a little bit more ambitious, try Listing 8, which queries for all of the employees who work in a particular state (this might be quite large). Note that this query pulls out the person's name and city.


Listing 8. Test query (test2.rq)
                
PREFIX foaf: <http://xmlns.com/foaf/0.1/>  
PREFIX people: <http://wingerz.com/people#>
PREFIX ol: <http://wingerz.com/officelocation/rdf#>
SELECT ?person ?city ?name 
WHERE {
	?person foaf:name ?name . 
	?person people:ol ?officelocation .
	?officelocation ol:state "MA" ;
		ol:city ?city
}

Setting SquirrelRDF up as a service

You'll want to set up SquirrelRDF as an HTTP service so that a Web client can use it. SquirrelRDF is distributed with a simple Servlet for exposing an HTTP service endpoint (squirrelrdf.Servlet). Set it up in your favorite Servlet container (see Resources). To test it, URL-encode the query so that it can be passed as the query parameter in the URL, then visit http://localhost:8080/squirrel?query=... (the exact location depends on your server configuration).
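If you would rather not encode the query by hand, the standard java.net.URLEncoder class can do it. The sketch below builds the request URL for a given endpoint; the QueryUrlSketch class and buildQueryUrl method are hypothetical names, and the endpoint address is the example location used above, so adjust it to your own server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Illustrative only: builds the URL for a SPARQL query passed in the query parameter. */
class QueryUrlSketch {

    /** Appends the URL-encoded query to the endpoint as the "query" parameter. */
    static String buildQueryUrl(String endpoint, String sparql) {
        try {
            return endpoint + "?query=" + URLEncoder.encode(sparql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    public static void main(String[] args) {
        String sparql = "PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
                + "SELECT ?person ?phone WHERE { "
                + "?person foaf:name \"Wing C Yung\" . "
                + "?person foaf:phone ?phone }";
        // Assumed endpoint location; it depends on your Servlet container setup.
        System.out.println(buildQueryUrl("http://localhost:8080/squirrel", sparql));
    }
}
```

URLEncoder applies form encoding, so spaces become plus signs and characters such as ?, {, and } become percent escapes.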

The default Servlet is quite minimal; it can only answer SELECT queries, and it returns the results as a result set represented in a standard XML format. To make the SPARQL endpoint more useful, extend it to support some additional features.

Executing queries

Before you add any more functionality, set up a better way to test the endpoint, because URL-encoding queries and constructing URLs by hand is cumbersome. The general-purpose SPARQL processor used in the first examples won't work because it fetches entire RDF graphs on which to execute queries; the SquirrelRDF service is a SPARQL endpoint that can answer queries itself (by converting them to LDAP queries). What you need is a simple HTML form to write and submit queries. One example is a more full-featured SPARQLer (see Resources). Click the Graphs tab, enter your endpoint (probably something like http://localhost:8080/squirrel), and click Change. This sets the endpoint for query submission. Try testing one of the queries you saw earlier in this article.

Supporting other types of SPARQL queries

As mentioned before, CONSTRUCT queries are quite important, especially for data that is not natively stored in RDF. Currently, the Servlet only supports SELECT queries, but the underlying query engine can execute all four types of queries. To extend the existing Servlet, change the doQuery() method, as shown in Listing 9. The com.hp.hpl.jena.query.Query object knows what sort of query it is, so you can use that to determine what com.hp.hpl.jena.query.engine1.QueryEngine should do. Note that different query types do not return the same type of data: ASK returns a boolean, CONSTRUCT and DESCRIBE return graphs (technically, they return Models, which wrap graphs), and SELECT returns a result set.


Listing 9. Handling other types of SPARQL queries
                
Query q = QueryFactory.create(theQuery, ".", Syntax.syntaxSPARQL);
int queryType = q.getQueryType();
Model m = null;
switch(queryType){
case Query.QueryTypeAsk:
	boolean b = qe.execAsk();
	String str = ResultSetFormatter.asXMLString(b);
	resp.setHeader("Content-Type", "application/xml");
	resp.getOutputStream().write(str.getBytes());
	break;
case Query.QueryTypeConstruct:
	// Gets a model.
	m = qe.execConstruct();
	resp.setHeader("Content-Type", "application/rdf+xml");
	// A Model can serialize itself. 
	// The serialization format can be passed in as an argument,
	// default is to write out as RDF/XML.
	m.write(resp.getOutputStream());
	break;
case Query.QueryTypeDescribe:
	// DESCRIBE also returns a Model, but it must be run with execDescribe().
	m = qe.execDescribe();
	resp.setHeader("Content-Type", "application/rdf+xml");
	m.write(resp.getOutputStream());
	break;
case Query.QueryTypeSelect:
	ResultSet results = qe.execSelect();
	resp.setHeader("Content-Type", "application/xml");
	ResultSetFormatter.outputAsXML(resp.getOutputStream(), results);
	break;
}

Getting JSON output

JSON is a data serialization format that a JavaScript interpreter can evaluate to create an object. JSON is popular as a Web data access format because it allows JavaScript browser-based applications to consume structured data without having to parse XML (see Resources). Listing 10 shows the query results from Listing 2 as JSON.


Listing 10. JSON query results
                
{
  "head": {
    "vars": [ "person" , "phone" ]
  } ,
  "results": {
    "distinct": false ,
    "ordered": false ,
    "bindings": [
      {
        "person": { "type": "uri" , "value": "http://wingerz.com/who#wing" } ,
        "phone": { "type": "literal" , "value": "1-555-555-5555" }
      }
    ]
  }
}

The com.hp.hpl.jena.query.ResultSetFormatter utility class is used to output SPARQL query results. In Listing 9 it is used to output results as XML (for both SELECT and ASK queries). You can also use it to write results as RDF. Not surprisingly, you can also use this utility class to output the SELECT result bindings as JSON, as shown in Listing 11.


Listing 11. Adding JSON support for SPARQL queries
                
String output = req.getParameter("output");
// Compare from the constant so that a missing "output" parameter (null) is safe.
if ("json".equals(output)){
	resp.setHeader("Content-Type", "application/json");
	ResultSetFormatter.outputAsJSON(resp.getOutputStream(), results);
}

Supporting XSLT

Another approach to extracting data from the XML of a SPARQL result set is to apply an XSLT stylesheet. Listing 12 shows an example of SPARQL result set XML. The <head> element contains all of the variables and the <results> element contains a list of results, each of which contains bindings.

You can add an XSLT service to your server, process the query results through an external XSLT service, or add a link to the stylesheet in the XML output. To add a link to the stylesheet in the XML output, pass the com.hp.hpl.jena.query.ResultSetFormatter an extra argument with the URI of the XSLT, as shown in Listing 13. Note that this will not perform the transformation for you; it includes a link to the XSLT in the generated document. When the document is read by an XSLT processor (like a modern Web browser), the transformation will be applied (see Resources).


Listing 12. SPARQL result XML
                
<?xml version="1.0"?>
<sparql
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xs="http://www.w3.org/2001/XMLSchema#"
    xmlns="http://www.w3.org/2005/sparql-results#" >
  <head>
    <variable name="person"/>
    <variable name="phone"/>
  </head>
  <results ordered="false" distinct="false">
    <result>
      <binding name="person">
        <uri>http://wingerz.com/who#wing</uri>
      </binding>
      <binding name="phone">
        <literal>1-555-555-5555</literal>
      </binding>
    </result>
  </results>
</sparql>


Listing 13. Including an XSLT
                
String stylesheet = req.getParameter("xslt");
if (stylesheet != null)
	ResultSetFormatter.outputAsXML(resp.getOutputStream(), results, stylesheet);
else
	ResultSetFormatter.outputAsXML(resp.getOutputStream(), results);
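
If you prefer the first option, performing the transformation on the server, the standard javax.xml.transform API is enough for a basic version. The following sketch applies a stylesheet to a trimmed copy of the result XML from Listing 12 and extracts the phone numbers as plain text. The ResultXsltSketch class and the inline stylesheet are hypothetical and written only for this example; they are not part of SquirrelRDF or Jena.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Illustrative only: applies an XSLT stylesheet to SPARQL result XML on the server. */
class ResultXsltSketch {

    /** Hypothetical stylesheet: prints the value bound to ?phone for each result. */
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:sr=\"http://www.w3.org/2005/sparql-results#\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:for-each select=\"sr:sparql/sr:results/sr:result\">"
        + "<xsl:value-of select=\"sr:binding[@name='phone']/sr:literal\"/>"
        + "<xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** A trimmed copy of the result document shown in Listing 12. */
    static final String SAMPLE_RESULT =
          "<?xml version=\"1.0\"?>"
        + "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\">"
        + "<head><variable name=\"person\"/><variable name=\"phone\"/></head>"
        + "<results>"
        + "<result>"
        + "<binding name=\"person\"><uri>http://wingerz.com/who#wing</uri></binding>"
        + "<binding name=\"phone\"><literal>1-555-555-5555</literal></binding>"
        + "</result>"
        + "</results>"
        + "</sparql>";

    /** Runs the stylesheet over the result XML and returns the transformed text. */
    static String transform(String resultXml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(resultXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(transform(SAMPLE_RESULT, STYLESHEET));
    }
}
```

Note that the stylesheet must bind a prefix (sr here) to the SPARQL results namespace, because the elements in the result document are namespace-qualified and unprefixed XPath steps would not match them.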

Querying from JavaScript

At some point, you will want to build your own Web user interface (UI) or run SPARQL queries from your JavaScript application. For a freely available SPARQL JavaScript library, see Resources. It primarily works with JSON results from SPARQL SELECT queries. Fortunately, you enabled this support for your SPARQL service. One of the most powerful ideas behind the library is the idea of transformers. The JSON output from a SPARQL endpoint contains some data that a typical client will probably not use, like some of the datatype information. A transformer converts the JSON output into more natural objects for easy consumption.
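
The transformer idea is independent of JavaScript. As a rough illustration in Java, the sketch below takes bindings shaped like the "bindings" array in Listing 10 (already parsed into maps) and flattens each one into a plain variable-to-value map, dropping the type information. The BindingTransformerSketch class and its methods are hypothetical names; the actual library's transformers may behave differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: flattens SPARQL JSON-style bindings into simple rows. */
class BindingTransformerSketch {

    /** Keeps only the "value" of each bound term, discarding "type" and similar details. */
    static List<Map<String, String>> flatten(List<Map<String, Map<String, String>>> bindings) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (Map<String, Map<String, String>> binding : bindings) {
            Map<String, String> row = new LinkedHashMap<>();
            for (Map.Entry<String, Map<String, String>> variable : binding.entrySet()) {
                row.put(variable.getKey(), variable.getValue().get("value"));
            }
            rows.add(row);
        }
        return rows;
    }

    /** Builds one bound term, such as { "type": "uri", "value": "..." }. */
    static Map<String, String> term(String type, String value) {
        Map<String, String> term = new LinkedHashMap<>();
        term.put("type", type);
        term.put("value", value);
        return term;
    }

    /** The single binding from Listing 10, represented as nested maps. */
    static List<Map<String, Map<String, String>>> sampleBindings() {
        Map<String, Map<String, String>> binding = new LinkedHashMap<>();
        binding.put("person", term("uri", "http://wingerz.com/who#wing"));
        binding.put("phone", term("literal", "1-555-555-5555"));
        List<Map<String, Map<String, String>>> bindings = new ArrayList<>();
        bindings.add(binding);
        return bindings;
    }

    public static void main(String[] args) {
        System.out.println(flatten(sampleBindings()));
    }
}
```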

Conclusion

With a little bit of effort, you made LDAP queryable with SPARQL. Making data sources look like RDF makes it much easier for an application to integrate this data with data from other sources, and it gives you a powerful means to query the data. While this article looked at exposing LDAP, you can handle other data sources and formats in a similar manner. You don't need to abandon existing data modeling and storage technologies in order to embrace the Semantic Web and its benefits.



Resources

Learn

Get products and technologies
  • SquirrelRDF project: Download this tool to make relational databases and LDAP directories queryable with SPARQL.

  • Jena project: Build Semantic Web infrastructure and applications with this Java framework, which provides a programmatic environment for RDF, RDFS, OWL, and SPARQL, and includes a rule-based inference engine.

  • Jena ARQ project: Get a query engine that supports the SPARQL RDF query language.

  • OpenLDAP project: Try an open-source implementation of LDAP.

  • Jetty project: Get a Servlet container with this open source, standards-based, full-featured Web server implemented entirely in Java.

  • Tomcat project: Try a Servlet container that is used in the official Reference Implementation for the Java Servlet and JavaServer Pages technology.

  • SPARQL query tool: Experiment with this demo provided by the author.

  • SPARQL JavaScript library: Try this useful library for browser-based apps.

  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.


About the author

Wing Yung is a software engineer. He has worked on Semantic Web technologies for the past few years. His group recently open-sourced several Semantic Web infrastructure components. He writes on his blog regularly, occasionally about technology.



