Level: Introductory
Brian Goetz (brian@quiotix.com), Principal Consultant, Quiotix
22 Mar 2005
XQuery is a W3C standard for extracting information from XML documents, currently spanning 14 working drafts. While the majority of interest in XQuery is centered around querying large bases of semi-structured document data, XQuery can be surprisingly effective for some much more mundane uses as well. In this month's Java theory and practice, columnist Brian Goetz shows you how XQuery can be used effectively as an HTML screen-scraping engine.
Last month, Java™ technology guru Sam Pullara was showing me his latest
Java-enabled phone, the Nokia 6630. It is crammed full of technology
-- an embedded JVM, GPRS, Bluetooth -- but it suffers from the same
problem that plagues all smart phones -- limited screen real estate.
Some Web sites have support for phone-based browsers, and embedded
browsers try to render pages effectively on small screens, but trying
to view a typical Web page on a phone screen is a lot like trying to
squeeze an elephant into the back seat of your car (to the
dissatisfaction of everyone -- you, the car, and the elephant). Sam
had built a simple, elegant solution for screen-scraping data from his
favorite Web sites and reformatting them for small-screen display.
A novel approach
You can use a number of approaches to extract data from HTML
documents. I really liked the approach Sam took, which was to use
XQuery as both a screen-scraping tool (to extract the relevant data
from the pages) and as a stylesheet tool (to reformat the data so it
fits nicely on the page without scrolling). With a small amount of
infrastructure and some pretty simple XQuery expressions, it became
possible to extract the relevant data -- such as traffic, weather, and
financial quotes -- out of numerous data sources and
display it nicely on the phone.
I've often been in the situation where screen-scraping HTML pages
seemed a sensible solution for a particular problem, but there are
very few Java-based toolkits for screen scraping. Many HTML
parsing tools are available, but they generally lack sufficient
abstractive capability (making screen-scraping code messy), are
limited by the widespread use of poorly conforming HTML, and deal
poorly with dynamically generated pages whose structure may change
over time.
To bridge the gap between poor-quality HTML and the rich set of
XML-processing tools, you first need to convert the HTML into XML. A
number of tools can help you do this; the JTidy toolkit does a good job
and makes it easy. JTidy is designed to read in typical-quality (that
is to say, bad) HTML and output something cleaner (you have a choice
of output options). It also exposes the cleaned-up document through a
DOM interface, which XML tools can then consume. The code in Listing 1
will read in an HTML document from an InputStream and
generate a DOM representation of the document:
Listing 1. Code to convert HTML into an XML-compatible DOM with JTidy
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
Document tidyDOM = tidy.parseDOM(inputStream, null);
With this simple transformation, you can process almost any Web page
as an XML document, and you can apply your favorite XML tools for
extracting data -- SAX, XSL, XPath -- you name it. While XSL might be
the obvious choice, as it is designed for extracting information from
XML documents and transforming it for presentation, XSL has a
significant learning curve if you don't already know it, and even the
simplest XSL transformations can be annoyingly complicated. XPath,
which XSL and XQuery both use for content selection, is a good
candidate for the extraction part, and you could easily use it to pull
out the data you need and then format the HTML yourself, but XQuery
makes it even easier.
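To make the XPath-only route concrete, here is a minimal sketch that uses nothing but the XPath support built into the JDK. The PAGE string is a made-up, already well-formed fragment standing in for JTidy's output, and the class and method names are mine, not part of any library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathExtraction {
    // Hypothetical, already well-formed page; a real one would come out of JTidy.
    static final String PAGE =
        "<html><body><h1>Headlines</h1>"
        + "<ul><li>First story</li><li>Second story</li></ul>"
        + "</body></html>";

    /** Evaluates an XPath expression against PAGE and returns its string value. */
    static String extract(String expression) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, dom);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Pull out two pieces of data, then format the (much smaller) HTML by hand.
        System.out.println("<p><b>" + extract("//h1") + "</b>: "
                + extract("//ul/li[2]") + "</p>");
    }
}
```

The extraction is easy enough. The hand-built output string in main() is the part that gets tedious as the output grows, and that is the part XQuery takes over.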
XQuery: A (ridiculously) brief tour
XQuery was designed for extracting data from potentially very large
XML datasets. The input dataset can be a single XML document, but it
can also be a collection of documents that have
been indexed and stored in an XML database, or even a set of tables in
a relational database. Like SQL, XQuery contains functions for
extracting, summarizing, aggregating, and joining data from multiple
datasets.
Just like presentation template languages, such as JSP, ASP, or
Velocity, XQuery combines elements from two domains -- the
presentation domain and a computational domain -- into a single
combined syntax. The result is that any XML document is already a
valid XQuery expression, which evaluates to itself. There are also
language statements, such as "for" and "let," which can be intermixed
with XML elements.
Listing 2 shows a sample XML document,
bib.xml, which represents a bibliography of books. I'll
show you a few quick XQuery expressions to give you a flavor of what
XQuery can do, and then move on to the screen-scraping examples.
Covering the syntax and use cases of XQuery could take hundreds of
pages -- see the Resources section for more
detailed reference material and examples.
Listing 2. Example XML bibliography
<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price> 65.95</price>
</book>
. . . more books . . .
</bib>
Listing 3 shows an XQuery expression that selects all books published
by Addison-Wesley after 1991, extracts their titles, and formats the
titles into a bulleted (<ul>) list. A mode switch
from "presentation mode" (data that will be passed directly to the
output, such as the <ul> and
<li> tags) to "code mode" is indicated by curly
braces; an implicit mode switch from "code mode" to "presentation
mode" occurs immediately after the return clause.
Listing 3. XQuery expression to select book titles according to query criteria
<ul>
{
for $b in doc("bib.xml")/bib/book
where $b/publisher = "Addison-Wesley" and $b/@year > 1991
return
<li>{ data($b/title) }</li>
}
</ul>
The query syntax, introduced with "for" and often called a "flower
expression" (from FLWOR, an abbreviation for
for-let-where-order by-return), selects a sequence of XML nodes from a
document (in this case, the <book> nodes that an XPath expression
picks out of the bib.xml document) and then keeps only those nodes
that match the specified query criteria
(the publisher is Addison-Wesley, and the publication year is after 1991).
For each of these nodes, it computes the expression in the return
clause, which here is a mix of markup (the <li>
tags) and code (extracting the contents of the
<title> element of each <book>
node).
This simple XQuery example illustrates several aspects of XQuery -- the
mixing of presentation and code in one document, the use of XPath, the
use of substitution (the $b references), a nontrivial query
expression, an XQuery function (data()), and the fact that the
structure of the output document need not match the structure of the
input document. That's a lot of processing power in a pretty compact
and not-so-hard-to-read query.
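For comparison, here is roughly what the same selection and formatting look like in plain Java with the JDK's XPath API. This is a sketch, not a replacement for the query: the first book comes from Listing 2, the other two are invented so the filter has something to reject, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BookTitles {
    // First book is from Listing 2; the other two are invented sample data.
    static final String BIB =
        "<bib>"
        + "<book year=\"1994\"><title>TCP/IP Illustrated</title>"
        + "<publisher>Addison-Wesley</publisher></book>"
        + "<book year=\"1990\"><title>Hypothetical Old Book</title>"
        + "<publisher>Addison-Wesley</publisher></book>"
        + "<book year=\"2000\"><title>Hypothetical Other Book</title>"
        + "<publisher>Example Press</publisher></book>"
        + "</bib>";

    /** Titles of Addison-Wesley books published after 1991 (the "for/where" part). */
    static List<String> titles() {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(BIB)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "/bib/book[publisher = 'Addison-Wesley' and @year > 1991]/title",
                    dom, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("query failed", e);
        }
    }

    /** Wraps each title in <li> tags (the "return" part); no escaping is attempted. */
    static String asBulletedList() {
        StringBuilder out = new StringBuilder("<ul>");
        for (String title : titles()) {
            out.append("<li>").append(title).append("</li>");
        }
        return out.append("</ul>").toString();
    }

    public static void main(String[] args) {
        System.out.println(asBulletedList());
    }
}
```

The where clause becomes an XPath predicate and the return clause becomes string concatenation. The Java version works, but the query and the output template now live in two different places.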
Listing 4 shows an even simpler XQuery expression, which outputs the
number of distinct publishers in the bibliography in a single <count>
element. Like the previous example, it uses an XPath expression to
select a set of nodes, and applies XQuery functions for selecting
distinct values and counting the number of nodes. It evaluates to a
number -- the number of distinct publishers in the bib.xml document.
Listing 4. XQuery expression to count distinct publishers
<count>
{
let $d := distinct-values(doc("bib.xml")/bib/book/publisher)
return count($d)
}
</count>
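To see what distinct-values() and count() are doing, here is a rough plain-Java equivalent. It assumes the publishers live at /bib/book/publisher, as in Listing 2. The sample data beyond the first book is invented, and since the XPath 1.0 engine in the JDK has no distinct-values() function, a Set does the de-duplication.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DistinctPublishers {
    // Invented sample data: three books, two distinct publishers.
    static final String BIB =
        "<bib>"
        + "<book><title>TCP/IP Illustrated</title>"
        + "<publisher>Addison-Wesley</publisher></book>"
        + "<book><title>Hypothetical Second Book</title>"
        + "<publisher>Addison-Wesley</publisher></book>"
        + "<book><title>Hypothetical Third Book</title>"
        + "<publisher>Example Press</publisher></book>"
        + "</bib>";

    /** Counts the distinct publisher names in BIB. */
    static int countDistinct() {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(BIB)));
            NodeList publishers = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/bib/book/publisher", dom, XPathConstants.NODESET);
            // A Set drops duplicates, which is the job distinct-values() does in XQuery.
            Set<String> distinct = new LinkedHashSet<>();
            for (int i = 0; i < publishers.getLength(); i++) {
                distinct.add(publishers.item(i).getTextContent());
            }
            return distinct.size();
        } catch (Exception e) {
            throw new IllegalStateException("query failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("<count>" + countDistinct() + "</count>");
    }
}
```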
These examples barely scratch the surface of the types of queries that
can be performed by XQuery -- they are intended to simply give you the
flavor of the sort of thing you can do with it, and to suggest how you
can use XQuery for transforming XML documents into the format of your
choosing. While much of its power is aimed at querying large bases of
documents or other data sources, you can use a very simple subset of
XQuery to screen-scrape HTML documents to extract the parts you want
for a variety of applications, such as displaying the relevant data on
a screen-limited device such as a cell phone, or creating a
do-it-yourself portal where data from multiple sites is aggregated and
presented.
Screen-scraping with XQuery
One of the (many) challenges of screen-scraping Web pages is that they
usually have no self-identifying structure, and their structure may
change as the site content is edited, or even as different dynamic
content (such as ad content) is interpolated into the page in
different requests. As a result, you often have to guess as to which
portions of the page correspond to the data you want to extract.
Stock prices
Let's start by extracting the current price of IBM stock from the Yahoo!
Finance page (http://finance.yahoo.com/q?s=IBM). There's a lot of
stuff on this page -- news headlines, ads, financial data -- but I
want the stock price data, which is in a table cell next to the cell
that contains "Last Trade." The query in Listing 5 selects all
<td> nodes whose text contains "Last Trade," and
for each one (I expect only one), outputs a table row containing the
contents of the following <td> node. The contents
are extracted with the data() function in the
return clause; otherwise, I'd get more than just the
text in the <td> node, I'd get all the markup,
too. (The only tricky part in the query is the text()[1]
part; what's going on here is that the text() function
matches all the text nodes within the <td> element
-- in this case there is only one, but XQuery doesn't know that -- and
so I must further tell it to select the first text node before
trying to do text matching on it.) As long as the page contains a
table cell with the text "Last Trade" in it, and the following cell
contains the stock price, then the structure of the page can change
arbitrarily without causing the query to fail.
Listing 5. XQuery expression for extracting stock quotes from Yahoo! Finance
<table>
{
for $d in //td
where contains($d/text()[1], "Last Trade")
return <tr><td> { data($d/following-sibling::td) } </td></tr>
}
</table>
Weather
Let's try another page. The Yahoo! Weather page
contains a number of portlet panels, and I want to extract the names,
temperatures, and icons for the cities listed. (The Yahoo! Weather
page, http://weather.yahoo.com, will show weather for the cities
you've selected in your My Yahoo! if you are logged into Yahoo!, or for
a sampling of major cities if you are not.) Listing 6 shows a query
that looks for the sub-panel containing the text "New York, NY" and
then navigates up to the enclosing table and selects all the rows:
Listing 6. XQuery expression for extracting weather information from Yahoo! Weather
<table>
{
for $d in //td[contains(a/small/text(), "New York, NY")]
for $row in $d/parent::tr/parent::table/tr
where contains($d/a/small/text()[1], "New York")
return <tr><td>{data($row/td[1])}</td>
<td>{data($row/td[2])}</td>
<td>{$row/td[3]//img}</td> </tr>
}
</table>
Then, for each row, it extracts the three relevant data columns -- city
name, temperature, and icon -- and outputs a simpler table containing
only this information. The result is a compact display of the weather
information for the cities you care about, suitable for display on a
small screen. The results are shown below:
| Chicago, IL | 49...63 F | [icon] |
| London, UK | 32...41 F | [icon] |
| New York, NY | 36...44 F | [icon] |
| San Francisco, CA | 52...67 F | [icon] |
This query is not quite as robust as the query in Listing 5. It assumes that the text "New York, NY"
will be inside a small element (which is the sort of markup that could
easily change the next time Yahoo! redesigns their pages). Also, "New
York, NY" could easily appear more than once on a page devoted to
weather. However, these elements of risk can be mitigated by spending
more effort developing the queries; as with many development options,
there is a tradeoff between query complexity and query stability.
The queries shown in Listing 5 and Listing 6 are not the only way these queries could
be cast. Using a more complicated XPath syntax, the two
for clauses in Listing 6 could be
folded into a single XPath expression, and the entirety of Listing 5 could be cast as an XPath expression
instead of using the FLWOR syntax. If you are an XPath guru, you will
probably find it easier to use a more XPath-oriented approach, whereas
those with more SQL experience will probably find the FLWOR syntax
more appealing.
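As an illustration of the XPath-only style, here is one way the stock-quote query might look as a single XPath expression, evaluated with the JDK's built-in XPath engine. The page fragment and the price in it are made up, the class name is hypothetical, and I have added a [1] to the following-sibling step so that exactly one cell is selected, which is slightly stricter than Listing 5.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LastTradeXPath {
    // Made-up, simplified stand-in for a tidied quote page; the price is invented.
    static final String PAGE =
        "<html><body><table>"
        + "<tr><td>Last Trade:</td><td>91.20</td></tr>"
        + "<tr><td>Volume:</td><td>4,500,000</td></tr>"
        + "</table></body></html>";

    // The FLWOR's "for" and "where" collapse into a predicate,
    // and its "return" into the final location step.
    static final String QUERY =
        "//td[contains(text()[1], 'Last Trade')]/following-sibling::td[1]";

    /** Returns the text of the cell that follows the "Last Trade" cell. */
    static String lastTrade() {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
            return XPathFactory.newInstance().newXPath().evaluate(QUERY, dom);
        } catch (Exception e) {
            throw new IllegalStateException("query failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lastTrade());
    }
}
```

What the XPath form gives up is the output template: the surrounding table markup now has to be produced by the calling code.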
Tools
A remarkably small amount of code is needed to execute XQuery
expressions against HTML pages. The JTidy library can be used to
clean up an HTML document and represent it as a DOM object (see Listing 1). The Saxon XQuery engine was used to
compile and execute the query against the DOM object of the document.
Compiling and executing an XQuery expression against a DOM
representation of a document requires only six lines of code, as shown
in Listing 7:
Listing 7. Code to compile and execute an XQuery expression with Saxon
Configuration c = new Configuration();
StaticQueryContext qp = new StaticQueryContext(c);
XQueryExpression xe = qp.compileQuery(query);
DynamicQueryContext dqc = new DynamicQueryContext(c);
dqc.setContextNode(new DocumentWrapper(tidyDOM, url, c));
List result = xe.evaluate(dqc);
The result of the query evaluation is a List of DOM
Elements, and you can use your favorite DOM manipulation
technique (OK, your least-unfavorite DOM manipulation technique) to
turn the query results into a document.
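One least-unfavorite technique is the identity transformer that ships with the JDK. The sketch below assumes, as described above, that each result is available as a DOM node. The row built in demo() is a hand-made stand-in for a real query result, and the class and method names are hypothetical.

```java
import java.io.StringWriter;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class ResultSerializer {
    /** Serializes each node as markup, without an XML declaration, into one string. */
    static String serialize(List<? extends Node> nodes) {
        try {
            // A Transformer created with no stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            for (Node node : nodes) {
                identity.transform(new DOMSource(node), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    /** Builds a stand-in result (one table row with an invented value) and serializes it. */
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element row = doc.createElement("tr");
            Element cell = doc.createElement("td");
            cell.setTextContent("91.20");
            row.appendChild(cell);
            return serialize(Collections.singletonList(row));
        } catch (Exception e) {
            throw new IllegalStateException("demo failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the query engine hands back its own node type rather than DOM nodes, the loop stays the same and only the Source passed to transform() changes.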
Lots of other implementations of XQuery are available, some free, some
commercial -- see Resources for some places to
look.
Summary
While XQuery was designed for querying large document bases, it serves
as a fine tool for transforming simple documents as well. Whether
simplifying complex pages for display on small screens, or extracting
elements from multiple pages to aggregate them together on a
home-grown portal, or simply extracting data from Web pages because
there's no other programmatic way to get the data, XQuery offers a
relatively easy way to scrape HTML pages for the data you need.
Resources
- Participate in the discussion forum.
- Howard Katz's An introduction to XQuery (developerWorks, June 2001) covers the basics and history of the XQuery standardization effort.
- The tutorial Process XML using XML Query (developerWorks, September 2002), by Nicholas Chase, dives deeper into the uses and syntax of XQuery.
- You can read about Sam Pullara's cell phone in his blog.
- Download JTidy from its home on SourceForge.
- Check out the Saxon XQuery and XSL implementation.
- You can try out the free community edition of the Mark Logic server, a content database which lets you search large document bases with XQuery.
- The official specifications for XQuery can be downloaded from the W3C site; this page also hosts a list of XQuery implementations.
- This set of slides from an XQuery tutorial offers a lot of good examples of what XQuery is good for and how to use it.
- To learn more about Java technology, visit the developerWorks Java zone. You'll find technical documentation, how-to articles, education, downloads, product information, and more.
- To learn more about XML, visit the developerWorks XML zone. As with the Java zone, you'll find technical documentation, how-to articles, education, downloads, product information, and more.
- Visit the New to Java technology site for the latest resources to help you get started with Java programming.
- Get involved in the developerWorks community by participating in developerWorks blogs.
- Browse for books on these and other technical topics.
About the author
Brian Goetz has been a professional software developer for over 18 years. He is a Principal Consultant at Quiotix, a software development and consulting firm located in Los Altos, California, and he serves on several JCP Expert Groups. See Brian's published and upcoming articles in popular industry publications.