Level: Intermediate
Nicholas Chase (nicholas@nicholaschase.com), President, Chase and Chase, Inc.
23 Sep 2003
In this article, Nick shows you how to retrieve syndicated content and convert it into headlines for your site. Since no official format for such feeds exists, aggregators are often faced with the difficulty of supporting multiple formats, so Nick also explains how to use XSL transformations to more easily deal with multiple syndication file formats.
With the popularization of weblogging, information overload is worse than ever.
Readers now have more sites than ever to keep up with, and visiting all of them
on a regular basis is next to impossible. Part of the problem can be solved through
the syndication of content, in which a site makes its headlines and basic information
available in a separate feed. Today, most of these feeds use an XML format called
RSS, though there are variations in its use and even a potential competing format.
This article explains how to use Java technology to retrieve the content of a syndicated feed,
determine its type, and then transform it into HTML and display it on a Web site.
This process involves five steps:
- Retrieve the XML feed
- Analyze the feed
- Determine the proper transformation
- Perform the transformation
- Display the result
This article chronicles the creation of a JavaServer Pages (JSP) page that retrieves
a remote feed, transforms it using a Java bean and XSLT, and then incorporates
the newly transformed information into its own output. The concepts, however, apply to
virtually any Web environment.
The source file
Depending on whom you ask, RSS stands for RDF Site Summary, Rich Site Summary,
or other acronyms that are less tactful. In any case, no fewer than
four versions of RSS are in common usage, from the fairly simple 0.91,
which doesn't include namespaces and imposes some strict limits on content, to
version 2.0, which encompasses versions back to 0.91 (so a valid 0.91
file is also a valid 2.0 file) but also allows the use of namespaces. By allowing
namespaces, version 2.0 makes it possible for a syndicator to
add elements to the feed, as long as they're in a different namespace.
Some syndicators use this capability to add information using the Resource Description
Framework (RDF).
A simple RSS 2.0 file might look like this feed from Adam Curry's weblog (see Resources):
Listing 1. A sample RSS 2.0 message
<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Adam Curry: Adam Curry's Weblog</title>
<link>http://www.blognewsnetwork.com/members/0000001/</link>
<description>News and Views from Adam Curry</description>
<language>en-us</language>
<copyright>Copyright 2003 Adam Curry</copyright>
<lastBuildDate>Thu, 24 Jul 2003 09:26:48 GMT</lastBuildDate>
<docs>http://backend.userland.com/rss</docs>
<generator>Radio UserLand v8.0.9b2</generator>
<managingEditor>adam@curry.com</managingEditor>
<webMaster>adam@curry.com</webMaster>
<item>
<title>weblog at work again</title>
<link>
http://www.blognewsnetwork.com/members/0000001/2003/07/24.html#a4158
</link>
<description>&lt;a href="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg"&gt;
&lt;img src="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg"
width="250" height="187.5" border="0" align="right" hspace="15" vspace="5"
alt="A picture named adamwheely.jpg"&gt;&lt;/a&gt;A few days ago I asked if anyone had
taken pictures of me at the annual ...</description>
<guid>
http://www.blognewsnetwork.com/members/0000001/2003/07/24.html#a4158
</guid>
<pubDate>Thu, 24 Jul 2003 09:21:25 GMT</pubDate>
</item>
<item>
<title>teens trouble with web</title>
<link>
http://www.blognewsnetwork.com/members/0000001/2003/07/23.html#a4156
</link>
<description>According to a report from Northumbria University, most teenagers
lack the &lt;a href="http://www.web-user.co.uk/news/news.php?id=33621"&gt;information
gathering skills&lt;/a&gt; needed for using the internet efficiently. This
sounds like it shouldn't be happening in ...</description>
<guid>
http://www.blognewsnetwork.com/members/0000001/2003/07/23.html#a4156
</guid>
<pubDate>Wed, 23 Jul 2003 17:36:23 GMT</pubDate>
</item>
...
</channel>
</rss>
To turn this feed into HTML, you can process it using XSL transformations.
The primary stylesheet
The ultimate goal is to generate HTML text that shows the information in an organized way, such as a list of links, included in the body of another page of information. The actual HTML output would be something like:
Listing 2. The output HTML
<h2>Adam Curry: Adam Curry's Weblog</h2>
<h3>News and Views from Adam Curry</h3>
<ul>
<li>
<a href="http://www.blognewsnetwork.com/members/0000001/2003/07/24.html#a4158">weblog
at work again</a>
<p><a href="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg">
<img src="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg"
width="250" height="187.5" border="0" align="right" hspace="15" vspace="5" alt="A
picture named adamwheely.jpg"></a>A few days ago I asked if anyone had taken
pictures of me at the annual ...</p>
</li>
<li>
<a href="http://www.blognewsnetwork.com/members/0000001/2003/07/23.html#a4156">teens
trouble with web</a>
<p>According to a report from Northumbria University, most teenagers lack the
<a href="http://www.web-user.co.uk/news/news.php?id=33621">information gathering
skills</a> needed for using the internet efficiently. This sounds like it
shouldn't be happening in ...</p>
</li>
...
</ul>
To create this HTML out of the XML, you'll need an XSLT stylesheet:
Listing 3. The simple stylesheet
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>

  <!-- Emit the channel heading, then one list entry per item -->
  <xsl:template match="/">
    <xsl:apply-templates select="//channel"/>
    <ul>
      <xsl:apply-templates select="//item"/>
    </ul>
  </xsl:template>

  <!-- In RSS 0.91 and 2.0, image is a child of channel -->
  <xsl:template match="channel">
    <xsl:apply-templates select="image"/>
    <h2><xsl:value-of select="title"/></h2>
    <h3><xsl:value-of select="description"/></h3>
  </xsl:template>

  <xsl:template match="item">
    <li>
      <xsl:element name="a">
        <xsl:attribute name="href"><xsl:value-of select="link"/></xsl:attribute>
        <xsl:value-of select="title" />
      </xsl:element>
      <!-- The description holds escaped HTML, so write it out unescaped -->
      <p><xsl:value-of disable-output-escaping="yes" select="description" /></p>
    </li>
  </xsl:template>

  <xsl:template match="image">
    <xsl:element name="img">
      <xsl:attribute name="src"><xsl:value-of select="url"/></xsl:attribute>
      <xsl:attribute name="style">float:left; padding: 10px;</xsl:attribute>
    </xsl:element>
  </xsl:template>

  <xsl:template match="language">
  </xsl:template>
</xsl:stylesheet>
The actual form of the page is entirely up to you, as is the data that you choose
to include. In this case, you're simply creating a bulleted list of entries, with
a title (if there is one) that links back to the original post and the
description for each post.
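The disable-output-escaping attribute in the item template is what lets markup that was escaped inside the feed reach the page as real HTML. The following is a minimal sketch of that effect, not part of the article's application. It assumes the XSLT processor bundled with the JDK (the attribute is an optional feature that a processor is allowed to ignore), and the class name, the one-item feed, and the tiny stylesheets are all made up for illustration:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; the feed and stylesheets are invented for illustration.
class EscapingDemo {

    // A one-item feed whose description carries escaped HTML.
    static final String FEED =
        "<rss version=\"2.0\"><channel><item>"
      + "<description>&lt;b&gt;bold&lt;/b&gt; text</description>"
      + "</item></channel></rss>";

    // Builds a tiny stylesheet that prints the description, with or
    // without disable-output-escaping on the value-of instruction.
    static String stylesheet(boolean disableEscaping) {
        String doe = disableEscaping ? " disable-output-escaping=\"yes\"" : "";
        return "<xsl:stylesheet version=\"1.0\""
             + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
             + "<xsl:output method=\"html\"/>"
             + "<xsl:template match=\"/\">"
             + "<p><xsl:value-of" + doe + " select=\"//item/description\"/></p>"
             + "</xsl:template></xsl:stylesheet>";
    }

    // Applies the given stylesheet text to the given XML text, all in memory.
    static String transform(String xml, String xsl) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(FEED, stylesheet(false))); // markup stays escaped
        System.out.println(transform(FEED, stylesheet(true)));  // markup becomes live HTML
    }
}
```

Because the unescaped text is written straight into the page, this technique is only appropriate for feeds whose markup you are willing to trust.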
To actually perform the transformation, you need to create a JSP page.
The basic JSP page
There are many ways to transform XML data. In this article, I'll show you how to
create a JSP page that passes a feed to a Java bean for transformation. That bean creates
a static file, and the JSP page incorporates it into the body of the page. (The reason for
the static file will become clearer in the caching section below.)
The page itself is fairly straightforward:
Listing 4. The JSP page
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<jsp:useBean id="rssBean" scope="request" class="RSSProcessor">
<%
    rssBean.setRSSFile(
        "http://wolk.datashed.net/users/adam@curry.com/curryCom.xml");
%>
</jsp:useBean>
<html>
  <head>
    <title>Syndicated Feeds</title>
  </head>
  <body>
    <jsp:include page="headlines.html" flush="true"/>
  </body>
</html>
Here you're simply creating an instance of the RSSProcessor
class. Because the call to setRSSFile() sits inside the useBean element,
it executes only when the bean object is created.
This method creates the headlines.html page that the JSP page then
incorporates into the output.
Next, create the bean to do the transformation.
Transforming the file
The Java bean is nothing more than a Java class that has get and set methods.
In this case, the set method, setRSSFile(), also includes
code that performs a transformation on that file:
Listing 5. Transforming the feed
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.io.FileOutputStream;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Transformer;

public class RSSProcessor {

    String _RSSFile;

    public RSSProcessor(){ }

    public String getRSSFile(){
        return _RSSFile;
    }

    public void setRSSFile(String fileName){
        // Remember the feed location so that getRSSFile() can report it
        _RSSFile = fileName;
        try {
            // The source is the remote feed; the stylesheet is local
            StreamSource source = new StreamSource(fileName);
            StreamSource finalStyle = new StreamSource("final.xsl");

            // The result is the static file that the JSP page includes
            String outputURL = "headlines.html";
            StreamResult result = new StreamResult(new
                FileOutputStream(outputURL));

            TransformerFactory transFactory = TransformerFactory.newInstance();
            Transformer transformer = transFactory.newTransformer(finalStyle);
            transformer.transform(source, result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
This method simply takes an input source, which happens to be a remote RSS feed,
and transforms it, using the final.xsl stylesheet, to
the headlines.html file.
In the grand scheme of things, that's it: Retrieve the file, transform it, and
display the results. In reality, there are other issues to consider.
Adjusting for multiple formats
If all RSS files were like this sample, you wouldn't need to do anything else.
Unfortunately, this is not the case. Different vendors and toolkits can produce
additional information, or can replace core information with RDF information
or other namespaced modules,
leading to complaints that supporting RSS is complex because of all the
variations. But with the use of XSL transformations, it doesn't have to be that way.
For example, an RSS 2.0 feed might also contain RDF information, like this feed
from Typographica:
Listing 6. Excerpt from sample RSS 2.0 message with RDF
<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<title>Typographica</title>
<link>http://typographi.ca/</link>
<description>A daily journal of typography featuring news, observations,
and open commentary on fonts and typographic design.</description>
<dc:language>en-us</dc:language>
<dc:creator>Stephen Coles</dc:creator>
<dc:rights>Copyright 2003</dc:rights>
<dc:date>2003-07-24T00:00:52-08:00</dc:date>
<admin:generatorAgent rdf:resource="http://www.movabletype.org/?v=2.63" />
<admin:errorReportsTo rdf:resource="mailto:scoles@gomakecontact.com" />
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
<item>
<title>Hot and Cold Fonts</title>
<link>http://typographi.ca/000643.php</link>
<description>LettError have developed a multiple master font
for the Design Institute of the University of Minnesota that varies
along three...</description>
<guid isPermaLink="false">643@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p><a href="http://www.letterror.com/">
LettError</a> have developed a multiple master font for the
<a href="http://design.umn.edu/">Design Institute</a> of the University of
Minnesota that varies along three dimensions: formality, informality, and
"weirdness." (It's apparently possible to be 100% formal and 100% informal at
the same time.) As the New York Times...]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-24T00:00:52-08:00</dc:date>
</item>
<item>
<title>Textura Digita</title>
<link>http://typographi.ca/000642.php</link>
<description>CNN reports that the Gutenberg Bible is now available
on the web via the Ransom Center at the University of...</description>
<guid isPermaLink="false">642@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p><a href=
"http://www.cnn.com/2003/TECH/internet/07/23/digital.scripture.ap/index.html">
CNN reports</a> that the Gutenberg Bible is now available on the web via the
<a href="http://www.hrc.utexas.edu/exhibitions/permanent/gutenberg/">Ransom
Center</a> at the University of Texas.</p>
...]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-23T13:16:15-08:00</dc:date>
</item>
<item>
<title>Fight! Fight! Fight!</title>
<link>http://typographi.ca/000640.php</link>
<description>Angry because you had to miss TypeCon ’03?
Work out that aggression with Helvetica vs. Arial....</description>
<guid isPermaLink="false">640@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p>Angry because you had to miss
<a href="http://www.typecon2003.com/">TypeCon ’03</a>? Work out that
aggression with <a href="http://www.engagestudio.com/helvetica/">Helvetica vs.
Arial</a>.</p>]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-22T08:52:36-08:00</dc:date>
</item>
...
</channel>
</rss>
Notice that this feed actually contains two different descriptions of the content.
The first is in the description element, and the second is
in the encoded element, which is part of the http://purl.org/rss/1.0/modules/content/
namespace. Here you see the difference in how different feeds handle information.
Adam Curry's blog simply escapes markup such as links and drops it into
the description element, whereas Typographica (or rather the toolkit that produces
Typographica's feed) provides a non-markup version in the description element and a
full version in the encoded element using a CDATA construct.
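To make the difference concrete, here is a rough sketch that reads an item with the DOM API and prefers the full version when one is present. This is not the approach the article takes (it does the same job in XSLT), and the class and method names are hypothetical. Note that the encoded element can only be found by namespace if the parser is namespace-aware:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; the names are illustrative only.
class ItemBodyDemo {

    static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";

    // Returns the text of content:encoded if the item has one,
    // otherwise the text of the plain description element.
    static String bestBody(Element item) {
        NodeList full = item.getElementsByTagNameNS(CONTENT_NS, "encoded");
        if (full.getLength() > 0) {
            return full.item(0).getTextContent();
        }
        NodeList plain = item.getElementsByTagName("description");
        return plain.getLength() > 0 ? plain.item(0).getTextContent() : "";
    }

    // Parses feed text with a namespace-aware parser and returns its first item.
    static Element firstItem(String feedXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(feedXml)));
        return (Element) doc.getElementsByTagName("item").item(0);
    }

    public static void main(String[] args) throws Exception {
        String feed =
            "<rss version=\"2.0\" xmlns:content=\"" + CONTENT_NS + "\"><channel><item>"
          + "<description>Short</description>"
          + "<content:encoded><![CDATA[<p>Full</p>]]></content:encoded>"
          + "</item></channel></rss>";
        System.out.println(bestBody(firstItem(feed)));
    }
}
```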
Although it is preferable to create a custom presentation
for each feed type in order to take advantage of any extra information, this is
not always practical from an application development standpoint. But that doesn't mean
you have to give up. Instead, you can create a transformation that simply takes
different feeds and converts them to a standard structure, which you can then
feed to the final transformation.
For example, you can create a stylesheet that takes an RSS 2.0 feed
and, if it finds an encoded element, uses it to replace
any description element:
Listing 7. Transforming RDF information
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    exclude-result-prefixes="content" version="1.0">

  <xsl:template match="/">
    <rss>
      <channel>
        <xsl:apply-templates select="rss/channel" />
      </channel>
    </rss>
  </xsl:template>

  <!-- Copy the elements that the final stylesheet needs -->
  <xsl:template match="title|link|/rss/channel/description|image|text()">
    <xsl:copy-of select="." />
  </xsl:template>

  <xsl:template match="item" >
    <item>
      <title><xsl:value-of select="title" /></title>
      <link><xsl:value-of select="link" /></link>
      <description><xsl:value-of select="description" /></description>
    </item>
  </xsl:template>

  <!-- The encoded element lives in the content namespace, so it must be
       matched with its prefix. This more specific template takes
       precedence over the plain item template when both match. -->
  <xsl:template match="item[content:encoded]" >
    <item>
      <title><xsl:value-of select="title" /></title>
      <link><xsl:value-of select="link" /></link>
      <description><xsl:value-of select="content:encoded" /></description>
    </item>
  </xsl:template>
</xsl:stylesheet>
This stylesheet makes copies of the elements that the final stylesheet
will need, such as the channel's title and description, and makes a copy of the
item with the appropriate description information.
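One way to check a normalizing stylesheet like this is to run it against a small feed in memory and inspect the result. The sketch below is an assumption-laden illustration: the class name, the sample feed, and the cut-down stylesheet (item templates only) are invented here, and the encoded element is matched through a content prefix bound to its namespace, which is what XPath requires for namespaced elements:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical demo class; the stylesheet is a reduced stand-in for 2.0.xsl.
class NormalizeDemo {

    static final String XSL = String.join("\n",
        "<xsl:stylesheet version='1.0'",
        "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'",
        "    xmlns:content='http://purl.org/rss/1.0/modules/content/'",
        "    exclude-result-prefixes='content'>",
        "  <xsl:template match='/'>",
        "    <rss><channel><xsl:apply-templates select='rss/channel/item'/></channel></rss>",
        "  </xsl:template>",
        "  <xsl:template match='item'>",
        "    <item><description><xsl:value-of select='description'/></description></item>",
        "  </xsl:template>",
        "  <xsl:template match='item[content:encoded]'>",
        "    <item><description><xsl:value-of select='content:encoded'/></description></item>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    // Runs the stylesheet over the feed text and returns the normalized document.
    static Document normalize(String feedXml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        DOMResult result = new DOMResult();
        t.transform(new StreamSource(new StringReader(feedXml)), result);
        return (Document) result.getNode();
    }

    // Collects the text of every description element in document order.
    static List<String> descriptions(Document doc) {
        List<String> out = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("description");
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String feed =
            "<rss version='2.0' xmlns:content='http://purl.org/rss/1.0/modules/content/'>"
          + "<channel>"
          + "<item><description>Short one</description>"
          + "<content:encoded><![CDATA[<p>Full one</p>]]></content:encoded></item>"
          + "<item><description>Short two</description></item>"
          + "</channel></rss>";
        System.out.println(descriptions(normalize(feed)));
    }
}
```

The first item ends up with the full text as its description, while the second keeps its original description because it has no encoded element.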
Now you just have to weave that new document into the final transformation:
Listing 8. Chaining the transformation
...
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.dom.DOMResult;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class RSSProcessor {
...
    public void setRSSFile(String fileName){
        try {
            // First pass: normalize the feed with the interim stylesheet
            StreamSource interimSource = new StreamSource(fileName);
            String XSLSheetName = "2.0.xsl";
            StreamSource style = new StreamSource(XSLSheetName);

            // The interim result is held in memory as a DOM Document
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document interimDoc = db.newDocument();
            DOMResult interimResult = new DOMResult(interimDoc);

            TransformerFactory transFactory = TransformerFactory.newInstance();
            Transformer interimTransformer = transFactory.newTransformer(style);
            interimTransformer.transform(interimSource, interimResult);

            // Second pass: turn the normalized document into HTML
            DOMSource source = new DOMSource(interimDoc);
            StreamSource finalStyle = new StreamSource("final.xsl");
            String outputURL = "headlines.html";
            StreamResult result = new StreamResult(new
                FileOutputStream(outputURL));
            Transformer transformer = transFactory.newTransformer(finalStyle);
            transformer.transform(source, result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Take a look at this one step at a time. First of all, you're creating an interim
transformation that takes the initial feed and transforms it according to the
interim stylesheet in Listing 7, named 2.0.xsl.
The result of this first transformation goes not to a file, but to a DOM
Document
object, which then gets passed as the source for the second transformation.
The name of the interim stylesheet, 2.0.xsl,
was deliberate. By naming it after the version, you can create a more flexible system.
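The hand-off through a DOM Document can be tried out in isolation. The following sketch chains two transformations entirely in memory, so it needs neither a remote feed nor stylesheet files. Everything specific in it is hypothetical: the class name, the made-up input format with entry and name elements, and both stylesheets exist only to show the DOMResult to DOMSource hand-off.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; input format and stylesheets are invented.
class ChainDemo {

    static final String XSL_NS = "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'";

    // Stage 1: convert the made-up format into RSS-like items.
    static final String FIRST_XSL =
        "<xsl:stylesheet version='1.0' " + XSL_NS + ">"
      + "<xsl:template match='/'><rss><channel>"
      + "<xsl:for-each select='feed/entry'>"
      + "<item><title><xsl:value-of select='name'/></title></item>"
      + "</xsl:for-each></channel></rss></xsl:template>"
      + "</xsl:stylesheet>";

    // Stage 2: render the items as an HTML list.
    static final String SECOND_XSL =
        "<xsl:stylesheet version='1.0' " + XSL_NS + ">"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/'><ul>"
      + "<xsl:for-each select='//item'><li><xsl:value-of select='title'/></li></xsl:for-each>"
      + "</ul></xsl:template>"
      + "</xsl:stylesheet>";

    // Runs firstXsl over the input, keeps the result as a DOM tree,
    // then feeds that tree to secondXsl and returns the final text.
    static String chain(String xml, String firstXsl, String secondXsl) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();

        DOMResult interim = new DOMResult();
        factory.newTransformer(new StreamSource(new StringReader(firstXsl)))
               .transform(new StreamSource(new StringReader(xml)), interim);

        StringWriter out = new StringWriter();
        factory.newTransformer(new StreamSource(new StringReader(secondXsl)))
               .transform(new DOMSource(interim.getNode()), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String input = "<feed><entry><name>First</name></entry>"
                     + "<entry><name>Second</name></entry></feed>";
        System.out.println(chain(input, FIRST_XSL, SECOND_XSL));
    }
}
```

Keeping the interim result in memory avoids writing and re-parsing a temporary file between the two stages.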
Choosing a version
As long as you're allowing for different formats, you can actually create a system
that checks for the feed version before processing it. After all, only RSS 1.0 and 2.0
feeds can have RDF elements, so there's no need to process other feeds. But how can you
tell which stylesheet to apply?
To solve this problem, you can load the actual feed, analyze it, and use the
information to set the proper stylesheet.
Listing 9. Choosing a stylesheet
...
import org.xml.sax.InputSource;
import org.w3c.dom.Element;

public class RSSProcessor {
...
    public void setRSSFile(String fileName){
        try {
            // Parse the feed into a DOM Document so it can be inspected.
            // The parser must be namespace-aware so that the stylesheets
            // see prefixed elements such as content:encoded correctly.
            InputSource docFile = new InputSource (fileName);
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document inputDoc = db.parse(docFile);

            // Work out the version from the root element
            Element rss = inputDoc.getDocumentElement();
            String version = null;
            if (rss.getNodeName().equals("rss")){
                version = rss.getAttribute("version");
                // getAttribute() returns an empty string, not null,
                // when the attribute is missing
                if (version == null || version.length() == 0) {
                    version = "0.91";
                }
            } else if (rss.getNodeName().equals("feed")){
                version = "echo";
            } else {
                throw new IllegalArgumentException(
                    "Unrecognized feed type: " + rss.getNodeName());
            }

            // The stylesheet is named after the version
            String XSLSheetName = version+".xsl";
            StreamSource style = new StreamSource(XSLSheetName);
            DOMSource interimSource = new DOMSource(inputDoc);
            Document interimDoc = db.newDocument();
            DOMResult interimResult = new DOMResult(interimDoc);

            TransformerFactory transFactory = TransformerFactory.newInstance();
            Transformer interimTransformer = null;
            if (version.equals("0.91")){
                // No stylesheet: an identity transform
                interimTransformer = transFactory.newTransformer();
            } else {
                interimTransformer = transFactory.newTransformer(style);
            }
            interimTransformer.transform(interimSource, interimResult);

            DOMSource source = new DOMSource(interimDoc);
            StreamSource finalStyle = new StreamSource("final.xsl");
            String outputURL = "headlines.html";
            StreamResult result = new StreamResult(new
                FileOutputStream(outputURL));
            Transformer transformer = transFactory.newTransformer(finalStyle);
            transformer.transform(source, result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
In this case, you're loading the feed and checking it for the RSS version,
and then using the version number as the file name. The advantage here is that should
a new version of RSS be released, you can extend the application by simply
adding a new stylesheet. Notice that I've added a check for Echo, or
Atom, or whatever RSS's competitor might eventually be called, and that you
can also adjust support for it as it changes by simply changing the
echo.xsl stylesheet.
A further advantage is that this interim stylesheet is completely generic.
A "2.0 - .91" stylesheet will work for anyone, anywhere, and you can make changes
to the final output by simply editing final.xsl, whether you support
one version or a hundred.
The final.xsl stylesheet is designed for a simple
0.91-style feed, so if you're dealing with one, you'll omit the stylesheet on the
interim transformation. This creates an identity transform, in which the
document is simply passed along as-is.
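The version check is easy to isolate and test on its own. Here is a minimal sketch, assuming a hypothetical FeedVersion class that is not part of the article's bean. It relies on the documented behavior that Element.getAttribute() returns an empty string when the attribute is absent, and rejecting an unknown root element is a design choice made for this sketch rather than something the article prescribes:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper; the class and method names are illustrative only.
class FeedVersion {

    // Returns the version label used to pick a stylesheet:
    // the rss version attribute, "0.91" if it is missing, or "echo" for a feed root.
    static String detect(Document feed) {
        Element root = feed.getDocumentElement();
        String name = root.getNodeName();
        if (name.equals("rss")) {
            String version = root.getAttribute("version"); // "" when absent
            return version.isEmpty() ? "0.91" : version;
        }
        if (name.equals("feed")) {
            return "echo";
        }
        // Assumption of this sketch: fail loudly on anything unrecognized.
        throw new IllegalArgumentException("Unrecognized feed root: " + name);
    }

    // Parses feed text with a namespace-aware parser.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(detect(parse("<rss version=\"2.0\"><channel/></rss>")) + ".xsl");
        System.out.println(detect(parse("<rss><channel/></rss>")) + ".xsl");
        System.out.println(detect(parse("<feed/>")) + ".xsl");
    }
}
```

Run as shown, this prints 2.0.xsl, 0.91.xsl, and echo.xsl, matching the file-naming convention described above.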
That takes care of the problem of multiple versions, but you have one more
issue to deal with: many visitors requesting the same feed.
Caching the feed
This system would work fine on a personal server where you're the only one
accessing it, but in the real world, it would be impractical (and rude) to pull the
feed every time someone wants to read it. Instead, you need to build the system with
some sort of time delay, so if the feed's been pulled recently, the existing headlines.html file is used.
To do that, you can take advantage of the way static variables work in Java. A
static variable that represents the last
time the feed was pulled is shared by all instances of the RSSProcessor
class, so you can check the current time against it before actually pulling the feed:
Listing 10. Caching the feed
import java.util.Date;

public class RSSProcessor {
...
    // Shared by every instance of the class
    static Date _LastUpdated = new Date();

    public Date getLastUpdated(){
        return _LastUpdated;
    }

    public void setRSSFile(String fileName){
        Date now = new Date();
        // Elapsed time in milliseconds; the interval is in minutes
        long diff = now.getTime() - _LastUpdated.getTime();
        double interval = .5;
        if ((diff == 0) || (diff > (interval * 60 * 1000))){
            _LastUpdated = now;
            try {
                InputSource docFile = new InputSource (fileName);
...
                Transformer transformer = transFactory.newTransformer(finalStyle);
                transformer.transform(source, result);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
The first time the server instantiates RSSProcessor,
_LastUpdated
gets initialized with the current date. At (essentially) the same time, the server executes
the setRSSFile() method, and because the difference between the
current time and the _LastUpdated time is zero, the transformation
takes place.
The next time someone calls the page, a new instance of RSSProcessor is created, but because _LastUpdated is static, the new
instance sees the existing value of _LastUpdated rather than
initializing it. The interval is measured in minutes, with the difference between
_LastUpdated and the current time measured in milliseconds.
If the amount of time that has elapsed is less than the interval, nothing else happens.
The headlines.html file isn't updated, so the server uses the old one instead.
If, on the other hand, the interval has passed, _LastUpdated
gets the current time, which is passed on to any subsequent RSSProcessor
objects, and the bean pulls a new copy of the feed to transform.
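Listing 10 depends on the very first call arriving in the same millisecond in which the class was initialized, so that the difference is exactly zero. If you would rather not rely on that timing, one possible variation is to track explicitly whether the feed has ever been fetched. The sketch below is an assumption, not the article's code: RefreshGate is a hypothetical class, the current time is passed in so the logic can be tested, and the method is synchronized so that simultaneous requests do not both decide to refresh.

```java
// Hypothetical helper; a variation on the timing check in Listing 10.
class RefreshGate {

    private final long intervalMillis;
    private boolean fetchedBefore = false;
    private long lastUpdatedMillis;

    // The interval is given in minutes, as in Listing 10.
    RefreshGate(double intervalMinutes) {
        this.intervalMillis = (long) (intervalMinutes * 60 * 1000);
    }

    // Returns true if the caller should pull the feed now, and records
    // the time of that decision. Returns false while the cached copy is fresh.
    synchronized boolean shouldRefresh(long nowMillis) {
        if (!fetchedBefore || nowMillis - lastUpdatedMillis > intervalMillis) {
            fetchedBefore = true;
            lastUpdatedMillis = nowMillis;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RefreshGate gate = new RefreshGate(0.5);
        System.out.println(gate.shouldRefresh(System.currentTimeMillis())); // first call
        System.out.println(gate.shouldRefresh(System.currentTimeMillis())); // within interval
    }
}
```

To share the decision across requests, a single RefreshGate would be held in a static field of the bean, in the same way that _LastUpdated is shared in Listing 10.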
Conclusion
In this article, I've shown you how to create a syndicated feed reader that retrieves
a single remote feed, transforms it using XSLT, and displays it as part of a Web page.
The system can also adapt to multiple feed types through the use of XSLT stylesheets.
The application uses a DOM Document to analyze
the feed and determine the appropriate stylesheet, but you can further extend
it by moving some of that logic into an external stylesheet. You can also adapt the
system so that it can pull more than one feed, perhaps based on a user selection,
with each one creating its own cached file. Similarly, you can enable the user to
determine the interval between feed retrievals.
Resources
About the author
Nicholas Chase, a Studio B author, has been involved in Web site development for companies such as Lucent Technologies, Sun Microsystems, Oracle, and the Tampa Bay Buccaneers. Nick has been a high school physics teacher, a low-level radioactive waste facility manager, an online science fiction magazine editor, a multimedia engineer, and an Oracle instructor. More recently, he was the Chief Technology Officer of Site Dynamics Interactive Communications in Clearwater, Florida, USA, and is the author of four books on Web development, including XML Primer Plus (Sams). He loves to hear from readers and can be reached at nicholas@nicholaschase.com.