Build a Web spider on Linux

A simple spider and scraper collects Internet content

Level: Intermediate

M. Tim Jones (mtj@mtjones.com), Consultant Engineer, Emulex

14 Nov 2006

Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. Using common scripting languages and their collection of Web modules, you can easily develop Web spiders. This article shows you how to build spiders and scrapers for Linux® to crawl a Web site and gather information (stock data, in this case).

A spider is a program that crawls the Internet in a specific way for a specific purpose. The purpose could be to gather information or to understand the structure and validity of a Web site. Spiders are the basis for modern search engines, such as Google and AltaVista. These spiders automatically retrieve data from the Web and pass it on to other applications that index the contents of the Web site for the best set of search terms.

Web spiders as agents

Web spiders and scrapers are simply another form of software robot or agent (as coined by Alan Kay in the early 1980s). Alan's idea of an agent was as a proxy for the user in the computer's world. The agent could be given a goal and work towards that goal in its domain. If it got stuck, it could request advice from the user and continue on to fulfill its goal.

Today, agents are classified with attributes such as autonomy, adaptiveness, communication, and collaboration with other agents. Other attributes, such as agent mobility and even personality, are goals of agent research today. The Web spiders in this article are classified as Task-Specific Agents in the agent taxonomy.

Similar to a spider, but with more interesting legal questions, is the Web scraper. A scraper is a type of spider that targets specific content from the Web, such as the cost of products or services. One use of the scraper is for competitive pricing, to identify the price of a given product to tailor your price or advertise it accordingly. A scraper can also aggregate data from a number of Web sources and provide that information to a user.

Biological motivation

When you think of a spider in nature, you think of it in its interactions with an environment, not in isolation. The spider sees and feels its way around, moving from one place to another in a meaningful way. Web spiders operate in a similar way. A Web spider is a program written in a high-level language. It interacts with its environment through the use of networking protocols, such as the Hypertext Transfer Protocol (HTTP) for the Web. If your spider wants to communicate with you, it can use the Simple Mail Transfer Protocol (SMTP) to send an e-mail message.

Spiders aren't limited to HTTP or SMTP, though. Some spiders use Web services, such as SOAP or the Extensible Markup Language Remote Procedure Call (XML-RPC) protocol. Other spiders scour newsgroups with the Network News Transfer Protocol (NNTP) or look for interesting news items in Really Simple Syndication (RSS) feeds. While most spiders in nature can see only light-dark intensity and movement changes, Web spiders can see and feel using many types of protocols.





Applications of spiders and scrapers

Spider's eyes and legs

The Web spider's primary means of looking and moving around the Internet is HTTP. HTTP is a message-oriented protocol where a client connects to a server and issues requests. The server provides a response. Each request and response is made up of a header and a body, with the header providing status information and a description of the contents of the body.

Spiders rely mainly on three HTTP request methods. The first is HEAD, which requests only the header information about an asset at the server. The second is GET, which requests the asset itself, such as a file or an image. Finally, the POST request allows the client to send data to the server, commonly through a Web form.
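
To make the header and body split concrete, here is a minimal Python sketch that takes a raw response apart. The response text is hand-written for illustration (the server name ExampleServer is made up), and the helper name split_response is hypothetical. A response to a HEAD request would carry the same kind of header but no body.

```python
# A hand-written response for illustration; a real server's headers will differ.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Server: ExampleServer/1.0\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)


def split_response(raw):
    """Split a raw HTTP response into (status code, headers dict, body)."""
    # A blank line (CRLF CRLF) separates the header from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # The status line looks like "HTTP/1.1 200 OK".
    status_code = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them to lowercase.
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


status, headers, body = split_response(RAW_RESPONSE)
print(status, headers["server"])
```

In practice an HTTP library does this parsing for you, as the listings later in this article show.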

Web spiders and scrapers are useful applications, and, therefore, you can find a variety of different types in use for both good and evil. Let's look at some of the applications that use these technologies.

Search engine Web crawlers

Web spiders make searching the Internet easy and efficient. A search engine uses many Web spiders to crawl the Web pages on the Internet, return their content, and index it. After this is done, the search engine can quickly search the local index to identify the most applicable results for your search. Google also uses the PageRank algorithm, where a Web page's rank in the search results is based on how many other pages link to it. This serves as a vote, where pages with the highest votes get the highest rank in the results.
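
The voting idea can be sketched in a few lines of Python. This is only a plain count of inbound links over a made-up link graph, not the actual PageRank algorithm, which also weights each vote by the rank of the page casting it. The page names and the function names count_votes and rank_pages are hypothetical.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "a.html"],
}


def count_votes(links):
    """Count, for every page, how many other pages link to it."""
    votes = Counter()
    for source, targets in links.items():
        # Each linking page casts at most one vote per target,
        # and a page cannot vote for itself.
        for target in set(targets):
            if target != source:
                votes[target] += 1
    return votes


def rank_pages(links):
    """Return all pages, most-voted first (ties broken by name)."""
    votes = count_votes(links)
    pages = set(links) | set(votes)
    return sorted(pages, key=lambda page: (-votes[page], page))


print(rank_pages(LINKS))
```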

Searching the Internet like this can be expensive, both in terms of the bandwidth required to communicate Web content to the indexer and also the computational expense of indexing the results. Lots of storage is required for this, but apparently it isn't a problem when you consider Google offers 1,000 megabytes of storage for Gmail users.

Web spiders minimize the drain on the Internet using a set of policies. To give you some idea of the scope of the challenge, Google indexes more than eight billion Web pages. The behavior policies define which pages the crawler will bring down to the indexer, how often to go back to a Web site to check it again, and something called a politeness policy, which limits the load the crawler places on any one server. Web servers can exclude crawlers using a file called robots.txt that tells the crawler what can and can't be crawled.
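
As a rough illustration of how a crawler might consult such an exclusion file, the following Python sketch uses the standard library's urllib.robotparser module. The file contents, the agent name minispider, and the example.com URLs are all made up for the example; a real crawler would fetch the file from the server it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion file contents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_policy(robots_text):
    """Parse exclusion rules from text instead of fetching them over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


policy = build_policy(ROBOTS_TXT)
AGENT = "minispider"  # assumed crawler name

# Ask before fetching: is this agent allowed to request this URL?
print(policy.can_fetch(AGENT, "http://www.example.com/index.html"))
print(policy.can_fetch(AGENT, "http://www.example.com/private/data.html"))

# Seconds the site asks crawlers to wait between requests, if it says so.
print(policy.crawl_delay(AGENT))
```

A polite crawler would skip any URL for which can_fetch returns False and sleep for at least the requested delay between requests to the same host.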

Corporate Web crawlers

Like the standard search engine spider, the corporate Web spider indexes Web content, but it targets content that is not available to the general public. For example, companies commonly have internal Web sites that are used by employees. This type of spider is constrained to the local environment. Because its search is restricted, there is usually more computing power available, and specialized and more complete indexes are possible. Google has taken this one step further by providing a desktop search engine to index the content of your personal computer.

Specialized crawlers

There are also a number of non-traditional uses for crawlers, such as archiving content or generating statistics. An archiving crawler simply crawls a Web site pulling content locally to be stored on a long-term storage medium. This can be used for backup or, in more grand cases, to take a snapshot of the content of the Internet. Statistics can be useful in understanding the content of the Internet or the lack thereof. Crawlers can be used to identify how many Web servers are running, how many Web servers of a given type are running, the number of Web pages that are available, and even the number of broken links (those that return the HTTP 404 error, page not found).

Other useful specialized crawlers include Web site checkers. These crawlers look for missing content, validate all links, and ensure that your Hypertext Markup Language (HTML) is valid.

E-mail harvesting crawlers

Now to the dark side. It's unfortunate, but a few bad apples can ruin the Internet for the rest of us. E-mail harvesting crawlers search Web sites for e-mail addresses that are then used to generate the mass of spam that we all deal with each day. Postini reports that, as of August 2005, 70% of all e-mail messages processed for Postini users are unwanted spam.

E-mail harvesting can be one of the easiest crawling activities, as you'll see in the final crawler example in this article.

Now that we've looked at some of the basics of Web spiders and scrapers, the next four examples show how easily you can build spiders and scrapers for Linux with modern scripting languages, such as Ruby and Python.





Example 1: Simple scraper

This example shows you how to figure out what kind of Web server is being run for a given Web site. This can be interesting and, if done on a large enough sample, can provide some intriguing statistics on the penetration of Web servers in government, academia, and industry.

Listing 1 shows a Ruby script that scrapes a Web site to identify the HTTP server. The Net::HTTP class implements an HTTP client and the GET, HEAD, and POST HTTP methods. Whenever you make a request to an HTTP server, part of the HTTP message response indicates the server from which the content is served. Rather than download a page from the site, I simply use the HEAD method to get information about the root page ('/'). As long as the HTTP server responds with success (indicated by a "200" response code), I iterate through each line of the response searching for the server key, and, if found, I print the value. The value for this key is a string representing the HTTP server.


Listing 1. Ruby script for simple metadata scraping (srvrinfo.rb)

#!/usr/local/bin/ruby
require 'net/http'

# Get the first argument from the command-line (the host name)
url = ARGV[0]
abort "usage: #{$0} <host>" if url.nil?

begin

  # Create a new HTTP connection
  httpCon = Net::HTTP.new( url, 80 )

  # Perform a HEAD request (returns the headers only, no body)
  resp = httpCon.head( "/" )

  # If it succeeded (200 is success)
  if resp.code == "200" then

    # Iterate through the response headers (names are lowercase)
    resp.each {|key,val|

      # If the key is the server, print the value
      if key == "server" then

        print "  The server at "+url+" is "+val+"\n"

      end

    }

  end

end

In addition to showing how to use the srvrinfo script, Listing 2 shows some results from a number of government, academic, and business Web sites. There is quite a bit of diversity, from Apache (68% penetration) to Sun and Microsoft® Internet Information Services (IIS). You can also see a case where the server is not reported. It's fun to note that the Federated States of Micronesia is running an old version of Apache (time to update), and Apache.org is on the bleeding edge.


Listing 2. Example usage of the server scraper

[mtj@camus]$ ./srvrinfo.rb www.whitehouse.gov
  The server at www.whitehouse.gov is Apache
[mtj@camus]$ ./srvrinfo.rb www.cisco.com
  The server at www.cisco.com is Apache/2.0 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.gov.ru
  The server at www.gov.ru is Apache/1.3.29 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.gov.cn
[mtj@camus]$ ./srvrinfo.rb www.kantei.go.jp
  The server at www.kantei.go.jp is Apache
[mtj@camus]$ ./srvrinfo.rb www.pmo.gov.to
  The server at www.pmo.gov.to is Apache/2.0.46 (Red Hat Linux)
[mtj@camus]$ ./srvrinfo.rb www.mozambique.mz
  The server at www.mozambique.mz is Apache/1.3.27 
   (Unix) PHP/3.0.18 PHP/4.2.3
[mtj@camus]$ ./srvrinfo.rb www.cisco.com
  The server at www.cisco.com is Apache/1.0 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.mit.edu
  The server at www.mit.edu is MIT Web Server Apache/1.3.26 Mark/1.5 
	(Unix) mod_ssl/2.8.9 OpenSSL/0.9.7c
[mtj@camus]$ ./srvrinfo.rb www.stanford.edu
  The server at www.stanford.edu is Apache/2.0.54 (Debian GNU/Linux) 
	mod_fastcgi/2.4.2 mod_ssl/2.0.54 OpenSSL/0.9.7e WebAuth/3.2.8
[mtj@camus]$ ./srvrinfo.rb www.fsmgov.org
  The server at www.fsmgov.org is Apache/1.3.27 (Unix) PHP/4.3.1
[mtj@camus]$ ./srvrinfo.rb www.csuchico.edu
  The server at www.csuchico.edu is Sun-ONE-Web-Server/6.1
[mtj@camus]$ ./srvrinfo.rb www.sun.com
  The server at www.sun.com is Sun Java System Web Server 6.1
[mtj@camus]$ ./srvrinfo.rb www.microsoft.com
  The server at www.microsoft.com is Microsoft-IIS/6.0
[mtj@camus]$ ./srvrinfo.rb www.apache.org
  The server at www.apache.org is Apache/2.2.3 (Unix) 
	mod_ssl/2.2.3 OpenSSL/0.9.7g

That's useful data, and it's interesting to see what governments and academic institutions use for their Web servers. The next example shows something a little more useful, a stock quote scraper.
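
To turn a set of Server strings like these into the statistics mentioned at the start of this example, you could tally them by server family. The Python sketch below is one simple way to do that, assuming the family is the product name before the first slash. The sample values are copied from Listing 2 (one shortened to a single line), and the function names server_family and tally are hypothetical.

```python
from collections import Counter

# Server header values taken from Listing 2.
SERVER_HEADERS = [
    "Apache",
    "Apache/2.0 (Unix)",
    "Apache/1.3.29 (Unix)",
    "Sun-ONE-Web-Server/6.1",
    "Microsoft-IIS/6.0",
    "Apache/2.2.3 (Unix) mod_ssl/2.2.3 OpenSSL/0.9.7g",
]


def server_family(header):
    """Reduce a Server header to its leading product name."""
    # Take the first whitespace-separated token, then drop any version.
    first_token = header.split()[0]
    return first_token.split("/")[0]


def tally(headers):
    """Count how many sites report each server family."""
    return Counter(server_family(header) for header in headers)


for family, count in tally(SERVER_HEADERS).most_common():
    print(family, count)
```

This rule is deliberately naive: a value such as "MIT Web Server Apache/1.3.26" from Listing 2 would be counted under "MIT", so a real survey would need a smarter classifier.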





Example 2: Stock quote scraper

In this example, I build a simple Web scraper (also called a screen scraper) to collect stock quote information. I do this in a brute-force way by exploiting a pattern in the response Web page, like so:


Listing 3. A simple Web scraper for stock quotes

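#!/usr/local/bin/ruby
require 'net/http'

abort "usage: #{$0} <symbol>" if ARGV[0].nil?

host = "www.smartmoney.com"
link = "/eqsnaps/index.cfm?story=snapshot&symbol="+ARGV[0]

begin

  # Create a new HTTP connection
  httpCon = Net::HTTP.new( host, 80 )

  # Perform a GET request to retrieve the full page
  resp = httpCon.get( link )

  # Find the marker that immediately precedes the price
  stroffset = resp.body =~ /class="price">/
  abort "price marker not found in the response" if stroffset.nil?

  # Skip the 14-character marker and take the next 10 characters
  subset = resp.body.slice(stroffset+14, 10)

  # The price ends where the next tag begins
  limit = subset.index('<') || subset.length

  print ARGV[0] + " current stock price " + subset[0...limit] +
          " (from stockmoney.com)\n"

end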

In this Ruby script, I open an HTTP client connection to a server (in this case, www.smartmoney.com) and build a link that specifically requests a stock quote as passed in by the user (via &symbol=<symbol>). I request this link using the HTTP GET method (to retrieve the full response page) and then search for class="price">, which is immediately followed by the stock's current price. This is cut out of the Web page and then displayed for the user.
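
The same search-and-cut step can be sketched in Python with a regular expression in place of fixed offsets. The page fragment below is invented to match the pattern the scraper looks for; the real page's markup may differ, and the function name extract_price is hypothetical.

```python
import re

# Hypothetical fragment in the shape the scraper expects.
SAMPLE_PAGE = '<td><span class="price">79.28</span></td>'

# The marker, followed by the digits (and optional commas) of the price.
PRICE_PATTERN = re.compile(r'class="price">\s*([0-9][0-9,]*\.?[0-9]*)')


def extract_price(page):
    """Return the price that follows the marker, or None if it is absent."""
    match = PRICE_PATTERN.search(page)
    if match is None:
        return None
    # Remove thousands separators before converting to a number.
    return float(match.group(1).replace(",", ""))


print(extract_price(SAMPLE_PAGE))
```

Returning None when the marker is missing matters for a scraper like this, because any redesign of the page silently breaks the pattern.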

To use the stock quote scraper, I simply invoke the script with the stock symbol of interest, as shown in Listing 4.


Listing 4. Example usage of the stock quote scraper

[mtj@camus]$ ./stockprice.rb ibm
ibm current stock price 79.28 (from stockmoney.com)
[mtj@camus]$ ./stockprice.rb intl
intl current stock price 21.69 (from stockmoney.com)
[mtj@camus]$ ./stockprice.rb nt
nt current stock price 2.07 (from stockmoney.com)
[mtj@camus]$





Example 3: Communicating stock quote scraper

The Web scraper for stock quotes shown in Example 2 was engaging, but it would be really useful to have this scraper routinely monitor the stock price and let you know if your favorite stock has risen above a certain value or dropped below another. Your wait is over. In Listing 5, I update the simple Web scraper to routinely monitor the stock and send an e-mail message when the stock has moved outside of a defined price range.


Listing 5. Stock scraper that can send an e-mail alert

#!/usr/local/bin/ruby
require 'net/http'
require 'net/smtp'

#
# Given a web-site and link, return the stock price
#
def getStockQuote(host, link)

    # Create a new HTTP connection
    httpCon = Net::HTTP.new( host, 80 )

    # Perform a GET request to retrieve the full page
    resp = httpCon.get( link, nil )

    stroffset = resp.body =~ /class="price">/

    subset = resp.body.slice(stroffset+14, 10)

    limit = subset.index('<')

    return subset[0..limit-1].to_f

end


#
# Send a message (msg) to a user.
# Note: assumes the SMTP server is on the same host.
#
def sendStockAlert( user, msg )

    # Build the message as one string: headers, a blank line, then the body
    lmsg = "Subject: Stock Alert\n\n" + msg
    Net::SMTP.start('localhost') do |smtp|
      smtp.send_message( lmsg, "rubystockmonitor@localhost.localdomain", [user] )
    end

end


#
# Our main program, checks the stock within the price band every two
# minutes, emails and exits if the stock price strays from the band.
#
# Usage: ./monitor_sp.rb <symbol> <high> <low> <email_address>
#
begin

  host = "www.smartmoney.com"
  link = "/eqsnaps/index.cfm?story=snapshot&symbol="+ARGV[0]
  user = ARGV[3]

  high = ARGV[1].to_f
  low = ARGV[2].to_f

  while 1

    price = getStockQuote(host, link)

    print "current price ", price, "\n"

    if (price > high) || (price < low) then

      if (price > high) then
        msg = "Stock "+ARGV[0]+" has exceeded the price of "+high.to_s+
               "\n"+host+link+"\n"
      end

      if (price < low) then
        msg = "Stock "+ARGV[0]+" has fallen below the price of "+low.to_s+
               "\n"+host+link+"\n"

      end

      sendStockAlert( user, msg )

      exit

    end

    sleep 120

  end

end

This Ruby script is a bit longer, but it builds on the existing stock scraping script from Listing 3. A new function, getStockQuote, encapsulates the stock scraping function. Another function, sendStockAlert, sends a message to an e-mail address (both are user-defined). The main program is nothing more than a loop to get the current stock price, check to see if it's in band, and, if not, send an e-mail alert to the user. I also delay between checking the stock price because I'm polite and don't want to overload the server.
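
The in-band check at the heart of the loop is easy to isolate and test without touching the network. The Python sketch below replays a fixed list of prices instead of scraping and sleeping; the function names check_band and monitor are hypothetical, and the prices are the ones shown in Listing 6.

```python
def check_band(price, low, high):
    """Return an alert message if price is outside the band, else None."""
    if price > high:
        return "price %.2f has exceeded %.2f" % (price, high)
    if price < low:
        return "price %.2f has fallen below %.2f" % (price, low)
    return None


def monitor(quotes, low, high):
    """Check each price in turn and stop at the first one out of band.

    Returns (alert message or None, number of prices checked).
    """
    checked = 0
    for price in quotes:
        checked += 1
        alert = check_band(price, low, high)
        if alert is not None:
            return alert, checked
    return None, checked


# Prices from Listing 6, replayed without any delay between checks.
QUOTES = [82.06, 82.32, 82.75, 83.36]
alert, checked = monitor(QUOTES, low=75.00, high=83.00)
print(alert, "after", checked, "checks")
```

Separating the decision from the fetching and mailing keeps the part most likely to contain a logic error small enough to verify by hand.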

Listing 6 is a sample invocation of the stock monitor with a popular technology stock. Every two minutes, the stock is checked and printed out. When the stock exceeds the high limit, an e-mail alert is sent and the script exits.


Listing 6. Stock monitor script demonstration


[mtj@camus]$ ./monitor_sp.rb ibm 83.00 75.00 mtj@mtjones.com
current price 82.06
current price 82.32
current price 82.75
current price 83.36

The resulting e-mail is shown in Figure 1, complete with a link to the source of the scraped data.


Figure 1. E-mail alert sent by the Ruby script in Listing 5

Now I'll leave scrapers and dig into the construction of a Web spider.





Example 4: Web site crawler

In this final example, I explore a Web spider that crawls a Web site. For safety, I avoid straying outside of the site and instead simply dig down into a single Web site.

To crawl a Web site and follow the links that are provided within it, you must parse HTML pages. If you can successfully parse a Web page, you can identify links to other resources. Some specify local resources (files), but others represent non-local resources (such as links to other Web pages).
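
As a small illustration of this parsing step, the Python 3 sketch below collects the href value of every anchor tag and sorts each link into local or non-local. It classifies links with urllib.parse rather than by searching for substrings, so it is a stricter variant of the idea, not the method used in the listing that follows. The class name LinkCollector and the sample HTML are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect anchor targets, split into local and non-local links."""

    def __init__(self):
        super().__init__()
        self.local = []
        self.non_local = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; look up href by name.
        href = dict(attrs).get("href")
        if not href:
            return
        parts = urlparse(href)
        # A scheme (http:, mailto:) or a host name means the link
        # does not point at a file on this site.
        if parts.scheme or parts.netloc:
            self.non_local.append(href)
        else:
            self.local.append(href)


SAMPLE_HTML = """
<a class="nav" href="/news/index.html">News</a>
<a href="http://www.example.org/">Elsewhere</a>
<a href="mailto:someone@example.org">Mail</a>
<a name="top">No link here</a>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
collector.close()
print(collector.local)
print(collector.non_local)
```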

To crawl the Web, you start with a given Web page, identify all of the links that are on that page, queue them to a to-visit queue, and then repeat this process using the first item from the to-visit queue. This results in breadth-first traversal (compared to digging down into the first link found, which would result in depth-first behavior).
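
The traversal order is easiest to see on a tiny in-memory site, with no networking involved. In this Python sketch the dictionary SITE stands in for the Web (each page maps to the links found on it), and the function name crawl_order is hypothetical.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on that page.
SITE = {
    "/": ["/a.html", "/b.html"],
    "/a.html": ["/c.html", "/"],
    "/b.html": ["/c.html"],
    "/c.html": [],
}


def crawl_order(site, start):
    """Return the pages in the order a breadth-first crawl visits them."""
    to_visit = deque([start])  # the to-visit queue
    seen = {start}             # pages already queued, to avoid loops
    order = []
    while to_visit:
        # Always take the oldest queued page: this is what makes the
        # traversal breadth-first rather than depth-first.
        page = to_visit.popleft()
        order.append(page)
        for link in site.get(page, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return order


print(crawl_order(SITE, "/"))
```

Taking pages from the other end of the queue (pop instead of popleft) would turn the same loop into a depth-first crawl.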

If you avoid non-local links and dig down only into local Web pages, you provide a Web crawler for a single Web site, as shown in Listing 7. In this case, I switch from Ruby to Python to take advantage of Python's useful HTMLParser class.


Listing 7. Simple Python Web site crawler (minispider.py)

#!/usr/bin/env python3

import http.client
import sys
from html.parser import HTMLParser


class miniHTMLParser( HTMLParser ):

  def __init__( self ):
    HTMLParser.__init__( self )
    # Links already seen, and links still waiting to be interrogated
    self.viewedQueue = []
    self.instQueue = []

  def get_next_link( self ):
    if self.instQueue == []:
      return ''
    else:
      return self.instQueue.pop(0)


  def gethtmlfile( self, site, page ):
    try:
      httpconn = http.client.HTTPConnection(site)
      httpconn.request("GET", page)
      resp = httpconn.getresponse()
      # The body arrives as bytes; decode it for the parser
      resppage = resp.read().decode('utf-8', 'replace')
    except (OSError, http.client.HTTPException):
      resppage = ""

    return resppage


  def handle_starttag( self, tag, attrs ):
    if tag == 'a' and attrs:
      # Note: this takes the value of the tag's first attribute,
      # which is assumed to be the href
      newstr = str(attrs[0][1])
      if 'http' in newstr or 'mailto' in newstr or 'htm' not in newstr:
        print("  ignoring", newstr)
      elif newstr not in self.viewedQueue:
        print("  adding", newstr)
        self.instQueue.append( newstr )
        self.viewedQueue.append( newstr )


def main():

  if len(sys.argv) < 3:
    print("usage is ./minispider.py site link")
    sys.exit(2)

  mySpider = miniHTMLParser()

  link = sys.argv[2]

  while link != '':

    print("\nChecking link ", link)

    # Get the file from the site and link
    retfile = mySpider.gethtmlfile( sys.argv[1], link )

    # Feed the file into the HTML parser
    mySpider.feed(retfile)

    # Search the retfile here

    # Get the next link in level traversal order
    link = mySpider.get_next_link()

  mySpider.close()

  print("\ndone\n")

if __name__ == "__main__":
  main()

The basic design of this crawler is to load the first link to check onto a queue. This queue serves as the next-to-interrogate queue. As a link is checked, any new links that are found are loaded onto the same queue. This provides a breadth-first search. I also maintain an already-viewed queue and avoid digging into any link that I've seen in the past. That's pretty much it, with much of the real work being done by the HTML parser.

First, I derive a new class, called miniHTMLParser, from Python's HTMLParser class. The class does a few things. First, it's my HTML parser, with a callback method (handle_starttag) whenever a start HTML tag is encountered. I also use the class to access links encountered in the crawl (get_next_link) and to retrieve the file represented by the link (in this case, an HTML file).

Two instance variables are contained within the class: viewedQueue, which contains the links that have been investigated thus far, and instQueue, which represents the links that are yet to be interrogated.

As you can see, the class methods are simple. The get_next_link method checks to see if the instQueue is empty and, if so, returns ''. Otherwise, the next item is returned via the pop method. The gethtmlfile method uses HTTPConnection to connect to a site and return the contents of the defined page. Finally, handle_starttag is called for every start tag in a Web page (that is fed into the HTML parser via the feed method). In this function, I check to see if the link is a non-local link (if it contains http), if it is an e-mail address (via mailto), and also if the link contains 'htm', indicating (with high probability) that it's a Web page. I also check to make sure that I haven't visited it before, and, if not, the link is loaded into my interrogate and viewed queues.

The main method is simple. I create a new miniHTMLParser instance and start with the user-defined site (argv[1]) and link (argv[2]). I grab the contents of the link, feed it into the HTML parser, and grab the next link to visit, if one exists. The loop then continues while there are links remaining to visit.

To invoke the Web spider, you provide a Web site address and a link:

./minispider.py www.fsf.org /

In this case, I'm requesting the root file from the Free Software Foundation. This command results in Listing 8. You can see the new links that are added to the interrogation queue and those that are ignored, such as non-local links. At the bottom of the listing, you can see the interrogation of the links found in the root.


Listing 8. Output from the minispider script

[mtj@camus]$ ./minispider.py www.fsf.org /

Checking link  /
  ignoring hiddenStructure
  ignoring http://www.fsf.org
  ignoring http://www.fsf.org
  ignoring http://www.fsf.org/news
  ignoring http://www.fsf.org/events
  ignoring http://www.fsf.org/campaigns
  ignoring http://www.fsf.org/resources
  ignoring http://www.fsf.org/donate
  ignoring http://www.fsf.org/associate
  ignoring http://www.fsf.org/licensing
  ignoring http://www.fsf.org/blogs
  ignoring http://www.fsf.org/about
  ignoring https://www.fsf.org/login_form
  ignoring http://www.fsf.org/join_form
  ignoring http://www.fsf.org/news/fs-award-2005.html
  ignoring http://www.fsf.org/news/fsfsysadmin.html
  ignoring http://www.fsf.org/news/digital-communities.html
  ignoring http://www.fsf.org/news/patents-defeated.html
  ignoring /news/RSS
  ignoring http://www.fsf.org/news
  ignoring http://www.fsf.org/blogs/rms/entry-20050802.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050712.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050601.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050526.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050513.html
  ignoring http://www.fsf.org/index_html/SimpleBlogFullSearch
  ignoring documentContent
  ignoring http://www.fsf.org/index_html/sendto_form
  ignoring javascript:this.print();
  adding licensing/essays/free-sw.html
  ignoring /licensing/essays
  ignoring http://www.gnu.org/philosophy
  ignoring http://www.freesoftwaremagazine.com
  ignoring donate
  ignoring join_form
  adding associate/index_html
  ignoring http://order.fsf.org
  adding donate/patron/index_html
  adding campaigns/priority.html
  ignoring http://r300.sf.net/
  ignoring http://developer.classpath.org/mediation/OpenOffice2GCJ4
  ignoring http://gcc.gnu.org/java/index.html
  ignoring http://www.gnu.org/software/classpath/
  ignoring http://gplflash.sourceforge.net/
  ignoring campaigns
  adding campaigns/broadcast-flag.html
  ignoring http://www.gnu.org
  ignoring /fsf/licensing
  ignoring http://directory.fsf.org
  ignoring http://savannah.gnu.org
  ignoring mailto:webmaster@fsf.org
  ignoring http://www.fsf.org/Members/root
  ignoring http://www.plonesolutions.com
  ignoring http://www.enfoldtechnology.com
  ignoring http://blacktar.com
  ignoring http://plone.org
  ignoring http://www.section508.gov
  ignoring http://www.w3.org/WAI/WCAG1AA-Conformance
  ignoring http://validator.w3.org/check/referer
  ignoring http://jigsaw.w3.org/css-validator/check/referer
  ignoring http://plone.org/browsersupport

Checking link  licensing/essays/free-sw.html
  ignoring mailto:webmaster

Checking link  associate/index_html
  ignoring mailto:webmaster

Checking link  donate/patron/index_html
  ignoring mailto:webmaster

Checking link  campaigns/priority.html
  ignoring mailto:webmaster

Checking link  campaigns/broadcast-flag.html
  ignoring mailto:webmaster

done

[mtj@camus]$

This example demonstrates the crawling phase of a Web spider. After a file is read by the client, the page could also be scanned for content, as in the case of an indexer.
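
As a hint of what that scanning step might look like, the Python sketch below builds a tiny inverted index that maps each word to the pages containing it. It assumes the page text has already been stripped of HTML tags; the page contents and the function name index_pages are made up. In Listing 7, a call like this would belong at the comment that reads "Search the retfile here".

```python
import re
from collections import defaultdict


def index_pages(pages):
    """Map each lowercase word to the set of links whose text contains it."""
    index = defaultdict(set)
    for link, text in pages.items():
        # Split the text into runs of letters and digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(link)
    return index


# Hypothetical page text, keyed by link.
PAGES = {
    "/": "Free software news and campaigns",
    "/licensing.html": "Free software licensing essays",
}

index = index_pages(PAGES)
print(sorted(index["free"]))
print(sorted(index["licensing"]))
```

A search then becomes a dictionary lookup per query word, which is why a search engine can answer from its local index without revisiting the Web.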





Linux spidering tools

You've now seen how to implement a couple of scrapers and a spider. Linux also offers tools that can provide this functionality for you.

The wget command, which stands for Web get, is a useful command for recursively working through a Web site and grabbing content of interest. You can specify a Web site, content that you're interested in, and some other administrative options. The command then sucks down the files to your local host. For example, the following command will connect to your defined URL and recursively walk down no more than three levels and grab any file with an extension of mp3, mpg, mpeg, or avi.

wget -A mp3,mpg,mpeg,avi -r -l 3 http://<some URL>

The curl command operates in a similar way. Its advantage is that it's actively developed. Other similar commands that you can use are snarf, fget, and fetch.





Legal issues

There have been lawsuits over data mining on the Internet using Web spiders, and they've not gone well for the companies doing the mining. Farechase, Inc. was recently sued by American Airlines for screen scraping (done in real-time). The lawsuit first claimed that gathering the data violated American Airlines' user agreement (found under Terms and Conditions). When that wasn't successful, American Airlines claimed a form of trespass, which was successful. Other lawsuits claim that the bandwidth taken by the spiders and scrapers detracts from legitimate users. All are valid claims and make politeness policies all the more important. See the Resources section for more information.





Going further

Crawling and scraping the Web can be fun and, for some, extremely profitable. But, as previously discussed, there are legal issues. When spidering or scraping, always obey the robots.txt file available on the server and incorporate it into your politeness policy. Newer protocols, such as SOAP, make spidering much easier and less intrusive to normal Web operations. Future endeavors, such as the semantic Web, will make spidering even simpler, so the solutions and methods of spidering will continue to grow.



Resources

Get products and technologies
  • Searchtools.com's Source Code for Web Robot Spiders provides source code for free open source robots in several languages for a number of tasks.

  • Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

  • With IBM trial software, available for download directly from developerWorks, build your next development project on Linux.



About the author

M. Tim Jones is an embedded software engineer and the author of GNU/Linux Application Programming, AI Application Programming (now in its second edition), and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.





DB2, Lotus, Rational, Tivoli, and WebSphere are trademarks of IBM Corporation in the United States, other countries, or both. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft is a trademark of Microsoft Corporation in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.