Search engine optimization basics, Part 3: Get your Web pages into search indexes

A search engine can't find what it doesn't see

Level: Introductory

Mike Moran (mikmoran@us.ibm.com), Manager of Site Architecture, ibm.com, IBM
Bill Hunt (billhunt@us.ibm.com), President/CEO, Global Strategies International, LLC

21 Mar 2006

Making your Web site attractive to search engines is a key factor for your success as a Web site developer. Get the basic information you need to organically optimize your Web site in this four-part series. In Part 3 of the series, you'll learn how to get the pages of your Web site into the search indexes.

Web searching is hot and getting hotter. Three-quarters of Web users use search regularly and 64 percent of Web users employ search as their primary method of finding things (see Resources for links to studies). Are these users finding your site? Or is your Web site missing in action?

In the first half of this series, Jennette Banks provided an overview of search marketing in Part 1 and the basics of keyword planning and optimization in Part 2.

Here, in Part 3, we focus on what you need to know to get the pages of your Web site into the search indexes -- those databases that search engines such as Google and Yahoo!® use when a searcher looks for something. If a Web page is missing from a search engine's index, that engine will never find it, so adding your pages to the index is a crucial step in your SEO success.

How many pages from my site are indexed?

If you're anxious to know how well your Web site is indexed, start with a simple test. Go to Google or your own favorite search engine and search for your company's name. If it's a common name (such as AAA Plumbing or Acme Industries), add your location (AAA Plumbing Peoria) or your best product (Acme Industries sheet metal) and see if your site is found.

Once in a great while, a Web site is totally missing from a search index, usually for one of two reasons:

  • Your site is new. If your Web site is just getting off the ground and no other site in the search index links to yours, the search engines haven't discovered it. In this case, just get some other sites to link to yours.
  • Your site is banned. If your site is implicated in unethical (or black hat) SEO practices by the search engines, all of your pages might be removed from their indexes. If you find yourself in this difficult position, find a search marketing expert to analyze your site and identify your offense, and then petition the search engines to give you a reprieve after you fix it.

If you're lucky, when you entered your company's name into your favorite search engine, you found at least one page from your Web site. It's typical for only some of your pages to be indexed by any particular search engine, but it's better if almost all of them are indexed. Every missing page means potential visitors to your site are instead shown your competitors' pages that are indexed.

Inclusion ratio

To start, check your inclusion ratio, a fancy term for the percentage of your total pages indexed by a search engine. The ideal is an inclusion ratio of 100 percent, of course, but you might have to settle for less. If fewer than 50 percent of your pages are included in search indexes, then you have some serious work to do.

To calculate your inclusion ratio, divide the number of pages in a search engine's index by the total number of pages on your site. If you have a relatively small Web site, you can probably estimate your site's total page count rather easily, but large sites are sometimes at a loss to know how many pages they have. For large sites, you can use several methods to estimate your page count:

  • Ask your Web master. Your Web master has undoubtedly been asked this before and might have researched the question.
  • Count the documents in your content management systems. Typically each document creates a unique page, so this gives you an estimate.
  • Use a tool. Programs such as OptiSpider™ or Xenu examine your Web site and report how many pages they find (see Resources).

Once you estimate the size of your Web site, you're ready to find out how much of it is indexed. Google, Yahoo! Search, and MSN Search each provide a "site:" operator that tells you what you need to know. Enter site: followed by your domain name (such as site:kodak.com) and check the number of results returned. Easier still is Marketleap's free Saturation Reporting Tool (see Resources), which shows you any site's page count in each search index.
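
Once you have both numbers, the arithmetic is simple. The following Python sketch shows the calculation; the page counts are made-up figures for illustration, and the 50 percent warning threshold is the rule of thumb given above. Keep in mind that the counts reported by a "site:" search are estimates.

```python
def inclusion_ratio(indexed_pages: int, total_pages: int) -> float:
    """Return the percentage of a site's pages found in a search index."""
    if total_pages <= 0:
        raise ValueError("total_pages must be a positive number")
    return 100.0 * indexed_pages / total_pages


# Hypothetical figures: 1,200 pages on the site, 450 reported by "site:".
ratio = inclusion_ratio(indexed_pages=450, total_pages=1200)
print(f"Inclusion ratio: {ratio:.1f} percent")  # Inclusion ratio: 37.5 percent
if ratio < 50:
    print("Fewer than half of the pages are indexed; serious work ahead.")
```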

The spider path to success

If calculating your inclusion ratio revealed bad news, what can you do? First let's review how search engines index your pages. As we've discussed, search engines use specially designed programs called spiders (or crawlers) to examine pages on sites.

The spider scoops up the HTML for each page, noting links to other pages so it can come back to collect the HTML of those pages later. You can imagine that, given enough time, a spider can eventually find every page on the Web (or at least every page that is linked to another page). The process of getting a page, finding all of the links on that page, and then getting those pages in turn is called crawling the Web.
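
To make the crawling process concrete, here is a minimal sketch in Python. It is not how any real search engine's spider is built; it simply walks a small, hypothetical site held in memory (the SITE dictionary) by collecting anchor links and visiting each newly discovered page in turn.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(pages, start):
    """Breadth-first crawl of an in-memory site; return pages in visit order."""
    seen = [start]
    queue = deque([start])
    while queue:
        url = queue.popleft()
        collector = LinkCollector()
        collector.feed(pages.get(url, ""))
        for link in collector.links:
            if link in pages and link not in seen:
                seen.append(link)
                queue.append(link)
    return seen


# Hypothetical site: three linked pages plus one page nothing links to.
SITE = {
    "/": '<a href="/products.html">Products</a> <a href="/about.html">About</a>',
    "/products.html": '<a href="/">Home</a>',
    "/about.html": '<a href="/products.html">Products</a>',
    "/orphan.html": '<a href="/">Home</a>',
}

print(crawl(SITE, "/"))  # ['/', '/products.html', '/about.html']
```

Notice that /orphan.html never appears in the result. Because no other page links to it, the crawl has no way to discover it.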

Given how spiders work, your best bet is to make it simple for your site to be indexed by creating links to every page -- we call this technique spider paths. Your site already contains paths, and probably already has the most important kind of spider path: your site map. If your site consists of only a few dozen pages, your site map can list every page on the site and link to each one.

Site maps shouldn't exceed 100 links, however, so larger site maps must link to category hub pages that themselves link to the remaining pages on the site. The largest Web sites are typically divided into country sites, which demand their own flavor of site map called a country map that lists each country name with a link to the home page of its country site. Spiders eat up this kind of technique. (See Resources for examples of larger site maps.)
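
The 100-link guideline can be sketched as follows. This Python example is only an illustration: it splits pages into hubs in list order and uses a hypothetical naming scheme (/hub-1.html and so on), whereas a real site would group pages into meaningful categories.

```python
def build_site_map(page_urls, max_links=100):
    """Return the site map's links plus any hub pages needed to respect max_links."""
    if len(page_urls) <= max_links:
        # Small site: the site map can link to every page directly.
        return {"sitemap": list(page_urls), "hubs": {}}
    hubs = {}
    for start in range(0, len(page_urls), max_links):
        hub_name = f"/hub-{start // max_links + 1}.html"  # hypothetical naming scheme
        hubs[hub_name] = page_urls[start:start + max_links]
    # Large site: the site map links to the hubs, and each hub links to its pages.
    return {"sitemap": list(hubs), "hubs": hubs}


# Hypothetical site with 250 pages.
pages = [f"/page-{n}.html" for n in range(1, 251)]
site_map = build_site_map(pages)
print(site_map["sitemap"])
print([len(links) for links in site_map["hubs"].values()])
```

A site with more than 10,000 pages would need a further level of hubs, which this sketch does not attempt.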

Site maps rely on spiders coming to call on your site, but more aggressive methods can land your pages in the search index, too. Both Google and Yahoo! offer inclusion programs specially designed to get your pages indexed. Google's beta program, called Sitemaps (see Resources), costs nothing and provides several ways to tell Google's spider where your pages are. You can even request that Google index some of your pages more frequently than others. Yahoo! offers a paid inclusion program called SiteMatch (see Resources) that promises to reindex your pages within 48 hours. (Google makes no promises about timeliness.)
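
One way to tell a search engine where your pages are is an XML file that lists them. The Python sketch below generates such a file using the element names and namespace of the sitemaps.org protocol; treat that format as an assumption here, because the article does not describe the file layout, and check the program's own documentation before relying on it. The URLs are hypothetical, and the changefreq value is only a hint about how often a page changes.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org protocol (an assumption; see the lead-in).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, change_frequency) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages: the news page changes more often than the home page.
sitemap_xml = build_sitemap([
    ("http://www.example.com/", "weekly"),
    ("http://www.example.com/news.html", "daily"),
])
print(sitemap_xml)
```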

RSS feeds provide yet another way to get your pages indexed quickly whenever they are published. Use Ping-O-Matic! (see Resources) to alert search engines to a new item in your RSS feed, and usually it will be indexed in a day or two.

Clearing your spider paths

Hikers depend on trailblazers to establish and mark new hiking paths, but those trailblazers must regularly clear the paths so they don't fall into disrepair and disuse. Spider paths are no different; unless they are regularly checked, they are likely to become blocked.

Spider paths can easily turn into spider traps when you ignore how spiders work. Pages that work just fine for people can thwart spiders. Spiders are automated, so there's no human visitor to fill out a registration form. If linking to a page on your site requires anything more than following an HTML anchor tag, that link may be hidden from the spider.

That means that JavaScript, Flash, frames, and cookies can also cause problems. If your Web page can't display at all without these technologies, your pages won't be indexed by the spider. Moreover, if users need any of these technologies to follow the links, the spider won't be able to do so.
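
The difference between a link a spider can follow and one it can't is easy to demonstrate. This Python sketch assumes a simple spider that follows only ordinary anchor tags, as described above; the page markup and the openPage function name are hypothetical.

```python
from html.parser import HTMLParser


class AnchorFinder(HTMLParser):
    """Record only links that can be followed by reading an anchor's href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # A javascript: URL needs a script engine, so a simple spider skips it.
        if href and not href.lower().startswith("javascript:"):
            self.hrefs.append(href)


def spider_visible_links(html):
    finder = AnchorFinder()
    finder.feed(html)
    return finder.hrefs


# Hypothetical navigation: one plain link and two that depend on JavaScript.
PAGE = """
<a href="/catalog.html">Catalog</a>
<a href="javascript:openPage('specials')">Specials</a>
<button onclick="location.href='/contact.html'">Contact</button>
"""

print(spider_visible_links(PAGE))  # ['/catalog.html']
```

All three links work for a person using a browser, but only the first is visible to this kind of spider.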

Spiders see only the HTML coding, much as screen readers for visually impaired people do. To simulate what a spider sees, you can disable your browser's support of cookies, JavaScript, and graphics when you view a page, or you can use the text-only Lynx browser or Lynx Viewer (see Resources). If your page is adequately displayed using Lynx, then your page can probably be indexed. Pages that don't display at all or are substantially incomplete aren't easily found by search engines.

Even if you avoid troublesome technologies, you might still cause trouble for the spider. Spiders are real sticklers for correct HTML coding -- browsers are much more lax. Pages that seem fine in a browser can trip up a spider, which loses or misinterprets part or all of your page. An HTML validation service (see Resources) and the Firefox browser can find these errors.

You must also pay attention to the size limits spiders impose on each page's content. Most spiders will index only the first 100,000 characters of your page, which sounds quite large, but can be quickly eaten up if you've dumped JavaScript programs and style sheets into your page, or if you've stuck your entire user manual into a single PDF file. Instead, break up your manuals into separate PDFs for each chapter and move all JavaScript and style sheet coding to external files.
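
A quick check of page size might look like the following Python sketch. The 100,000-character limit is the figure quoted above and may differ from one spider to another; the sample page is hypothetical and deliberately front-loaded with inline script.

```python
SPIDER_CHAR_LIMIT = 100_000  # figure quoted in the article; may vary by engine


def chars_beyond_limit(page_source, limit=SPIDER_CHAR_LIMIT):
    """Return how many characters of the page fall past the indexing limit."""
    return max(0, len(page_source) - limit)


# Hypothetical page: 80,000 characters of inline script ahead of 40,000 of text.
page = "<script>" + "x" * 80_000 + "</script>" + "<p>" + "y" * 40_000 + "</p>"

overflow = chars_beyond_limit(page)
if overflow:
    print(f"{overflow} characters may never be indexed")
```

Moving the script to an external file would shrink this page well below the limit, so all of its text could be indexed.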

Letting in the spider

Once your spider paths are cleared, you must ensure the spider is welcome. The most obvious advice is to make sure your site is up and responding when the spider arrives. Because you don't know when the spider will visit, it's risky to have frequent down times or "maintenance windows" -- the spider will move on to another site.

Almost as bad as a complete outage is very slow response time because spiders are on a schedule. They index fewer pages and return less frequently to sluggish sites because they can index more pages in the same amount of time elsewhere.

If your site is typically available and speedy, you can still inadvertently send the spider packing by miscoding your robots instructions. You can tell spiders to stay away from pages, directories, or your whole site using a robots.txt file, so your site's instructions can mistakenly tell the spider to go away. In addition, each page can encode a robots tag that instructs the spider about whether to index that page and follow links from it. (See Resources.)
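
You can test your robots.txt rules before a spider ever sees them. The sketch below uses the robot parser in Python's standard library; the rule sets, the example.com URLs, and the ExampleSpider agent name are all hypothetical. The second rule set shows how a single stray slash shuts every spider out of the whole site.

```python
from urllib.robotparser import RobotFileParser

# Intended rules: keep spiders out of one directory only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Miscoded rules: "Disallow: /" blocks the entire site.
MISCODED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt, url, agent="ExampleSpider"):
    """Return True if the given robots.txt text lets the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for label, rules in (("intended", ROBOTS_TXT), ("miscoded", MISCODED_ROBOTS_TXT)):
    for path in ("/products.html", "/private/report.html"):
        url = "http://www.example.com" + path
        print(label, path, allowed(rules, url))
```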

Getting the spider to stick around

Even if your site lets the spider in, there's no guarantee that the spider won't abandon your site later.

One sure way to push away the spider is to use long, dynamic URLs for your pages. Many dynamic URLs require parameters to choose the right content to show, such as the French description of product 2372 from the Canada product catalog. Spiders are skittish about these dynamic sites because the combinations of parameters can be almost limitless -- the spider doesn't want to get lost within your site. When spiders see URLs of more than 1,000 characters or more than two parameters, they tend to skip those pages.
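
To find URLs that might put off a spider, you can apply the two thresholds above to a list of your own URLs. This Python sketch does exactly that; the thresholds are the ones quoted in this article rather than published limits of any particular engine, and the example.com URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit

MAX_URL_LENGTH = 1000  # thresholds quoted in the article
MAX_PARAMETERS = 2


def spider_friendly(url):
    """Return True if the URL stays within the length and parameter thresholds."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return len(url) <= MAX_URL_LENGTH and len(params) <= MAX_PARAMETERS


# Hypothetical URLs for the French description of product 2372 in Canada.
dynamic_url = "http://www.example.com/catalog?country=ca&lang=fr&product=2372"
clean_url = "http://www.example.com/ca/fr/product/2372"

print(spider_friendly(dynamic_url))  # False: three parameters
print(spider_friendly(clean_url))    # True: no parameters at all
```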

If your site contains some of these problem URLs, you must consult the documentation for your Web server to change the appearance of the URLs to make the spider happy. For example, Apache uses the "mod_rewrite" capability (see Resources) to change these URLs, and other Web servers have similar capabilities.
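
The idea behind URL rewriting is a mapping from a clean, static-looking address to the dynamic one your application expects. The Python sketch below illustrates only that mapping; it is not mod_rewrite syntax, and the URL layout and parameter names are hypothetical.

```python
import re

# Clean form: /<country>/<language>/product/<number>, for example /ca/fr/product/2372
CLEAN_URL = re.compile(
    r"^/(?P<country>[a-z]{2})/(?P<lang>[a-z]{2})/product/(?P<product>\d+)$"
)


def to_internal_url(path):
    """Translate a clean path into the dynamic URL the application expects."""
    match = CLEAN_URL.match(path)
    if match is None:
        return path  # not a product page; leave the path alone
    return "/catalog?country={country}&lang={lang}&product={product}".format(
        **match.groupdict()
    )


print(to_internal_url("/ca/fr/product/2372"))
# /catalog?country=ca&lang=fr&product=2372
```

The spider sees and follows only the clean address, and the server performs the translation behind the scenes.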

So-called session identifiers also scare spiders away. Some programmers create a parameter in the URL to capture information about which visitor is viewing the page (usually identified with an "id=" followed by a unique alphanumeric code). Spiders hate this technique because it results in the same content being displayed for hundreds or thousands of different URLs. Instead, your programmers should store this information in the session layer of your Web application server or in a cookie. (But never require a cookie to display the page, as we discussed earlier, or the spider won't be able to index it.)
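
The following Python sketch shows why session identifiers look like duplicate content to a spider. It removes the session parameter from a few hypothetical URLs and shows that they all collapse to the same page. The parameter names in SESSION_PARAMS are assumptions; use whatever names your own application generates.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAMS = {"id", "sid", "sessionid"}  # hypothetical parameter names


def strip_session_id(url):
    """Return the URL with any session identifier parameter removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Three visitors, three different URLs, one page of content.
visits = [
    "http://www.example.com/product?item=2372&id=A1B2C3",
    "http://www.example.com/product?item=2372&id=Z9Y8X7",
    "http://www.example.com/product?item=2372&id=Q4R5S6",
]

unique_pages = {strip_session_id(url) for url in visits}
print(unique_pages)  # {'http://www.example.com/product?item=2372'}
```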

Once you analyze your dynamic pages, check out a possible problem that might afflict any page on your site. A redirect is a technique that tells a browser and a spider that the requested URL has changed. If your company changes its name, for example, it will probably change its Web site domain name as well, so redirects allow all of the visitors following links from the old URLs to go to the new ones. But only one method of redirection works for spiders: server-side redirects, also known as 301 redirects (see Resources). Other redirection techniques that work just fine for browsers, such as meta-refresh redirects and JavaScript redirects, aren't followed by spiders, leaving those redirected pages out of the search index.
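
A server-side redirect is simply an HTTP response with status 301 and a Location header naming the new URL. Most sites configure this in the Web server itself, but the mechanism can be sketched as a small Python WSGI application. The old and new URLs are hypothetical, and the respond helper exists only to exercise the application without a network connection.

```python
# Hypothetical mapping from retired paths to their new homes.
MOVED = {"/old-name/index.html": "http://www.new-name.example/index.html"}


def app(environ, start_response):
    """Answer requests for moved pages with a 301 and the new location."""
    target = MOVED.get(environ.get("PATH_INFO", ""))
    if target:
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Current page"]


def respond(path):
    """Call the app directly (no network) and return (status, headers)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    app({"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response)
    return captured["status"], captured["headers"]


print(respond("/old-name/index.html"))
print(respond("/index.html"))
```

Because the redirect is expressed in the HTTP response itself, a spider learns the new address without having to run JavaScript or interpret a meta tag.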

In summary

It seems obvious that your pages must be indexed before a search engine can find them, but most pages aren't indexed. Back in 1999, it was estimated that 16 percent of all Web pages were indexed by search engines, but over the years, the estimate has plummeted: By 2001, it was estimated that only 0.03 percent of the Web was indexed.

Given these bleak statistics, it's likely that your site has many pages missing from the search indexes, and those pages can never be found. Now you know how to solve this problem.

Getting into the search index is not all you need to do, however. In Part 4 of this SEO series, we discuss some search marketing issues typical of large Web sites, such as how to optimize dynamic pages, work across multiple country sites, and get large teams to work together.



Resources

Get products and technologies
  • OptiSpider (which is $98) or Xenu (a freeware application): Find out how many pages your site has.

  • Marketleap's free Saturation Reporting Tool: Get the page count for any site in each search index.

  • Google Sitemaps: Try the free beta version of this inclusion program.

  • Yahoo! Small Business SiteMatch: Explore this paid inclusion program with a promise to reindex your pages within 48 hours.

  • Ping-O-Matic!: Alert search engines to any new items in your RSS feed.

  • Lynx browser: See how your site looks to a page reader (and search engine spider) with this text-only browser.

  • Lynx Viewer: Use this tool if you don't want to download the Lynx browser.

  • W3C Markup Validation Service: Validate your HTML and XHTML Web documents with this free service that checks for conformance to W3C Recommendations and other standards.

About the authors

Coauthor of the book Search Engine Marketing, Inc., Mike Moran is an IBM Distinguished Engineer with more than 20 years of experience in search technology working at IBM Research, Lotus, and other IBM software units. He led the product team that developed the first commercial linguistic search engine in 1989, and has been granted four patents in search and retrieval technology. He led the original search marketing strategy for ibm.com, as well as the integration of ibm.com's site search technologies. Beyond his search work, Mike has spearheaded ibm.com projects in content management, personalization, and Web metrics. Mike is currently the Manager of ibm.com Web Experience, responsible for the site's design, information architecture, technical architecture, and operation.


Bill is responsible for a team of Search Engine Marketing Strategists who help Fortune 200 companies manage their enterprise SEM programs with a global perspective. Bill is currently regarded as a leading expert in both enterprise and international SEM strategy, and is the coauthor of the highly acclaimed book Search Engine Marketing, Inc., published by IBM Press. Bill earned a B.A. in Asian studies and Japanese from the University of Maryland, Tokyo Campus, and a B.S. in international business from California State University, Los Angeles. He is also a veteran of the Marine Corps.