Indexing Deep Web Content
By Web Ignite Corporation
February 6, 2002
As the World Wide Web morphs from a limited database of text documents into a much wider collection of files, it's important to know more about indexing deep Web content.
Until recently, most search engines indexed only a fraction of Web content -- the pages their spiders could easily find and crawl. Now the focus is shifting to the deep Web, a vast repository of content held in dynamic databases and left untapped because of the limitations of Web crawling technology.
The difference between the surface and deep Web is both qualitative and quantitative. Qualitatively, the deep Web includes documents, images, sounds, and presentations (PDF, Excel, PowerPoint, audio, and many other formats) that are invisible to search engine spiders. Quantitatively, the deep Web has been estimated to be about 500 times larger than the surface Web, but this figure may be misleading.
How Deep?
A BrightPlanet study conducted in March 2000 estimated that public information in the deep Web is about 500 times larger than that on the surface Web, and contains "billions of high-quality documents in about 350,000 specialty databases hidden from the view of standard search engines." However, Quigo CEO Yaron Galai feels that many of these documents may not be worth indexing. "Although the BrightPlanet study is probably academically correct, our estimate is that the useful deep Web is two to three times the size of the surface Web," said Galai. "For example, sites like Barnes & Noble can generate millions of variations of the same page (due to personalization systems, price changes, etc.). While each variation on the Harry Potter page can be considered a unique page, Quigo normalizes these variations, indexing only one relevant Harry Potter page."
As the value of deep Web content has become clearer, technological advances have enabled search engines to reach these hidden resources. More than 100,000 large dynamic sites have been indexed as a result, which has proven extremely helpful for researchers, businesspeople, academics, and consumers.
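The normalization Galai describes can be illustrated with a minimal sketch. It is not Quigo's method, which the article does not detail; it simply assumes that page variations differ only in a few visitor-specific URL parameters. The parameter names in VOLATILE_PARAMS and the example host are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names that vary per visitor rather than per document.
VOLATILE_PARAMS = {"sessionid", "sid", "userid", "ref"}

def normalize_url(url):
    """Return a canonical form of url with visitor-specific parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in VOLATILE_PARAMS]
    kept.sort()  # parameter order alone should not produce a "new" page
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))

def deduplicate(urls):
    """Collapse URL variations into one canonical URL per document."""
    canonical = {}
    for url in urls:
        canonical.setdefault(normalize_url(url), url)
    return list(canonical)
```

Under these assumptions, three variations of one book page that differ only in session and referrer parameters collapse to a single canonical URL, so only one copy would be indexed.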
Framed Site Visibility
Many search engines can't crawl URLs that contain the characters "?" and "&", which introduce and separate common gateway interface (CGI) variables. However, Google and Inktomi recently began to index framed sites, and others will follow.
In the meantime, technicians work around the problem by creating static versions of the site's dynamic pages for search engine crawlers, a solution that takes time, a lot of work, and continuous maintenance. A better strategy is to rewrite your dynamic URLs in a syntax that search engines can crawl. For more info, see Spider Food's Dynamic Web Page Optimization.
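One common way to rewrite dynamic URLs is to carry the CGI variables in the path instead of after a "?". The sketch below shows the idea in both directions. The script name /catalog.cgi and both function names are hypothetical, and a real site would normally do the reverse mapping in its Web server configuration rather than in application code.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, quote, unquote

def to_static_path(dynamic_url):
    """Turn /catalog.cgi?cat=books&id=42 into /catalog.cgi/cat/books/id/42."""
    parts = urlsplit(dynamic_url)
    segments = [parts.path.rstrip("/")]
    # Variables with empty values are dropped by parse_qsl in this sketch.
    for key, value in parse_qsl(parts.query):
        segments.append(quote(key, safe=""))
        segments.append(quote(value, safe=""))
    return "/".join(segments)

def to_dynamic_url(static_path, script="/catalog.cgi"):
    """Reverse the mapping so the original script still receives its variables."""
    prefix = script + "/"
    if not static_path.startswith(prefix):
        return static_path  # not a rewritten URL; leave it alone
    pieces = static_path[len(prefix):].split("/")
    pairs = zip(pieces[0::2], pieces[1::2])  # (name, value), (name, value), ...
    query = urlencode([(unquote(k), unquote(v)) for k, v in pairs])
    return script + "?" + query
```

The crawler sees only the path form, which contains neither "?" nor "&", while the server translates it back before handing the request to the script.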
Dynamic Site Indexing
A number of search engines now index dynamic content:
- Last November, FAST relaunched its crawler technology, which indexes all kinds of dynamic content.
- Google indexes dynamic content and supports hundreds of file formats, including PDF, RTF, PostScript, Word, Excel, and PowerPoint.
- Inktomi indexes dynamic content through a paid partnership program, but only goes a few pages deep within each site unless sites pay to index more pages.
- AltaVista also indexes dynamic content. Basic Submit is free and Express Inclusion requires a fee. However, AltaVista tends to avoid pages that change often, because by the time these pages are indexed, the content is no longer fresh.
Paid Inclusion Programs
Premium services for indexing dynamic sites are available from AltaVista, Inktomi and FAST, to name a few. Note that these services generally do not influence your site's position or rank the way PPC engines do.
AltaVista's Trusted Feed service is ideal for the submission of Web pages that are difficult to crawl. It allows businesses to submit 500 or more URLs via an eXtensible Markup Language (XML) feed directly into the AltaVista index. Partners receive detailed performance reports for each URL submitted, which help them calculate their return on investment (ROI).
Inktomi also has a premium inclusion program that indexes framed and dynamic pages, and allows partners to determine which pages to include in the database.
FAST AllTheWeb's paid-inclusion PartnerSite service guarantees entry in the engine's databases for a per-URL fee (through partners such as Lycos). This service includes a 24-hour refresh and distribution to all FAST's customers.
The above programs operate on a pay-per-click (PPC) model, and compete directly with Overture and other PPC programs. Note that such paid services require close tracking and monitoring to ascertain the best ROI.
Deep Web Search Tools
There are a number of products and services that enhance deep Web searching, including BrightPlanet, Invisible Web (Intelliseek), ProFusion, Quigo, Search.com (C|Net) and Vivisimo. At present, Quigo is the only one with the capability to retrieve, normalize and index documents in an offline crawling process. The others focus on expanding the meta-search engine concept, enabling users to submit queries to thousands of sites (rather than the dozen or so through regular meta-search engines).
Vivisimo and Quigo illustrate the technologies involved. Vivisimo uses document clustering (the automatic organization of documents into groups) to organize unstructured information into hierarchical folders. Document clustering differs from other techniques (e.g., classification, taxonomy building, etc.) in that it is fully automated and requires no human intervention at any point (except for writing the basic algorithms).
The biggest challenge for document clustering is to quickly find meaningful groups that are concisely annotated. Vivisimo's clustering algorithm achieves good results on Web pages, patent abstracts, newswires, meeting transcripts, and television transcripts with little or no customization. Vivisimo's Clustering Engine is sold to enterprises and OEMs for installation on top of their existing search engines.
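To make the idea of document clustering concrete, here is a minimal sketch. It is not Vivisimo's algorithm, which the article does not describe. It assumes a simple greedy scheme that groups documents by word overlap (Jaccard similarity) and labels each group with its most frequent words. The stopword list and the threshold value are arbitrary choices, and the result is a flat set of groups rather than hierarchical folders.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "on", "with"}

def tokens(text):
    """Lowercased set of words in text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster(documents, threshold=0.2):
    """Greedy single pass: join the most similar cluster or start a new one."""
    clusters = []  # each cluster is a list of (document, token set) pairs
    for doc in documents:
        doc_tokens = tokens(doc)
        best, best_score = None, threshold
        for members in clusters:
            score = max(jaccard(doc_tokens, t) for _, t in members)
            if score >= best_score:
                best, best_score = members, score
        if best is None:
            clusters.append([(doc, doc_tokens)])
        else:
            best.append((doc, doc_tokens))
    return clusters

def label(members, size=2):
    """Annotate a cluster with its most frequent words (ties broken alphabetically)."""
    counts = Counter(word for _, t in members for word in t)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]
```

No categories are defined in advance: the groups and their labels come entirely from the documents, which is the property that separates clustering from classification or taxonomy building.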
Quigo's technologies operate behind the scenes to allow portals and search engines to access and manage dynamic Web content that those engines using legacy crawling techniques do not normally index.
In addition to mapping the pages that aren't indexed by traditional search engines, Quigo also uses Information Extraction (IE) algorithms to restructure the information within a page, while keeping each piece of data within its relevant context. This restructuring ability is demonstrated in two search features:
1. Categorized results -- Upon each query, the user is presented with a list of all relevant categories in which results were found. This enables quick refinement, pinpointing the most relevant information. For example, a search for "ford" presents categories such as person, car, actor, and company.
2. Associative search -- Certain attributes are hyperlinked in each Quigo search result. By clicking the linked words, users can move within Quigo's deep Web database to locate other similar documents. For example, a search for "Jurassic Park" will bring up several book sites. Click the author's name, and Quigo displays all books found that were authored by Michael Crichton.
This ability to restructure data can be expanded in many ways. For example, it enables users to sort articles by date, to conduct comparison shopping, to locate the best performing stocks, to find the closest coffee shop by zip code, and so forth.
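Once pages have been restructured into records with named fields, the features described above reduce to simple operations on those records. The sketch below assumes a hypothetical record layout (title, category, author or director, year) and is not a description of Quigo's internals.

```python
from collections import defaultdict

# Hypothetical records, as an extraction step might produce them.
RECORDS = [
    {"title": "Jurassic Park", "category": "book",
     "author": "Michael Crichton", "year": 1990},
    {"title": "The Lost World", "category": "book",
     "author": "Michael Crichton", "year": 1995},
    {"title": "Jurassic Park", "category": "film",
     "director": "Steven Spielberg", "year": 1993},
    {"title": "Timeline", "category": "book",
     "author": "Michael Crichton", "year": 1999},
]

def search(records, query):
    """Plain substring match on the title field."""
    q = query.lower()
    return [r for r in records if q in r["title"].lower()]

def categorize(results):
    """Categorized results: group hits by their category field."""
    groups = defaultdict(list)
    for record in results:
        groups[record["category"]].append(record)
    return dict(groups)

def related(records, attribute, value):
    """Associative search: follow a 'hyperlinked' attribute value."""
    return [r for r in records if r.get(attribute) == value]

def sort_by(records, attribute):
    """Sort records by any extracted field, such as a date."""
    return sorted(records, key=lambda r: r[attribute])
```

Searching for "Jurassic Park" and calling categorize on the hits yields a book group and a film group. Calling related with the author field and "Michael Crichton" corresponds to clicking the author's name in a result.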
Will the ability to feed the huge dynamic-site database into each portal's own database (enabling users to search deeply and directly for specific information) spawn some new monetizing scheme? If Overture's success is any indication, technologies like Quigo could let portals monetize a much larger share of their search traffic.
While today's PPC engines that use manual page submission and bidding are attractive for smaller sites, technologies such as Quigo could enable large-scale collaborations (either as PPC or paid inclusion) between portals and leading Web sites. Since deep Web sites are usually among the most authoritative sites on the Internet, this solution may also overcome the relevance issue, providing a highly useful service for the user as well. Below is a brief description of Quigo's products.
DeepWebGateway enables search engines to index deep Web content that they can't access directly. The DeepWebGateway serves as a retrieval system that locates, retrieves and normalizes deep Web pages. These HTML pages are then processed into highly structured XML documents and fed to the partner search engine. This technology also solves other problems related to deep Web crawling and indexing, such as spider traps and personalization.
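The step from HTML page to structured XML document can be sketched as follows. This is an illustration only, not the DeepWebGateway format: the class names in FIELD_CLASSES, the XML element names, and the assumption that each field sits in a single tag with no nested markup are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical mapping from CSS classes on the source page to XML field names.
FIELD_CLASSES = {
    "product-title": "title",
    "product-author": "author",
    "product-price": "price",
}

class FieldExtractor(HTMLParser):
    """Collect the text of elements whose class attribute names a known field."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in FIELD_CLASSES:
            self._current = FIELD_CLASSES[css_class]
            self.fields[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Sketch assumption: fields contain no nested tags, so any end tag closes the field.
        self._current = None

def html_to_xml(url, html):
    """Return an XML document holding the fields extracted from one HTML page."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    doc = ET.Element("document", {"url": url})
    for name, value in parser.fields.items():
        ET.SubElement(doc, name).text = " ".join(value.split())
    return ET.tostring(doc, encoding="unicode")
```

A partner search engine receiving such a document can index the title, author, and price as separate fields instead of treating the page as undifferentiated text.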
DeepWebSonar is Quigo's end-to-end solution for large portals, and provides page indexing, query analysis, and a unique ranking system. By redirecting user queries to DeepWebSonar, portals can immediately offer deep Web results to their users. DeepWebSonar can be customized to search within specific sites, categories, etc., making it highly attractive for vertical search solutions (e.g., biotech search) and for searching within a network of partner sites.
The Future - 100% Visibility
In closing, it's apparent that Web search continues to morph as the Web matures. Providing large dynamic Web sites with 100 percent visibility is an idea whose time has come. Portals, engines and directories that provide the best services to users are those that will prosper, perhaps redefining the nature of Web search in the 21st century.
