                          FEATURE ARTICLE
        http://www.freepint.co.uk/issues/080600.htm#feature

                        "The Invisible Web"
                          By Chris Sherman

There's a big problem with most search engines, and it's one many
people aren't even aware of. The problem is that vast expanses of the
Web are completely invisible to general purpose search engines like
AltaVista, HotBot and Google. Even worse, this "Invisible Web" is in
all likelihood growing significantly faster than the visible Web
you're familiar with.

So what is this Invisible Web and why aren't search engines indexing
it?  To answer this question, it's important to first define the
"visible" Web, and describe how search engines compile their indexes.

The Web was created a little over ten years ago by Tim Berners-Lee, a
researcher at the CERN high-energy physics laboratory in Switzerland.
Berners-Lee designed the Web to be platform-independent, so that
researchers at CERN could share materials residing on any type of
computer system, avoiding cumbersome and potentially costly conversion
issues.  To enable this cross-platform capability, Berners-Lee created
HTML, or HyperText Markup Language - essentially a dramatically
simplified version of SGML (Standard Generalized Markup Language).

HTML documents are simple: they consist of a "head" portion, with a
title and perhaps some additional meta data describing the document,
and a "body" portion, the actual document itself.  The simplicity of
this format makes it easy for search engines to retrieve HTML
documents, index every word on every page, and store them in huge
databases that can be searched on demand.
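The head/body split described above is what makes full-text indexing so mechanical. As a rough sketch (in Python, and not any actual engine's code), a parser can separate the title from the body words before both are stored in the index:

```python
from html.parser import HTMLParser

# Hypothetical illustration: pull the <title> and the body words out of
# an HTML document, roughly as an indexer would before storing them.
class PageIndexer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.words.extend(data.split())

page = ('<html><head><title>Free Pint</title></head>'
        '<body>search the invisible web</body></html>')
indexer = PageIndexer()
indexer.feed(page)
print(indexer.title)   # -> Free Pint
print(indexer.words)   # -> ['search', 'the', 'invisible', 'web']
```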

What's less easy is the task of actually finding all the pages on the
Web.  Search engines use automated programs called spiders or robots
to "crawl" the Web and retrieve pages.  Spiders function much like a
hyper-caffeinated Web browser - they rely on links to take them from
page to page.
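That link-following behaviour can be sketched in a few lines. The example below is a toy breadth-first crawler over a hypothetical in-memory "site" (the PAGES dict stands in for real HTTP fetches); a real spider adds politeness delays, robots.txt checks, and per-site page limits:

```python
from collections import deque
import re

# Toy Web: a dict standing in for real HTTP fetches (hypothetical pages).
PAGES = {
    "/": '<a href="/about">about</a> <a href="/news">news</a>',
    "/about": '<a href="/">home</a>',
    "/news": '<a href="/archive">archive</a>',
    "/archive": "",
}

def crawl(start):
    """Breadth-first crawl: follow links from page to page, as a spider does."""
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        for link in re.findall(r'href="([^"]+)"', PAGES.get(url, "")):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)

print(crawl("/"))  # -> ['/', '/about', '/archive', '/news']
```

Note that the crawler can only ever reach pages some other page links to; anything with no inbound links simply never enters the queue, which is one root cause of the Invisible Web.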

Crawling is a resource-intensive operation.  It also puts a certain
amount of demand on the host computers being crawled.  For these
reasons, search engines will often limit the number of pages they
retrieve and index from any given Web site.  It's tempting to think
that these unretrieved pages are part of the Invisible Web, but they
aren't.  They are visible and indexable, but the search engines have
made a conscious decision not to index them.

In recent months, much has been made of these overlooked pages.  Many
of the major engines are making serious efforts to include them and
make their indexes more comprehensive.  Unfortunately, the engines
have also discovered through their "deep crawls" that there's a
tremendous amount of duplication and spam on the Web.  Current
estimates put the Web at about 1.2 to 1.5 billion indexable pages.
Both Inktomi and AltaVista have claimed that they've spidered most of
these documents, but have been forced to cull their indexes to cope
with duplicates and spam.  Inktomi puts the size of the distilled Web
at about 500 million pages; AltaVista at about 350 million.

But these numbers don't include Web pages that can't be indexed, or
information that's available via the Web but isn't accessible by the
search engines.  This is the stuff of the Invisible Web.

Why can't some pages be indexed?  The most basic reason is that there
are no links pointing to a page that a search engine spider can
follow.  Or, a page may be made up of data types that search engines
don't index - graphics, CGI scripts, Macromedia Flash or PDF files,
for example.

But the biggest part of the Invisible Web is made up of information
stored in databases. When an indexing spider comes across a database,
it's as if it has run smack into the entrance of a massive library
with securely bolted doors. Spiders can record the library's address,
but can tell you nothing about the books, magazines or other documents
it contains.

There are thousands - perhaps millions - of databases containing
high-quality information that are accessible via the Web.  But in
order to search them, you typically must visit the Web site that
provides an interface to the database.  The advantage of this direct
approach is that you can use search tools that were specifically
designed to retrieve the best results from the database.  The
disadvantage is that you need to find the database in the first place,
a task the search engines may or may not be able to help you with.

Another problem is that content in some databases isn't designed to be
directly searchable.  Instead, Web developers are taking advantage of
database technology to offer customized content that's often assembled
on the fly. Search engine results pages are an example of this type of
dynamically generated content - so are services like My Excite and My
Yahoo.  As Web sites get more complex and users demand more
personalization, this trend toward dynamically generated content will
accelerate, making it even harder for search engines to create
comprehensive Web indexes.
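One concrete way this played out: many spiders of this era simply refused to follow URLs containing a query string, since a "?" usually signalled dynamically generated database output and a potential crawler trap of endless machine-made pages. A minimal sketch of that heuristic (the URLs are hypothetical):

```python
# Hypothetical illustration of a common spider heuristic: skip URLs with
# a query string ("?"), which usually marks dynamically generated pages.
def is_crawlable(url):
    return "?" not in url

urls = [
    "http://example.com/about.html",
    "http://example.com/search?q=invisible+web",
    "http://example.com/catalog?item=42",
]
print([u for u in urls if is_crawlable(u)])
# -> ['http://example.com/about.html']
```

Everything behind the filtered-out URLs - every catalog record, every search result - stays invisible to the engine, no matter how valuable it is.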

In a nutshell, the Invisible Web is made up of content that search
engines either can't or won't index.  It's a huge part of
the Web, and it's growing.  Fortunately, there are several reasonably
thorough guides to the Invisible Web.

Gary Price, Reference Librarian at the Gelman Library at George
Washington University, is considered one of the foremost authorities
on online databases and other invaluable search resources on the
Invisible Web. Price has assembled a massive collection of links to
Invisible Web resources at his Direct Search page
<http://gwis2.circ.gwu.edu/~gprice/direct.htm>.

"A good librarian would not start looking for a phone number
(specialized, Invisible Web info) by searching the Encyclopaedia
Britannica (general knowledge resource)," says Price. "Both
professional and casual searchers should at least be aware that they
could be missing some information or wasting time finding what could
be found more easily if the right tool for the job is easily
accessible. This is very similar to a good reference librarian
'knowing' the major reference tools in his or her collection."

What kinds of databases does Price consider to be essential Invisible
Web search tools?  He names four as examples:

- The many databases that make up GPO Access.
<http://www.access.gpo.gov/su_docs/aces/aaces002.html>

- Any of the telephone directory databases such as Anywho
<http://www.anywho.com/>, Switchboard <http://www.switchboard.com/>,
and Phone Net U.K. <http://www.bt.com/phonenetuk/>.

And two that are crucial to the business searcher:

- Any of the many flavors of EDGAR, particularly the 10K Wizard.
<http://www.tenkwizard.com/>
- The Mercury Center searchable version of the PricewaterhouseCoopers
Money Tree Survey of venture capital made available by the San Jose
Mercury News. <http://wwdyn.mercurycenter.com/business/moneytree/>

"In addition to text media, the Internet is serving up many other
formats. "One that interests me a great deal is streaming media. One
experimental project that is noteworthy is the Speechbot engine that
is being developed and tested by Compaq," says Price.
<http://speechbot.research.compaq.com/>

Two other Invisible Web resources Price maintains are his NewsCenter
<http://gwis2.circ.gwu.edu/~gprice/newscenter.htm>, which focuses on
sources providing up-to-the-minute news stories on any subject
imaginable, and his Web Audio Current Awareness Resources page
<http://gwis2.circ.gwu.edu/~gprice/audio.htm>, with links to hundreds
of live and recorded audio/video news and public affairs programs
on the Web.

"By the way, do not mistake an interest in the Invisible Web as a slam
on the general search engines because it is NOT," says Price. "General
search tools are still 100% essential for accessing material on the
Internet."

One of the largest gateways to the Invisible Web is the aptly named
Invisibleweb.com <http://www.invisibleweb.com> from Intelliseek.
"Invisible Web sources are critical because they provide users with
specific, targeted information, not just static text or HTML pages,"
says Sundar Kadayam, CTO and co-founder of Intelliseek.
"InvisibleWeb.com is a Yahoo-like directory.  It is a high-quality,
human-edited and indexed collection of highly targeted databases that
contain specific answers to specific questions," says Kadayam.

Intelliseek also makes BullsEye, a desktop-based meta search engine
that can also access many of the sites included in InvisibleWeb.com.
More information can be found at
<http://www.intelliseek.com/prod/bullseye.htm>.

Other notable Invisible Web resources include:

AlphaSearch
<http://www.calvin.edu/library/searreso/internet/as/>
AlphaSearch is an extremely useful directory of "gateway" sites that
collect and organize Web sites that focus on a particular subject.
Created and maintained by the Hekman Library at Calvin College, it's
both searchable and browsable by either subject discipline or
descriptor.

The Big Hub
<http://www.thebighub.com/>
The Big Hub maintains a directory of over 1,500 subject specific
searchable databases in over 300 categories.  Listings for each
database feature both annotations and search forms to directly access
the database.  While these are useful for quick and dirty searches,
Big Hub's search forms omit most advanced searching features offered
by each database on their own site.

Infomine Multiple Database Search
<http://infomine.ucr.edu/search.phtml>
Infomine might be called an "academic" search engine, focusing on
scholarly resource collections, electronic journals and books, online
library card catalogs, and directories of researchers.  Unlike many
Invisible Web search tools, Infomine allows simultaneous searching of
multiple databases.

WebData.com
<http://www.webdata.com/>
WebData is a database portal, specializing in finding, categorizing
and organizing online databases, and providing annotated links with
quality rankings.

As fast as the Web has been growing over the past ten years, it's
likely that its growth rate is accelerating, perhaps exponentially.
Speaking at the NetWorld+Interop conference in May 2000, Inktomi CEO
David Peterschmidt said he expected the Web to grow to more than 8
billion documents by the end of the year - more than a fivefold
increase from its current size.

The major search engines have done a creditable job of scaling with
the visible Web.  For the foreseeable future, however, valuable
resources that are part of the Invisible Web will be beyond their
reach.  Fortunately, we have other workmanlike tools that can help us
navigate the portion of the Web that the search engines can't see.

> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Chris Sherman is the Web Search Guide for About.com,
<http://websearch.about.com>. Chris holds an MA from Stanford
University in Interactive Educational Technology and has worked in the
Internet/Multimedia industry for two decades, currently as President
of Searchwise.net, a Web consulting and training firm.  He's a
frequent contributor to information industry trade publications
including Online Magazine and Information Today. His email address is
[log in to unmask]

Source:
> = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

Free Pint (ISSN 1460-7239) is a free newsletter written by information
professionals who share how they find quality and reliable information
on the Internet.  Useful to anyone who uses the Web for their work, it
is published every two weeks by email.

To subscribe, unsubscribe, find details about contributing,
advertising or to see past issues, please visit the Web site at
http://www.freepint.co.uk/ or call +44 (0)1784 455 466.

> = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
This article is provided to selected readers under Section 107(a) of
the 1976 U.S. Copyright Act