
Finding Information on the Web: Invisible Web

Using the Internet for Research? How can you tell if the stuff on the Web is good to use in your class assignment?
Subjects: Computer Science

Invisible Web

THE INVISIBLE WEB

There's lots of helpful information locked away in databases that search engines generally cannot index, because the information is buried deep under many layers of web pages or is only returned in answer to a query. Large amounts of other information are located in databases that charge subscription fees. Together, this information makes up the "invisible web" or "deep web." General-purpose search engines reach little or none of it, so a good searcher should be aware that this type of information exists.

See:  BrightPlanet's white paper on why you should care about the deep web

See Berkeley’s page on what search engines will not find:  http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html

Wikipedia has an extensive page on the Deep Web: http://en.wikipedia.org/wiki/Deep_web

Wendy Boswell at About.com has a great guide to the Invisible Web: http://websearch.about.com/od/invisibleweb/a/invisible_web.htm

Examples of Deep Web sites:

  • Library subscribed databases like Academic Search at Kinlaw Library
  • U.S. Government data stored deep within agency web sites (like census.gov)
  • Professional subscription databases (company or stock info or at association web pages)
  • Individual articles at subscription magazine or newspaper sites (like New York Times)

Fact Sheet

FACT SHEET from BRIGHTPLANET 2010 about the Deep Web - Original URL: http://brightplanet.com/resource-library/fact-sheet/

What is Deep Data?

Deep Data is unknown or hidden data that cannot be easily found using conventional search engines. It is content that sits deep within sites and requires Deep Web harvesting techniques to uncover. Deep Data exists on the Open Source Public Web, on proprietary websites and within private databases.

What is the Deep Web?

The Deep Web is the part of the web housing content that is only accessible when “asked for” through a custom query (something a simple surface search, such as a Google search, cannot accomplish).

Typically, this content cannot easily be found using the link traversal techniques employed by traditional search engine crawlers. Based on some studies, the Deep Web is at least 1000 times larger than the Surface Web, leaving the bulk of all “searchable” information out of common search engines’ reach.

See: Exploring a ‘Deep Web’ that Google can’t Grasp: NYT 2-23-09
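To see why link traversal misses this kind of content, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the toy site, its page names, and the `crawl` and `query` helpers do not come from the fact sheet and do not describe how any real search engine works. The crawler only follows links, so pages that exist only as answers to a query never show up in what it finds.

```python
from collections import deque

# Hypothetical toy site: each page maps to the links it contains.
SITE_LINKS = {
    "/": ["/about", "/search"],
    "/about": ["/"],
    "/search": [],  # a search form: it shows no links until someone submits a query
}

# Hypothetical records that exist only as answers to a query on /search.
DATABASE = {
    "census": "/search?q=census",
    "stocks": "/search?q=stocks",
}


def crawl(start):
    """Breadth-first link traversal: visit every page reachable by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in SITE_LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def query(term):
    """Ask the database directly, as a person filling in the search form would."""
    return DATABASE.get(term)


surface = crawl("/")
deep = set(DATABASE.values()) - surface

print(sorted(surface))  # ['/', '/about', '/search']
print(sorted(deep))     # ['/search?q=census', '/search?q=stocks']
```

The crawler reaches the search form itself but none of the records behind it. Only `query("census")` returns the census page, which is the sense in which this content has to be “asked for.”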

Can Google, Yahoo!, MSN, Cuil and others find Deep Data on the web?

Surface search results are based on “relevancy by popularity”: they are ranked by the total “hits” that pages receive from users’ simple search queries.

While search engines can “find” deep data, their coverage is often sporadic and intermixed with less relevant (and too much) content. To find exactly the content needed, a user must traverse through “all” content within each surface site (Google, MSN, etc.).

Further, a researcher who wants to find Deep Data using Surface Search Engines must rely on their own content expertise and personal ability to navigate the web “one click at a time” (link traversal), a time-consuming process which has become normal behavior when using standard search engines.

What is Google Missing?

The Surface Web contains only a fraction of the overall content available on-line today. Of the top 5 surface search engines, Google represents only 63% of the total indexed content of the Surface Web alone! (See latest rankings)

Limiting a search to a single source (like Google) will produce a one-dimensional set of results. Harvesting from many sources, 10 to 20 or even 100, will yield far more documents and far more relevant content.
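The idea of harvesting from several sources can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three sources, their document titles, and the `search` and `harvest` helpers are hypothetical, and real harvesting tools are far more involved. The sketch shows the counting argument only (more sources, more distinct documents); it does not measure relevance.

```python
# Hypothetical indexes: each source knows about a different set of documents.
SOURCES = {
    "general_engine": ["deep web overview", "deep web news"],
    "library_database": ["deep web overview", "deep web journal article"],
    "government_site": ["census deep web dataset"],
}


def search(source, term):
    """Return the documents in one source whose title contains the term."""
    return [doc for doc in SOURCES[source] if term in doc]


def harvest(source_names, term):
    """Search every named source and merge the results, skipping duplicates."""
    merged = []
    for name in source_names:
        for doc in search(name, term):
            if doc not in merged:
                merged.append(doc)
    return merged


single = harvest(["general_engine"], "deep web")
many = harvest(list(SOURCES), "deep web")

print(len(single), len(many))  # 2 4
```

The document that appears in two sources is counted once, so the gain from adding sources comes only from what each one holds that the others do not.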

Google, most likely, will not contain the most recent version of a document. Further, there is no way to “refresh” a Google search. Google will often have false positive hits - content that matches your query but is not relevant to your search. Additionally, Google cannot distinguish a page of links from a page of content.

Find the Best for your needs