There is a great deal of helpful information locked away in databases that search engines can never index, because it is buried deep beneath many layers of web pages. Large amounts of other information are held in databases that charge subscription fees. Together this information makes up the "invisible web" or "deep web": no general-purpose search engine can reach or search it. A good searcher should be aware of this type of information.
See: BrightPlanet's white paper on why you should care about the deep web.
See Berkeley’s page on what search engines will not find: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html
Wikipedia has an extensive page on the Deep Web: http://en.wikipedia.org/wiki/Deep_web
Wendy Boswell at About.Com has a great guide to the Invisible Web: http://websearch.about.com/od/invisibleweb/a/invisible_web.htm
Examples of Deep Web sites:
FACT SHEET from BRIGHTPLANET 2010 about the Deep Web - Original URL: http://brightplanet.com/resource-library/fact-sheet/
What is Deep Data?
Deep Data is unknown or hidden data that cannot easily be found using conventional search engines. Deep Data is content that exists deep within sites and requires Deep Web harvesting techniques to uncover. Deep Data exists on the Open Source Public Web, on proprietary websites, and within private databases.
What is the Deep Web?
The Deep Web is the part of the web housing content that is only accessible when "asked for" through a custom query, something a simple surface search query (such as a Google search) cannot accomplish.
Typically, this content cannot easily be found using the link-traversal techniques employed by traditional search engine crawlers. Based on some studies, the Deep Web is at least 1,000 times larger than the Surface Web, leaving the bulk of all "searchable" information out of a common search engine's reach. See "Exploring a 'Deep Web' That Google Can't Grasp," NYT 2-23-09.
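The crawling limitation described above can be sketched in a few lines of Python. The site structure below is entirely hypothetical: pages behind the search form exist, but no static hyperlink points to them, so a crawler that only follows links never reaches them.

```python
# A minimal sketch (hypothetical site structure) of why link-traversal
# crawlers miss Deep Web content: pages produced only by submitting a
# query form never enter the crawl frontier.
from collections import deque

# Each page maps to the pages it hyperlinks to. The "/search-results"
# and "/record/42" pages exist but are reachable only via a form query.
LINKS = {
    "/": ["/about", "/search-form"],
    "/about": ["/"],
    "/search-form": [],  # a form submission, not a hyperlink, leads onward
    "/search-results?q=widgets": ["/record/42"],  # deep content
    "/record/42": [],
}

def crawl(start):
    """Breadth-first link traversal, as a surface crawler would do."""
    seen, frontier = set(), deque([start])
    while frontier:
        page = frontier.popleft()
        if page in seen:
            continue
        seen.add(page)
        frontier.extend(LINKS.get(page, []))
    return seen

surface = crawl("/")
# "/search-results?q=widgets" and "/record/42" are absent from `surface`:
# the crawler indexed only the three statically linked pages.
```

This is the whole Deep Web problem in miniature: the content is on the web, but only a query, not a link, can summon it.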
Can Google, Yahoo!, MSN, Cuil and others find Deep Data on the web?
Surface search results are based on "relevancy by popularity," ranked by the total "hits" from users' simple search queries.
While search engines can "find" deep data, their coverage is often sporadic and intermixed with less relevant (and excessive) content. To find exactly the content needed, a user must traverse "all" content within each surface site (Google, MSN, etc.).
Further, to find Deep Data using surface search engines, a researcher must rely on their own content expertise and personal ability to navigate the web "one click at a time" (link traversal), a time-consuming process that has become normal behavior when using standard search engines.
What is Google Missing?
The Surface Web contains only a fraction of the overall content available online today. Of the top 5 surface search engines, Google represents only 63% of the total indexed content of the Surface Web alone! (See latest rankings)
Limiting a search to a single source (like Google) will produce a one-dimensional set of results. Harvesting from many sources - 10, 20, or even 100 - will yield far more documents and far more relevant content.
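The multi-source harvesting idea can be illustrated with a small sketch. The source names and result lists below are stand-ins, not real engine output; each hypothetical "engine" indexes a different slice of the web, so the merged, de-duplicated result set is larger than any single engine's.

```python
# A hedged sketch of multi-source harvesting: query every source,
# merge results, drop duplicates. SOURCES is illustrative data only.
def harvest(sources, query):
    """Collect results from every source, de-duplicated, in order seen."""
    merged = []
    for name, lookup in sources.items():
        for doc in lookup.get(query, []):
            if doc not in merged:
                merged.append(doc)
    return merged

# Stand-in result sets; each "engine" covers a different slice.
SOURCES = {
    "engine_a": {"deep web": ["doc1", "doc2"]},
    "engine_b": {"deep web": ["doc2", "doc3"]},
    "engine_c": {"deep web": ["doc4"]},
}

results = harvest(SOURCES, "deep web")
# No single source returns more than two documents; merged, there are four.
```

The design point is simply coverage: overlapping indexes de-duplicate away, while each source's unique slice survives into the merged set.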
Google will most likely not contain the most recent version of a document, and there is no way to "refresh" a Google search. Google will often return false positives: content that matches your query but is not relevant to your search. Additionally, Google cannot distinguish a page of links from a page of content.
Use a better search strategy
Choose the best search for your purpose
Search Engine Features
Anatomy of a URL
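The last topic above, URL anatomy, can be previewed with Python's standard library: urllib.parse splits a URL into the named parts (scheme, host, path) that the topic refers to. The URL used here is one of the guide links cited earlier.

```python
# Sketching "Anatomy of a URL" with the standard library's urlparse,
# which decomposes a URL into its structural components.
from urllib.parse import urlparse

url = "http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html"
parts = urlparse(url)

print(parts.scheme)  # protocol used to fetch the page: "http"
print(parts.netloc)  # server (host) name: "www.lib.berkeley.edu"
print(parts.path)    # location of the document on that server
```

Reading a URL this way (protocol, then host, then path) is a quick check on where a page actually lives before you trust it.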