Search engines are a part of everyday life, but most of us have only a vague idea of how they work and little understanding of their limitations.
Search engines work by “crawling,” “indexing” and “searching.” They begin by collecting information about the pages available on the internet. This is accomplished by using a web crawler, an automated program that follows every link it encounters.
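The link-following behavior described above can be sketched as a breadth-first traversal. This is a toy model, not a real crawler: the "web" here is just a hypothetical in-memory dictionary mapping each page to the links it contains, and fetching a page never fails or takes time.

```python
from collections import deque

# A toy "web": each page name maps to the links found on that page.
# All page names here are hypothetical, purely for illustration.
toy_web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["post1", "post2"],
    "post1": ["blog"],
    "post2": ["blog"],
    "island": ["home"],  # links out, but no page links TO it
}

def crawl(seed):
    """Breadth-first crawl: starting from seed, follow every link once."""
    visited = set()
    frontier = deque([seed])
    while frontier:
        page = frontier.popleft()
        if page in visited:
            continue
        visited.add(page)
        # Queue every link on the page; unknown pages yield no links.
        frontier.extend(toy_web.get(page, []))
    return visited

print(sorted(crawl("home")))
```

Note that "island" exists in the toy web but is never reached, because nothing links to it; this is exactly how pages end up invisible to crawlers.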
Next, the pages are analyzed and the data that the search engine has decided to index is extracted. Some search engines, like Google, also save a cached copy of the entire page in addition to the extracted information about it.
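The core data structure behind that extraction step is an inverted index: a map from each word to the set of pages containing it. A minimal sketch, using hypothetical page names and text:

```python
from collections import defaultdict

# Hypothetical crawled pages and their text content.
pages = {
    "page1": "search engines crawl the web",
    "page2": "the web is huge",
    "page3": "engines index the web",
}

def build_index(pages):
    """Build an inverted index: word -> set of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

index = build_index(pages)
print(sorted(index["web"]))      # every page mentioning "web"
print(sorted(index["engines"]))  # every page mentioning "engines"
```

Looking a word up in this structure is instant, no matter how many pages were indexed, which is why a search engine can answer queries without re-reading any pages at search time.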
Finally, when you use a search engine, it actually combs its index for pages that match your search criteria. Since millions of pages might contain the keywords you’re using, many search engines employ some sort of ranking system to display the most authoritative, relevant or popular results first. One of the more than 100 criteria Google uses is called PageRank. It rates a page on a scale from 0 to 10 based on the quantity and quality of the links pointing to it. Basically, the more links to a page that exist, and the higher the PageRank of the linking pages, the higher the PageRank of the linked page. Google’s early success was partly due to this system.
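The circular-sounding rule above ("a page ranks high if high-ranking pages link to it") is usually computed by iteration: start every page with an equal score, then repeatedly let each page pass its score along its outgoing links. A minimal sketch on a hypothetical three-page link graph (the real algorithm and Google's production system are far more elaborate, and the internal score is a probability, not the published 10-point scale):

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along links until scores settle."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline, plus shares from its in-links.
        new = {p: (1 - damping) / n for p in pages}
        for p, outlinks in links.items():
            share = rank[p] / len(outlinks)
            for q in outlinks:
                new[q] += damping * share
        rank = new
    return rank

ranks = pagerank(links)
# "c" is linked to by both "a" and "b", so it ends up with the top score.
```

The damping factor models a reader who occasionally jumps to a random page instead of following links; without it, rank can get trapped in loops.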
Even though all major search engines operate on the principle of crawling, billions of pages are overlooked by this method. All the pages that are reachable through search engines make up the “surface web,” and those that are unreachable make up the “deep web.” In 2001, a report by the search company BrightPlanet estimated that the deep web was 500 times the size of the surface web. There are three main reasons that a page might form part of the deep web.
The first possibility is simple: if a website is not linked to by another website that has already been indexed, then a crawler will not reach it.
The second possibility is that the page, document, or file is in a format that the search engine can’t understand and, as such, can’t gather information for indexing.
The final, and probably the most common, reason is that huge portions of the web are stored in databases. Imagine a huge warehouse filled with row after row of documents. If you want anything, you have to talk to the employee behind the counter and tell them what you’re looking for. They go into the back, look around and bring any documents that fit your criteria back to the counter.
This is great if you’re looking for something in particular, but what if you wanted to take a look at every single document in the warehouse? That’s what a search engine would want to do. The system only works if you are looking for something in particular; you can’t just ask the employee for everything. That leaves you with only one option: conduct a search with every possible set of criteria.
This means you would have to ask the employee to conduct billions upon billions of searches to be sure you got everything. Not only would this take far too long to be realistic, but you would also be monopolizing the employee. That is the problem crawlers have with databases. The crawlers would have to ask the database to perform an immense number of searches, and could conceivably disrupt the database’s operation.
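A rough back-of-the-envelope calculation shows why exhaustive querying is hopeless. Assuming a modest 100,000-word vocabulary (a hypothetical figure for illustration), even restricting yourself to one- and two-word queries already requires about five billion searches:

```python
from math import comb

vocabulary = 100_000  # hypothetical vocabulary size

one_word = vocabulary            # every single-word query
two_word = comb(vocabulary, 2)   # every unordered pair of distinct words

print(one_word + two_word)  # ~5 billion queries, before three-word queries
```

And that count still says nothing about queries with three or more words, dates, numbers, or form fields with multiple inputs, each of which multiplies the total again.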
Some crawlers are now being designed to interact with databases directly, typically by submitting queries through a site’s own search forms, with the goal of one day bringing the deep web to the surface.
Link O’ the Week: BugMeNot
A website with a database of user names and passwords to bypass the compulsory registration that many sites require.
Webcomic O’ the Week: Something Positive by R. K. Milholland
The foul smell of character development.
Free Application O’ the Week: TightVNC
If you’ve ever had to help a computard, you’ll appreciate this desktop sharing program that allows you to remotely control another computer, with their permission, of course.