The main working mechanisms behind a search engine

The data center of a large Internet search engine typically runs thousands or even tens of thousands of computers, and dozens of machines are added to the cluster every day to keep pace with the growth of the web. The collection machines gather web page information automatically, at an average rate of dozens of pages per second. The retrieval machines provide a fault-tolerant, scalable system architecture able to answer up to tens of millions of user query requests per day. Enterprise search engines can be deployed at different scales depending on use, from a single computer to a computer cluster.

A search engine generally works as follows. First, it collects web pages from the Internet. Next, it preprocesses the collected pages and builds a web page index library. It then answers user queries in real time, sorting the matching results by some set of rules before returning them to the user. The main function of a search engine is to provide full-text retrieval over the text information on the Internet.

Search engine workflow

A search engine receives the user's retrieval request through a client program. The most common client today is a browser, but it can equally be a much simpler network application developed by the user. The request is usually a single keyword, or several keywords joined by logical operators. The search server uses the system's keyword dictionary to translate the keywords into WordIDs, then looks them up in the index library (the inverted file) to obtain a DocID list. It scans the objects in the DocID list and matches them against the WordIDs, extracts the pages that satisfy the query conditions, and computes the relevance of each page to the keywords. Finally, it returns the top K results by relevance score to the user (the number of results per page differs between search engines). Figure 1 shows this processing flow.
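The lookup described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular engine is built: the toy corpus `DOCS`, the function names `build_index` and `search`, and the relevance score (a plain sum of term frequencies) are all assumptions made for the example. Keywords are treated as joined by a logical AND.

```python
from collections import defaultdict

# Hypothetical toy corpus: DocID -> page text (an assumption for this example).
DOCS = {
    1: "search engine index search",
    2: "web crawler collects pages",
    3: "search engine crawler",
}

def build_index(docs):
    """Build a keyword dictionary (keyword -> WordID) and an inverted file
    (WordID -> {DocID: term frequency})."""
    dictionary = {}
    inverted = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.split():
            word_id = dictionary.setdefault(word, len(dictionary))
            inverted[word_id][doc_id] = inverted[word_id].get(doc_id, 0) + 1
    return dictionary, inverted

def search(query, dictionary, inverted, k=10):
    """Return up to k DocIDs that contain every query keyword,
    ranked by summed term frequency (ties broken by DocID)."""
    word_ids = [dictionary.get(word) for word in query.split()]
    if not word_ids or any(w is None for w in word_ids):
        return []  # empty query, or a keyword missing from the dictionary
    # Intersect the DocID lists of all keywords (logical AND).
    candidates = set(inverted[word_ids[0]])
    for w in word_ids[1:]:
        candidates &= set(inverted[w])
    # Simplified relevance: total occurrences of the query keywords.
    scores = {d: sum(inverted[w][d] for w in word_ids) for d in candidates}
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]

dictionary, inverted = build_index(DOCS)
print(search("search engine", dictionary, inverted))
```

A real engine would use a far richer relevance model and a compressed on-disk index. The sketch only shows the order of the steps: dictionary lookup, DocID list retrieval, matching, scoring, and top-K selection.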

I. Collector

A search engine gathers pages through a program called a Robot (also called a Spider). The collector's function is to roam the Internet, discovering and gathering information. It collects many types of information, including HTML pages, XML documents, Newsgroup articles, FTP files, word processing documents, and multimedia. The collector is a computer program, and implementations now often use distributed and parallel processing to make information discovery and updating more efficient. The collector of a commercial search engine can gather millions of pages or more every day. It generally runs continuously, collecting as much new information of every type as it can, as quickly as it can. Because information on the Internet changes quickly, the collector must also regularly refresh the old information it has already collected, to avoid dead and invalid links. In addition, because Web information changes dynamically, the collector, analyzer, and indexer should update the database regularly. The update cycle is usually a few weeks or even a few months. The larger the index database, the harder it is to update.

There is too much information on the Internet for even a powerful collector to gather all of it. The collector therefore follows a particular search strategy to traverse the Internet and download documents. A common approach, for example, uses a breadth-first strategy as the primary one, supplemented by a linear strategy.

In a collector implementation, the system maintains a queue or stack of hyperlinks that initially contains some seed URLs (such as the DMOZ and Yahoo directories, Google Sitemap, and so on). The collector starts from these URLs, downloads the corresponding pages, extracts new hyperlinks from them, and adds those to the queue or stack. This process repeats until the queue or stack is empty. To improve efficiency, search engines partition the Web space by domain name, IP address, or country-code domain and run multiple collectors in parallel, each responsible for searching one subspace. To make future service expansion easier, the collector should be able to change its search range.
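As a rough illustration of the loop just described, the following Python sketch keeps a queue of hyperlinks, starts from seed URLs, and runs until the queue is empty. Everything named here is an assumption for the example: `FAKE_WEB` is a hypothetical in-memory stand-in for the network, `fetch_links` replaces real downloading and link extraction, and `collector_for` shows only one possible way (hashing the host name) to split the Web space among parallel collectors.

```python
from collections import deque
from urllib.parse import urlparse
import zlib

# Hypothetical in-memory "web": URL -> hyperlinks found on that page.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return FAKE_WEB.get(url, [])

def crawl(seeds, fetch=fetch_links):
    """Process the hyperlink queue until it is empty.
    Returns the URLs in the order they were downloaded."""
    frontier = deque(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(frontier)
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in fetch(url):
            if link not in seen:  # enqueue each new hyperlink only once
                seen.add(link)
                frontier.append(link)
    return order

def collector_for(url, n_collectors):
    """Assign a URL to one of n_collectors by hashing its host name,
    so that all pages of one site go to the same collector."""
    host = urlparse(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % n_collectors

print(crawl(["http://a.example/"]))
```

Replacing `frontier.popleft()` with `frontier.pop()` would turn the queue into a stack, which changes the order in which pages are visited but not the overall structure of the loop.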

1. Linear collection strategy

The basic idea of the linear strategy is to start from an initial IP address and search the information at each subsequent IP address in increasing address order, completely ignoring the hyperlinks to other Web sites found in each site's HTML files. This strategy is unsuitable for large-scale search (mainly because IP addresses may be dynamic), but it can be used for a comprehensive search of a small range. A collector using this strategy can discover new HTML sources that are rarely cited, or not yet cited at all, by other HTML files.
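A minimal sketch of the linear strategy, using Python's standard `ipaddress` module to step through addresses in increasing order. The function names and the `probe` callback are hypothetical; `probe` stands in for whatever check a real collector would use to decide whether an address hosts content, and no network access happens here.

```python
import ipaddress

def linear_targets(start_ip, count):
    """List `count` consecutive IPv4 addresses starting at start_ip."""
    start = ipaddress.IPv4Address(start_ip)
    return [str(start + offset) for offset in range(count)]

def linear_collect(start_ip, count, probe):
    """Visit addresses in increasing order and keep those where the
    caller-supplied probe(ip) reports content. Hyperlinks are ignored."""
    return [ip for ip in linear_targets(start_ip, count) if probe(ip)]

# Example with a fake probe that "finds" a site only at even final octets.
found = linear_collect("192.0.2.254", 3, lambda ip: int(ip.split(".")[-1]) % 2 == 0)
print(found)
```

Because the addresses are generated arithmetically, the scan crosses octet boundaries on its own (for example from 192.0.2.255 to 192.0.3.0).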

2. Depth-first collection strategy

The depth-first collection strategy was widely used in early collectors. Its goal is to reach the leaf nodes of the structure being searched. A depth-first search follows the hyperlinks in an HTML file downward until it can go no deeper, then returns to the previous HTML file and continues with the other hyperlinks in that file. When no hyperlinks remain to choose from, the search is finished. Depth-first search is well suited to traversing a specified site or a deeply nested set of HTML files, but in a large-scale search the Web structure is so deep that the collector may never come back out.
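The depth-first order can be illustrated on a small link graph. The graph `LINKS`, the function name, and the optional `max_depth` cut-off are assumptions added for this sketch. The cut-off is one simple way to keep a collector from descending indefinitely, and is not something the strategy itself prescribes.

```python
# Hypothetical link graph: page -> hyperlinks on that page, in document order.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}

def depth_first(start, links, max_depth=None):
    """Return pages in depth-first visiting order.
    If max_depth is given, hyperlinks on pages at that depth are not followed."""
    visited = []
    seen = set()

    def visit(page, depth):
        seen.add(page)
        visited.append(page)
        if max_depth is not None and depth >= max_depth:
            return  # stop descending; go back to the previous page
        for nxt in links.get(page, []):
            if nxt not in seen:
                visit(nxt, depth + 1)

    visit(start, 0)
    return visited

print(depth_first("A", LINKS))               # follows A -> B -> D before backing up
print(depth_first("A", LINKS, max_depth=1))  # only A and the pages it links to
```

The recursive form mirrors the description closely, but a very deep site would exhaust Python's recursion limit, so a production collector would more likely use an explicit stack.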

3. Breadth-first collection strategy

The breadth-first collection strategy searches all the content on one level first, then moves on to the next level. Suppose an HTML file contains three hyperlinks. The collector chooses one of them and processes the corresponding HTML file (note: processing here means retrieving the content of the file, and the other hyperlinks in that file are not followed yet). It then returns, chooses the second hyperlink on the first page, processes the corresponding HTML file, and returns again. Once all the hyperlinks on the same level have been processed, it begins to follow the remaining hyperlinks in the HTML files it has just processed. (This is what defines a breadth-first traversal.)

This guarantees that the shallow levels are processed first, so the collector does not get trapped when it meets an endlessly deep branch. The breadth-first strategy is easy to implement and widely used, but it takes longer to reach deeply nested HTML files.
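The level-by-level behaviour can be shown on a small link graph. The graph `LINKS` and the function name are assumptions made for this sketch. It returns the pages grouped by level so that the "same level first" order is visible.

```python
# Hypothetical link graph: page -> hyperlinks on that page, in document order.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}

def breadth_first_levels(start, links):
    """Return pages grouped by level: every page on one level is
    processed before any hyperlink on the next level is followed."""
    seen = {start}
    level = [start]
    levels = []
    while level:
        levels.append(level)
        next_level = []
        for page in level:
            for nxt in links.get(page, []):
                if nxt not in seen:
                    seen.add(nxt)
                    next_level.append(nxt)
        level = next_level
    return levels

print(breadth_first_levels("A", LINKS))
```

Here page F, which sits two levels down, is reached only after every page on the first two levels has been processed. This is the trade-off the text describes: shallow pages come first, and deep pages take longer to reach.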

4. Submission-based collection strategy

Some web pages can be collected through user submission. For example, a business website submits an inclusion request to the search engine, and the collector can then collect the pages of that site specifically and add them to the search engine's index database.

