How search engines detect duplicate webpages: a line of thought for optimization

Search engines generally detect duplicate webpages along the following line of thought:

For every webpage, a set of information fingerprints (Fingerprint) is computed. If two webpages share a certain number of identical fingerprints, the overlap between their content is considered very high, that is to say, the two webpages are duplicates of each other.

Different search engines judge duplicate content in somewhat different ways; they mainly differ in the following two points:

1. the algorithm used to compute the information fingerprint (Fingerprint);

2. the parameter that decides how similar two fingerprints must be.

Before describing the specific algorithms, two points need to be made clear first:

1. What is an information fingerprint?

An information fingerprint is formed by collecting certain information from the body text of a webpage, which can be keywords, words, sentences or paragraphs together with their weights inside the page, and then hashing it, for example with MD5, to produce a string. An information fingerprint is like a human fingerprint: as long as the content is not identical, the fingerprints differ.
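As a minimal sketch of the idea in Python (the feature list and its format here are illustrative assumptions, not any engine's actual extraction pipeline):

    import hashlib

    def info_fingerprint(features):
        # features: (item, weight) pairs already extracted from the body text,
        # e.g. keywords, sentences or paragraphs with their in-page weights.
        # Join them into one canonical string and hash it with MD5.
        canonical = "|".join("%s:%s" % (item, weight) for item, weight in features)
        return hashlib.md5(canonical.encode("utf-8")).hexdigest()

    # Identical content yields the identical fingerprint; any change yields a new one.
    print(info_fingerprint([("search", 0.4), ("engine", 0.3)]))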

2. The information the algorithm extracts is not taken from the entire webpage.

Rather, the parts shared across the site, such as navigation, logos and copyright notices (these are called webpage noise), are filtered out first, and the body text that remains is what gets used.
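A toy illustration of such noise filtering (the tag patterns below are purely hypothetical stand-ins for a real engine's template detection):

    import re

    # Hypothetical noise: site-wide blocks (navigation, header/logo, footer/copyright).
    NOISE_BLOCKS = re.compile(r"<(nav|header|footer|aside)\b.*?</\1>", re.S | re.I)

    def body_text(html):
        html = NOISE_BLOCKS.sub(" ", html)    # drop the shared, page-independent parts
        text = re.sub(r"<[^>]+>", " ", html)  # strip the remaining markup
        return re.sub(r"\s+", " ", text).strip()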

The segment signature algorithm

This kind of algorithm cuts the webpage into N segments according to a certain rule, signs each segment, and so forms an information fingerprint for each segment. If M of these N fingerprints are identical (M being a threshold defined by the system), the two pages are considered duplicates.
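A minimal sketch, assuming the "certain rule" is simply cutting the text into N equal-sized pieces (a real engine would segment more intelligently):

    import hashlib

    def segment_signatures(text, n):
        # Cut the page text into n segments and sign each one with MD5.
        step = max(1, len(text) // n)
        segments = [text[i:i + step] for i in range(0, len(text), step)][:n]
        return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in segments}

    def looks_duplicate(text_a, text_b, n=8, m=5):
        # m is the system-defined threshold: pages sharing at least m of the
        # n segment fingerprints are treated as duplicates.
        shared = segment_signatures(text_a, n) & segment_signatures(text_b, n)
        return len(shared) >= m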

This algorithm works very well for duplicate detection on a small scale, but for a search engine as huge as Google, its computational complexity is rather high.

The keyword-based duplicate webpage algorithm

When a search engine like Google crawls a webpage, it records the following information about it:

1. the keywords that appear in the page (obtained through Chinese word segmentation) and the weight of each keyword (keyword density);

2. the Meta Description, or the first 512 bytes of effective text, of each page.

On the second point, Baidu and Google differ somewhat: Google extracts the Meta Description and, when there is none, the 512 bytes related to the query keyword, whereas Baidu directly extracts the latter. Anyone who has used both will have noticed this.
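As a rough sketch of extracting either summary (the regex and the fallback to the first 512 bytes are simplifying assumptions, not how Google or Baidu actually parse pages):

    import re

    def page_summary(html, body_text):
        # Prefer the Meta Description when the page declares one...
        m = re.search(r'<meta\s+name=["\']description["\']\s+content=["\'](.*?)["\']',
                      html, re.I | re.S)
        if m and m.group(1).strip():
            return m.group(1).strip()
        # ...otherwise fall back to the first 512 bytes of effective body text.
        return body_text.encode("utf-8")[:512].decode("utf-8", "ignore")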

In the algorithm descriptions that follow, we fix a few variables for the information fingerprints:

Pi denotes the i-th webpage;

the N keywords with the highest weight in the page form the set Ti = {t1, t2, ..., tN}, and their corresponding weights form Wi = {w1, w2, ..., wN};

the summary information of Pi is denoted Des(Pi); the string formed by concatenating the top N keywords in weight order is denoted Con(Ti); and the string formed after sorting those N keywords is denoted Sort(Ti).

Each of the fingerprints above is hashed with the MD5 function.
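Following this notation, the three fingerprints might be computed like so (a sketch; the summary string passed to des_fp is assumed to be extracted already):

    import hashlib

    def md5(s):
        return hashlib.md5(s.encode("utf-8")).hexdigest()

    def des_fp(summary):      # MD5(Des(Pi)): fingerprint of the page summary
        return md5(summary)

    def con_fp(keywords):     # MD5(Con(Ti)): top-N keywords concatenated in weight order
        return md5("".join(keywords))

    def sort_fp(keywords):    # MD5(Sort(Ti)): the same keywords, order normalized
        return md5("".join(sorted(keywords)))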

The keyword-based duplicate webpage algorithm comes in the following 5 variants:

1. MD5(Des(Pi)) = MD5(Des(Pj)): the summary information of the two pages is exactly the same, so pages i and j are considered duplicate webpages.

2. MD5(Con(Ti)) = MD5(Con(Tj)): the top N keywords of the two pages, and their weight ordering, are the same, so the pages are considered duplicates.

3. MD5(Sort(Ti)) = MD5(Sort(Tj)): the top N keywords of the two pages are the same, while the weights may differ; the pages are still considered duplicates.

4. MD5(Con(Ti)) = MD5(Con(Tj)), and in addition the squared distance between the weight vectors, divided by the sum of their squared norms, falls below some threshold A, i.e. sum_k (w_ik - w_jk)^2 / (sum_k w_ik^2 + sum_k w_jk^2) < A; the two pages are considered duplicates.

5. MD5(Sort(Ti)) = MD5(Sort(Tj)), combined with the same weight condition as in rule 4; the two pages are considered duplicates (see the sketch after this list).
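A sketch of rules 4 and 5 (the default threshold value and the alignment of weights to keywords in rule 5 are assumptions on my part):

    import hashlib

    def md5(s):
        return hashlib.md5(s.encode("utf-8")).hexdigest()

    def weights_close(wi, wj, a):
        # sum_k (w_ik - w_jk)^2 / (sum_k w_ik^2 + sum_k w_jk^2) < A
        num = sum((x - y) ** 2 for x, y in zip(wi, wj))
        den = sum(x * x for x in wi) + sum(y * y for y in wj)
        return den > 0 and num / den < a

    def rule4(kw_i, wi, kw_j, wj, a=0.1):
        # Rule 4: identical keywords in identical weight order, plus close weights.
        return md5("".join(kw_i)) == md5("".join(kw_j)) and weights_close(wi, wj, a)

    def rule5(kw_i, wi, kw_j, wj, a=0.1):
        # Rule 5: identical keyword sets regardless of order, plus close weights,
        # with each weight compared against the weight of the same keyword.
        if sorted(kw_i) != sorted(kw_j):
            return False
        di, dj = dict(zip(kw_i, wi)), dict(zip(kw_j, wj))
        keys = sorted(di)
        return weights_close([di[k] for k in keys], [dj[k] for k in keys], a)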

Regarding the threshold A in the 4th and 5th rules: it exists mainly because, under a single judging condition, many webpages would still be falsely flagged as duplicates. Search engine developers tune A according to the distribution of the weights, so as to prevent such false positives.

This is the deduplication algorithm of Tianwang, the search engine of Peking University (for reference, see the book "Search Engines: Principles, Technology and Systems"). When the five algorithms above run, their effectiveness depends on N, the number of keywords chosen. Of course, the more keywords are chosen, the more accurate the judgement, but the computation also slows down accordingly, so a balance has to be struck between computation speed and deduplication accuracy. According to Tianwang's test results, around 10 keywords is the most appropriate.

Postscript

The above certainly cannot cover every aspect of how a large search engine detects duplicate webpages; they surely use additional information-fingerprint judgements as well. This article is meant as a line of thought, offering an optimization angle for those doing search engine optimization.
