Feedback model based deep web crawling strategy
Date of Publication: 2008-01-01
- Affiliation of Author(s):
- (1) School of Computer, Wuhan University, Wuhan 430079, China
- Journal:
- Journal of Computational Information Systems
- Abstract:
- The crucial issue in Deep Web integration is how to efficiently locate the query interfaces of Deep Web resources. Existing crawlers must retrieve many off-topic pages in order to capture the delayed benefit of links. However, accounting for delayed benefit reduces crawling speed and may cause the crawler to drift from the topic. We therefore propose a Deep Web crawling strategy based on a feedback model. In this strategy, we use an ordinal regression model to construct a page classifier that assigns retrieved pages to three levels, and a link extractor that extracts links at the corresponding three levels. During crawling, the classifier's output serves as feedback that reveals whether the links extracted by the link extractor satisfy the page classifier. Based on this feedback, we extract the features of the links that satisfy the page classifier. These features guide the crawler to quickly extract links that satisfy the page classifier, so that it avoids many off-topic links while retaining links that have delayed benefit. The experimental results indicate that our crawler can automatically extract the features of promising links and avoid many off-topic links, improving both crawling speed and accuracy.
- Co-author:
- Jianwei(1),Tian, Guowen(1), Li, Shijun(1)
- Translation or Not:
- no
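
The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy link graph `TOY_WEB`, the keyword rule in `classify_page` (a stand-in for the ordinal regression classifier), the anchor-token scoring in `LinkFeatureModel`, and the function `crawl` are all hypothetical names and simplifications chosen for illustration. The sketch only shows the general idea: each fetched page is assigned one of three levels, that level is fed back to the features of the link that led to it, and the learned features decide which link is followed next.

```python
from collections import defaultdict

# Hypothetical toy "web": url -> (page text, [(anchor text, target url), ...]).
TOY_WEB = {
    "home": ("portal index", [("book search", "search"),
                              ("sports news", "news"),
                              ("book catalog", "catalog")]),
    "catalog": ("book catalog listing", [("advanced book search", "form")]),
    "search": ("book search form <form>", []),
    "form": ("advanced book search form <form>", []),
    "news": ("sports headlines", [("more sports", "news2")]),
    "news2": ("sports scores", []),
}


def classify_page(text):
    """Stand-in for the ordinal regression page classifier (assumption).

    Returns 2 for a page with a query interface, 1 for an on-topic page,
    and 0 for an off-topic page.
    """
    if "<form>" in text:
        return 2
    if "book" in text:
        return 1
    return 0


class LinkFeatureModel:
    """Learns from classifier feedback which anchor tokens lead to good pages."""

    def __init__(self):
        self.level_sum = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, anchor, level):
        # Feedback step: credit every anchor token with the page's level.
        for tok in anchor.split():
            self.level_sum[tok] += level
            self.count[tok] += 1

    def score(self, anchor):
        # Mean learned level over the tokens seen so far.
        seen = [t for t in anchor.split() if self.count.get(t, 0)]
        if not seen:
            return 1.0  # neutral prior for anchors with no feedback yet
        return sum(self.level_sum[t] / self.count[t] for t in seen) / len(seen)


def crawl(seed, budget):
    """Visit at most `budget` pages, always following the best-scored link."""
    model = LinkFeatureModel()
    frontier = [(seed, "")]  # (url, anchor text of the link that led here)
    visited, order, found = set(), [], []
    while frontier and len(order) < budget:
        # Re-score the whole frontier so new feedback takes effect immediately.
        best = max(range(len(frontier)),
                   key=lambda i: model.score(frontier[i][1]))
        url, anchor = frontier.pop(best)
        if url in visited or url not in TOY_WEB:
            continue
        visited.add(url)
        order.append(url)
        text, links = TOY_WEB[url]
        level = classify_page(text)
        if anchor:
            model.update(anchor, level)
        if level == 2:
            found.append(url)
        frontier.extend((t, a) for a, t in links if t not in visited)
    return order, found


order, found = crawl("home", budget=4)
print(order, found)
```

On this toy graph with a budget of four pages, the crawl visits `home`, `search`, `catalog`, and `form` and never fetches the `news` branch. The level-1 `catalog` page is still followed because its anchor shares a token with a link that already paid off, which is a rough analogue of keeping links with delayed benefit while skipping off-topic ones.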