李石君
Affiliation: (1) School of Computer, Wuhan University, Wuhan 430079, China
Journal: Journal of Computational Information Systems
Abstract: The crucial issue in Deep Web integration is how to efficiently locate the query interfaces of Deep Web resources. Existing crawlers need to retrieve many off-topic pages in order to capture the delayed benefit of links. However, taking the delayed benefit into account reduces the crawling speed and may cause the crawler to deviate from the topic. We therefore propose a Deep Web crawling strategy based on a feedback model. In this strategy, we use an ordinal regression model to construct a page classifier that classifies the retrieved pages into three levels, and a link extractor that extracts the links of the three levels. During crawling, we treat the classifier's result as feedback that reveals whether the links extracted by the link extractor satisfy the page classifier. According to the feedback, we extract the features of the links that satisfy the page classifier. These features guide the crawler to quickly extract further links that satisfy the page classifier. Thus we avoid many off-topic links while retaining the links that have delayed benefit. The experimental results indicate that our crawler can automatically extract the features of promising links and avoid many off-topic links, increasing the crawler's speed and accuracy.
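The feedback loop described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the class `FeedbackFrontier`, the helper `link_tokens`, the reward values in `LEVEL_REWARD`, and the choice of URL and anchor-text tokens as link features are all hypothetical. The ordinal regression classifier is not implemented here; the sketch assumes it has already assigned each fetched page one of three levels (0, 1, or 2).

```python
import heapq
import re
from collections import defaultdict

# Hypothetical reward per page level (assumed values, not from the paper):
# 0 = off-topic, 1 = on-topic, 2 = page with a query interface.
LEVEL_REWARD = {0: -1.0, 1: 0.5, 2: 1.0}


def link_tokens(url, anchor_text):
    """Split a link's URL and anchor text into lowercase word tokens."""
    text = (url + " " + anchor_text).lower()
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]


class FeedbackFrontier:
    """Priority queue of links, scored by token weights learned from feedback."""

    def __init__(self):
        self.weights = defaultdict(float)  # token -> learned weight
        self.heap = []                     # entries: (-score, order, url, tokens)
        self.order = 0                     # tie-breaker that preserves insertion order

    def score(self, tokens):
        """Sum the current weights of a link's tokens."""
        return sum(self.weights[t] for t in tokens)

    def push(self, url, anchor_text):
        """Add a link; its score is fixed at the time it is pushed."""
        tokens = link_tokens(url, anchor_text)
        heapq.heappush(self.heap, (-self.score(tokens), self.order, url, tokens))
        self.order += 1

    def pop(self):
        """Return the highest-scoring link and its tokens."""
        _, _, url, tokens = heapq.heappop(self.heap)
        return url, tokens

    def feedback(self, tokens, level):
        """Adjust token weights using the level the classifier gave the fetched page."""
        for t in set(tokens):
            self.weights[t] += LEVEL_REWARD[level]


# Usage: two earlier fetches provide feedback, then two new links are ranked.
frontier = FeedbackFrontier()
frontier.feedback(link_tokens("http://example.com/search/books", "advanced search"), 2)
frontier.feedback(link_tokens("http://example.com/news/sports", "latest news"), 0)
frontier.push("http://example.com/news/today", "news")
frontier.push("http://example.com/search/flights", "search flights")
best_url, _ = frontier.pop()
print(best_url)  # http://example.com/search/flights
```

In this sketch, links whose tokens previously led to level-1 or level-2 pages rise in the queue, while links resembling those that led to off-topic pages sink, which mirrors the abstract's idea of using classifier results to steer link selection.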
Co-authors: Jianwei(1), Tian, Guowen(1), Li, Shijun(1)
Translated work: No
Publication date: 2008-01-01