Taxonomy of crawlers

Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.

Crawler: basic idea
  Starting pages (seeds)

Many names
  Crawler
  Spider
  Robot (or bot)
  Web agent
  Wanderer, worm, …
  And famous instances: googlebot, scooter, slurp, msnbot, …

Motivation for crawlers
  Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
  Vertical (specialized) search engines: … shopping, papers, recipes, reviews, …
  …: competitors, partners
  Monitor Web sites of interest
  Evil: harvest emails for spamming, phishing, …
  … Can you think of some others? …

A crawler within a search engine
  [Figure: Web, googlebot, page repository, text & link analysis, text index, PageRank, ranker, query, hits]

Crawling process
  Spiders (Robots/Bots/Crawlers)
  [Figure: pages connected by <href …> links]
  Web pages are nodes; HTML link references are directed edges.
  [Figure: system block diagram]
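The basic idea above (start from seed pages, follow href links, treat pages as nodes and links as directed edges) can be sketched as a breadth-first traversal. This is only a minimal illustration, not the design of any real crawler such as googlebot: the names `LinkExtractor`, `crawl`, `fetch`, `max_pages`, and `TOY_WEB` are hypothetical, the URLs are made up, and pages come from an in-memory dictionary so the sketch runs without network access. A real crawler would also need politeness delays, robots.txt handling, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) is a hypothetical callable returning HTML text, or None
    when the page cannot be retrieved.
    Returns (pages in visit order, list of directed edges (source, target)).
    """
    frontier = deque(dict.fromkeys(seeds))  # queue of URLs still to visit
    seen = set(frontier)                    # URLs already queued or visited
    order = []
    edges = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            edges.append((url, target))  # page -> page directed edge
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order, edges


# Toy in-memory "Web" with made-up URLs (an assumption for illustration).
TOY_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a>',
    "http://example.org/b": '<a href="/">home</a> <a href="/c">C</a>',
}

order, edges = crawl(["http://example.org/"], TOY_WEB.get)
print(order)       # pages in the order they were crawled
print(len(edges))  # number of directed link edges discovered
```

The `seen` set is what keeps the crawler from looping forever on cycles in the link graph, such as the link from `/b` back to the home page in the toy data.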
Web data mining: Web crawling