koreyoshi

Heritrix: Learning Notes and Problems Encountered (Part 4)

1.
message:Value of illegal type: 'org.archive.crawler.settings.ModuleType', 'org.archive.crawler.framework.Frontier' was expected.: Value of illegal type: 'org.archive.crawler.settings.ModuleType', 'org.archive.crawler.framework.Frontier' was expected.
Exception:No associated exception.

2.
message:On crawl: question Unable to setup crawl modules
exception:java.lang.ClassCastException: org.archive.crawler.settings.ModuleType cannot be cast to org.archive.crawler.framework.Frontier
Stacktrace: java.lang.ClassCastException: org.archive.crawler.settings.ModuleType cannot be cast to org.archive.crawler.framework.Frontier
at org.archive.crawler.framework.CrawlController.setupCrawlModules(CrawlController.java:675)
at org.archive.crawler.framework.CrawlController.initialize(CrawlController.java:381)
at org.archive.crawler.admin.CrawlJob.setupForCrawlStart(CrawlJob.java:853)
at org.archive.crawler.admin.CrawlJobHandler.startNextJobInternal(CrawlJobHandler.java:1144)
at org.archive.crawler.admin.CrawlJobHandler$3.run(CrawlJobHandler.java:1127)
at java.lang.Thread.run(Thread.java:619)
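
The exception above says that the object configured in the frontier slot was a ModuleType that does not implement Frontier, so the cast inside CrawlController failed. A minimal sketch of that mechanism follows. The types here are stand-ins declared only for illustration; they are not the Heritrix classes, and `asFrontier` is a hypothetical helper.

```java
/** Stand-in types only; these are NOT the real Heritrix classes. */
public class CastDemo {
    static class ModuleType {}                  // generic module base type
    interface Frontier {}                       // what the frontier slot requires
    static class BdbFrontier extends ModuleType implements Frontier {}

    /** Hypothetical helper: mimics code that expects a Frontier in the frontier slot. */
    static Frontier asFrontier(Object configured) {
        // Throws ClassCastException when the configured object is not a Frontier.
        return (Frontier) configured;
    }

    public static void main(String[] args) {
        asFrontier(new BdbFrontier());          // a module that implements Frontier: cast succeeds
        try {
            asFrontier(new ModuleType());       // a module that does not: cast fails
        } catch (ClassCastException e) {
            System.out.println("wrong module in frontier slot: " + e.getMessage());
        }
    }
}
```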

3.
message:Wrong document type 'crawl-order' in 'file:/c:/heritrix/jobs/question-20141005032127804/order.xml', line: 1, column: 160
exception:No associated exception.

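Solution: all three errors are usually caused by a processor chain that is not configured correctly.
For example, putting a Writer where a Prefetcher belongs triggers them.
Configure the modules strictly as follows: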
1. frontier
org.archive.crawler.frontier.BdbFrontier
2. scope
org.archive.crawler.scope.BroadScope
3. Prefetcher
org.archive.crawler.prefetch.Preselector
org.archive.crawler.prefetch.PreconditionEnforcer
4. Fetcher
org.archive.crawler.fetcher.FetchDNS
org.archive.crawler.fetcher.FetchHTTP
5. Extractor
org.archive.crawler.extractor.ExtractorHTTP
org.archive.crawler.extractor.ExtractorHTML
(You can add more extractors here as needed, such as ExtractorSWF or ExtractorJS, but the first two are required.)
6. Writer
Either MirrorWriter or ARCWriter; MirrorWriter is generally recommended.
7. PostProcessor
org.archive.crawler.postprocessor.CrawlStateUpdater
org.archive.crawler.postprocessor.LinksScoper
org.archive.crawler.postprocessor.FrontierScheduler
(FrontierScheduler can be extended with your own subclass, following the method described in the book.)
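
The slot-to-package pairing in the list above can be checked before starting a job. The following is a rough sketch of such a check, not part of Heritrix: `ChainCheck`, `findMisplaced`, and the slot names are hypothetical, and the `org.archive.crawler.writer.` prefix and the `MirrorWriterProcessor` class name are assumptions, since the post gives the writers only by short name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight check; not part of Heritrix. */
public class ChainCheck {

    // Package prefix expected in each slot, taken from the list above.
    // The writer prefix is an assumption (the post names writers by short name only).
    static final Map<String, String> EXPECTED = new LinkedHashMap<>();
    static {
        EXPECTED.put("frontier", "org.archive.crawler.frontier.");
        EXPECTED.put("scope", "org.archive.crawler.scope.");
        EXPECTED.put("prefetcher", "org.archive.crawler.prefetch.");
        EXPECTED.put("fetcher", "org.archive.crawler.fetcher.");
        EXPECTED.put("extractor", "org.archive.crawler.extractor.");
        EXPECTED.put("writer", "org.archive.crawler.writer.");
        EXPECTED.put("postprocessor", "org.archive.crawler.postprocessor.");
    }

    /** Returns one message per class whose package does not match its slot. */
    static List<String> findMisplaced(Map<String, List<String>> configured) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, List<String>> slot : configured.entrySet()) {
            String prefix = EXPECTED.get(slot.getKey());
            if (prefix == null) {
                problems.add("unknown slot: " + slot.getKey());
                continue;
            }
            for (String className : slot.getValue()) {
                if (!className.startsWith(prefix)) {
                    problems.add(slot.getKey() + ": unexpected " + className);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, List<String>> configured = new LinkedHashMap<>();
        configured.put("frontier",
                Arrays.asList("org.archive.crawler.frontier.BdbFrontier"));
        // The mistake described in the post: a writer where a prefetcher belongs.
        // The fully qualified writer name is an assumption.
        configured.put("prefetcher",
                Arrays.asList("org.archive.crawler.writer.MirrorWriterProcessor"));
        for (String problem : findMisplaced(configured)) {
            System.out.println(problem);
        }
    }
}
```

A prefix check like this would also flag a custom FrontierScheduler subclass that lives in your own package, so treat its output as a hint to inspect the job settings rather than as a hard failure.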