Apache Nutch 2.2 Download

Download
Download from the provider's site

Improvements:
* NUTCH-1576 Need to keep hotStore.flush() exception catching (James Sullivan via lewismc)
* NUTCH-1577 Add target for creating eclipse project (tejasp via lewismc)
* NUTCH-1545 capture batchId and remove references to segments in 2.x crawl script. (Feng)
* NUTCH-1575 support solr authentication in nutch 2.x (Feng)
* NUTCH-1569 Upgrade 2.x to Gora 0.3 (lewismc)
* NUTCH-1243 Junit jar removed from lib (lewismc)
* NUTCH-1249 and NUTCH-1275 : Resolve all issues flagged up by adding javac -Xlint argument (tejasp)
* NUTCH-1513 Support Robots.txt for Ftp urls (tejasp)
* NUTCH-1053 Parsing of RSS feeds fails (tejasp)
* NUTCH-1563 FetchSchedule#getFields is never used by GeneratorJob (Feng)
* NUTCH-1573 Upgrade to most recent JUnit 4.x to improve test flexibility (lewismc)
* Added crawler-commons dependency in pom.xml (tejasp)
* NUTCH-956 solrindex issues: add field tld to Solr schema (Alexis via lewismc, snagel)
* NUTCH-1277 Fix [fallthrough] javac warnings (tejasp)
* NUTCH-1514 Phase out the deprecated configuration properties (if possible) (tejasp)
* NUTCH-1273 Fix [deprecation] javac warnings (lewismc + tejasp)
* NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (tejasp)
* NUTCH-346 Improve readability of logs/hadoop.log (Renaud Richardet via tejasp)
* NUTCH-1501 Harmonize behavior of parsechecker and indexchecker (snagel + lewismc)
* NUTCH-1551 Improve WebTableReader field order and display batchId (lewismc)
* NUTCH-1552 possibility of a NPE in index-more plugin (kaveh minooie via lewismc)
* NUTCH-1547 BasicIndexingFilter – Problem to index full title (Feng)
* NUTCH-1389 parsechecker and indexchecker to report truncated content (snagel)
* NUTCH-1419 parsechecker and indexchecker to report protocol status (snagel via lewismc)
* NUTCH-1038 Port IndexingFiltersChecker to 2.0 (snagel via lewismc)
* NUTCH-1532 Replace 'segment' mapping field with batchId (patches v2 + v3) (Feng +via lewismc)
* NUTCH-1533 Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage (Feng via lewismc)
* NUTCH-XX fix Elastic Search Ivy configuration (Binoy d via lewismc)
* NUTCH-1542 "adddays" param for generator not present in 2.x (tejasp)
* NUTCH-1393 Display consistent usage of GeneratorJob with 1.X (Lufeng +via lewismc)
* NUTCH-1540 Add Gora buffered read and write maximum limits to nutch-default.xml configuration. (lewismc)
* NUTCH-842 AutoGenerate WebPage code (jnioche via lewismc)
* NUTCH-1536 Ant build file has hardcoded conf dir location (zm via lewismc)
* NUTCH-XX remove unused db.max.inlinks property in nutch-default.xml (lewismc)
* NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (tejasp)
* NUTCH-1453 Substantiate tests for IndexingFilters (lufeng via lewismc)
* NUTCH-1274 Fix [cast] javac warnings (tejasp via lewismc)
* NUTCH-1516 Nutch 2.x pom.xml out of sync with ivy.xml (lewismc)
* NUTCH-1510 Upgrade to Hadoop 1.1.1 (markus)
* NUTCH-1503 Configuration properties not in sync between FetcherReducer and nutch-default.xml (snagel + lewismc)
* NUTCH-1394 backport NUTCH-1232 Remove site field from index-basic (lewismc)
* NUTCH-1370 Expose exact number of urls injected @runtime (ferdy, snagel and lewismc)
(includes commit for NUTCH-1471 make explicit which datastore urls are injected to)
* NUTCH-1484 TableUtil unreverseURL fails on file:// URLs (Rogério Pereira Araújo via snagel)
* NUTCH-1451 Upgrade automaton jar to 1.11-8 (lewismc)
* NUTCH-1496 ParserJob logs skipped urls with level info (Nathan Gass via lewismc)
* NUTCH-1488 bin/nutch to run junit from any directory (snagel via lewismc)
* NUTCH-1493 Error adding field 'contentLength'='' during solrindex using index-more (Nathan Gass via lewismc)
* NUTCH-1491 Strip UTF-8 non-character codepoints in title (Nathan Gass via markus)
* NUTCH-1421 RegexURLNormalizer to only skip rules with invalid patterns (snagel)
* NUTCH-1433 Upgrade to Tika 1.2 (jnioche)
* NUTCH-1087 Deprecate crawl command and replace with example script (jnioche)
* NUTCH-874 Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora (part 1) (Kiran Chitturi via lewismc)
* NUTCH-1344 BasicURLNormalizer to normalize https same as http (snagel)
* NUTCH-706 Url regex normalizer: pattern for session id removal not to match "newsId" (Meghna Kukreja via snagel)

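Apache Nutch is an open-source web search engine. Nutch is the combination of the Lucene search engine, Solr full-text search, a web crawler, scoring (PageRank), and a distributed execution framework. Incidentally, the much-discussed Hadoop is the spin-out from Nutch of the mapper / reducer used to build and store the search index, together with the distributed file system.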
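To make the mapper / reducer idea a little more concrete, here is a minimal single-process sketch of how a search index could be built in that style. It is only an illustration under stated assumptions: the function names (`map_page`, `reduce_term`, `build_index`) and the sample pages are hypothetical, and none of this is the Nutch or Hadoop API.

```python
from collections import defaultdict


def map_page(url, text):
    """Mapper (hypothetical): emit one (term, url) pair per distinct word."""
    for term in sorted(set(text.lower().split())):
        yield term, url


def reduce_term(term, urls):
    """Reducer (hypothetical): merge all URLs for a term into a posting list."""
    return term, sorted(set(urls))


def build_index(pages):
    """Run map, shuffle (group by term), and reduce in one process."""
    grouped = defaultdict(list)
    for url, text in pages.items():
        for term, page_url in map_page(url, text):
            grouped[term].append(page_url)
    return dict(reduce_term(term, urls) for term, urls in grouped.items())


# Made-up sample pages, used only for illustration.
pages = {
    "http://example.com/a": "Nutch crawls the web",
    "http://example.com/b": "Hadoop grew out of Nutch",
}
index = build_index(pages)
print(index["nutch"])
```

In a real distributed setup the map and reduce steps would run on separate machines, with the grouping step handled by the framework; here everything runs in one process to keep the sketch short.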

Posted by arkgame