Releases: internetarchive/heritrix3

3.12.0

30 Oct 01:52
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • ConfigurableExtractorJS: Regex rules to skip extracting <script> tags when their attributes match. #672

Bug fixes

  • Docs: Switch bean docs generation to an annotation processor, fixing the bean reference broken by Java language changes. #683
  • StatisticsTracker: Don’t restore crawlEndTime when resuming from a checkpoint. #669
  • ExtractorJS: Fix overriding the strict setting in sheets. #670
  • Berkeley DB: Handle more shutdown interrupts gracefully. #671

Dependency upgrades

  • amqp-client: 5.26.0 → 5.27.0
  • groovy: 4.0.28 → 5.0.2
  • jaxb-runtime: 4.0.5 → 4.0.6
  • jetty: 12.0.27 → 12.0.29
  • jsch: 2.27.3 → 2.27.4
  • junit-jupiter: 5.13.4 → 6.0.0
  • kafka-clients: 3.9.1 → 4.1.0
  • pdfbox: 3.0.5 → 3.0.6
  • rethinkdb-driver: 2.3.3 → 2.4.4
  • spring: 6.2.11 → 6.2.12
  • webarchive-commons: 3.0.0 → 3.0.1
  • webjars-locator-lite: 1.1.0 → 1.1.2

3.11.0

22 Sep 05:04
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • KnowledgableExtractorJS now extends ConfigurableExtractorJS for its additional options. #668

Bug fixes

  • Invalid characters are now stripped from the XML REST API output. Log file truncation after an unclean shutdown can sometimes introduce such characters. #667
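
As an illustration of the kind of filtering described above, here is a minimal sketch that drops code points outside the XML 1.0 `Char` production. The class and method names are hypothetical and this is not taken from the Heritrix source; it only assumes that the invalid characters are things like NUL bytes left behind by a truncated log file.

```java
/** Hypothetical helper for illustration only, not the Heritrix implementation. */
final class XmlCharFilterSketch {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point that XML 1.0 forbids removed. */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // codePoints() reports unpaired surrogates as their own value,
        // so they fall outside the allowed ranges and are dropped too.
        text.codePoints()
                .filter(XmlCharFilterSketch::isXml10Char)
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Tab, line feed and carriage return are kept because XML 1.0 permits them; all other control characters below U+0020 are removed.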

Dependency upgrades

  • codemirror@language: 6.11.2 → 6.11.3
  • jakarta.xml.bind-api: 4.0.2 → 4.0.4
  • jetty: 12.0.25 → 12.0.27
  • jsch: 2.27.2 → 2.27.3
  • gson: 2.13.1 → 2.13.2
  • spring: 6.2.10 → 6.2.11

3.10.2

29 Aug 08:32
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

Bug fixes

  • AMQPPublishProcessor: The User-Agent string is now included in the metadata so Umbra can use it in its own requests. #663
  • FetchDNS: DNS lookups returning 0.0.0.0 are now treated as resolution failure. #665
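
The FetchDNS change can be pictured with a small check like the one below. This is a sketch under assumptions, not the code from #665: the class and method names are made up, and it treats any wildcard ("unspecified") address as unusable, which covers 0.0.0.0 and, as an extra assumption, the IPv6 equivalent.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper for illustration only, not the FetchDNS implementation. */
final class DnsAnswerCheckSketch {

    /**
     * Returns false when the raw address from a DNS answer should be
     * treated as a resolution failure rather than a fetchable host.
     */
    static boolean isUsable(byte[] rawAddress) {
        try {
            // isAnyLocalAddress() is true for 0.0.0.0 (and for :: in IPv6).
            return !InetAddress.getByAddress(rawAddress).isAnyLocalAddress();
        } catch (UnknownHostException e) {
            // Thrown when the array is not 4 or 16 bytes long.
            return false;
        }
    }
}
```

Some DNS-based blocklists answer with 0.0.0.0 for blocked names, which is one plausible reason to treat that answer as a failed lookup.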

Dependency upgrades

  • amqp-client: 5.25.0 → 5.26.0
  • codemirror@language: 6.11.1 → 6.11.2
  • codemirror@legacy-modes: 6.5.0 → 6.5.1
  • codemirror@view: 6.37.2 → 6.38.1
  • commons-cli: 1.9.0 → 1.10.0
  • commons-codec: 1.18.0 → 1.19.0
  • commons-net: 3.11.1 → 3.12.0
  • jetty: 12.0.22 → 12.0.25
  • junit-jupiter: 5.13.3 → 5.13.4
  • groovy: 4.0.27 → 4.0.28
  • spring-framework: 6.2.9 → 6.2.10

3.10.1

21 Jul 08:20
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

Bug fixes

  • FetchHTTP2

    • HTTP/1.1 is now used on servers that don't support ALPN. Fixes IOException: frame_size_error/invalid_frame_length
    • Fixed NullPointerException when the server's IP address isn't available.
  • Seeds report: Redirect URIs are now recorded from the Location header for HTTP status codes 303 See Other,
    307 Temporary Redirect and 308 Permanent Redirect.
    Previously this was only done for 301 Moved Permanently and 302 Found.

  • Public suffixes list: A resource naming conflict between webarchive-commons and crawler-commons for
    effective_tld_names.dat was resolved and the list was updated to the latest version.
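
For the seeds report item above, the set of status codes involved can be sketched as follows. The class and method names are hypothetical, and resolving the Location value against the requested URI is an assumption made for the example rather than a description of how Heritrix records the redirect.

```java
import java.net.URI;
import java.util.Optional;

/** Hypothetical helper for illustration only, not the seeds report code. */
final class SeedRedirectSketch {

    /** Status codes whose Location header is recorded as of 3.10.1. */
    static boolean isLocationRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Resolves the Location header against the requested URI, if applicable. */
    static Optional<URI> redirectTarget(URI requested, int status, String location) {
        if (!isLocationRedirect(status) || location == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(requested.resolve(location.trim()));
        } catch (IllegalArgumentException e) {
            // The Location value was not a syntactically valid URI reference.
            return Optional.empty();
        }
    }
}
```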

Dependency upgrades

  • codemirror@state: 6.4.0 → 6.5.11
  • codemirror@view: 6.37.1 → 6.37.2
  • commons-lang: 2.6 → 3.18.0
  • commons-io: 2.19.0 → 2.20.0
  • crawler-commons: 1.4 → 1.5
  • jetty: 12.0.17 → 12.0.22
  • jsch: 2.27.0 → 2.27.2
  • junit-jupiter: 5.13.2 → 5.13.3
  • restlet: 2.6.0-rc1 → 2.6.0
  • spring: 6.2.7 → 6.2.9
  • webarchive-commons: 2.0.1 → 3.0.0

3.10.0

12 Jun 13:22
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • BrowserProcessor: Loads fetched pages in a local browser (Firefox/ChromeDriver), records all browser requests,
    and runs pluggable behaviors (e.g. scrolling, link extraction). #653

    • Uses the WebDriver BiDi protocol for browser automation.
    • The recording proxy is built on Jetty's ProxyHandler and the FetchHTTP2 module.
    • Status: Working for small crawls but needs more robust error handling (browser crashes, resource limits).
  • Basic web auth: You can now switch the web interface from Digest authentication to Basic authentication with the --web-auth basic command-line option. This is useful when running Heritrix behind a reverse proxy that adds external authentication. #654

  • Robots.txt wildcards: The * and $ wildcard rules from RFC 9309 are now supported. #656

  • FetchHTTP2: Added HTTP proxy support. #657
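
To make the robots.txt wildcard item above concrete, here is a minimal matcher for the two RFC 9309 special characters: `*` matches any run of characters (including none) and a trailing `$` anchors the rule to the end of the path. It is a sketch only, not the Heritrix implementation; the class name is hypothetical, and percent-encoding normalization and longest-match rule selection are deliberately left out.

```java
/** Hypothetical helper for illustration only, not the Heritrix robots.txt code. */
final class RobotsPatternSketch {

    /** Returns true if the URL path matches a single Allow/Disallow pattern. */
    static boolean matches(String pattern, String path) {
        boolean anchored = pattern.endsWith("$");
        if (anchored) {
            pattern = pattern.substring(0, pattern.length() - 1);
        }
        // Literal pieces between wildcards; the -1 keeps trailing empty pieces.
        String[] parts = pattern.split("\\*", -1);

        // A rule always matches from the start of the path.
        if (!path.startsWith(parts[0])) {
            return false;
        }
        int pos = parts[0].length();
        if (parts.length == 1) {
            return !anchored || pos == path.length();
        }
        // Middle pieces: take the leftmost occurrence after the previous piece.
        for (int i = 1; i < parts.length - 1; i++) {
            int found = path.indexOf(parts[i], pos);
            if (found < 0) {
                return false;
            }
            pos = found + parts[i].length();
        }
        String last = parts[parts.length - 1];
        if (anchored) {
            // The final piece must sit at the very end without overlapping.
            return path.length() - last.length() >= pos && path.endsWith(last);
        }
        return path.indexOf(last, pos) >= 0;
    }
}
```

With this sketch, `/*.php$` matches `/index.php` but not `/index.php?x=1`, while a plain `/fish` still matches `/fishing` because rules without `$` are prefix matches.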

Fixes

  • Code editor: The configuration editor and script console were upgraded to CodeMirror 6. This resolves some browser incompatibilities, allowing CodeMirror’s own find function to be re-enabled for reliable text search of content far outside the viewport. #651

  • BDB shutdown interrupt handling: The thread’s interrupted flag is now cleared before some BDB interactions to reduce the likelihood of environment invalidation when requestCrawlStop() is called repeatedly. #659

  • FetchHTTP2: Fixed gzip alert log messages by configuring HttpClient not to decode gzip-encoded responses.
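
The BDB shutdown interrupt handling fix above relies on clearing the thread's interrupted flag before touching the database. The general pattern can be sketched as below. The class and method names are hypothetical, and restoring the flag afterwards is an assumption of this example; the release note only says the flag is cleared.

```java
/** Hypothetical helper for illustration only, not the Heritrix BDB code. */
final class InterruptGuardSketch {

    /**
     * Runs an operation with the current thread's interrupted flag cleared,
     * so interrupt-sensitive I/O inside it does not see a stale interrupt.
     */
    static void runWithInterruptCleared(Runnable operation) {
        // Thread.interrupted() returns the flag and clears it in one step.
        boolean wasInterrupted = Thread.interrupted();
        try {
            operation.run();
        } finally {
            if (wasInterrupted) {
                // Assumption: re-assert the interrupt so callers still see it.
                Thread.currentThread().interrupt();
            }
        }
    }
}
```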

Removals

  • Removed Apache HttpClient 3: If you have custom Heritrix modules you may need to update the following
    class references in your code:

    Removed                                       Replacement
    org.apache.commons.httpclient.URIException    org.archive.url.URIException
    org.apache.commons.httpclient.Header          org.archive.format.http.HttpHeader

    Note that Apache HttpClient 4 (org.apache.http) was not removed. #652

Dependency Upgrades

  • codemirror: 2.23 → 6
  • easymock: 5.5.0 → removed
  • groovy: 4.0.26 → 4.0.27
  • junit: 5.12.2 → 5.13.1
  • kafka-clients: 3.9.0 → 3.9.1
  • spring: 6.2.6 → 6.2.7
  • webarchive-commons: 1.3.0 → 2.0.1

3.9.0

13 May 04:49
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • FetchHTTP2: Added a new fetch module supporting HTTP/2 and HTTP/3. #649

Fixes

  • Fixed HighestUriPrecedenceProvider: Added Histotable serializer and Kryo autoregistration. #647

Changes

  • JUnit 5: Upgraded all JUnit 3 and 4 style tests to JUnit 5. #650

Dependency Upgrades

  • commons-io: 2.18.0 → 2.19.0
  • gson: 2.12.1 → 2.13.1
  • jetty: 9.4.19.v20190610 → 12.0.17
  • jsch: 0.2.24 → 2.27.0
  • junit: 4.13.2 → 5.12.2
  • pdfbox: 3.0.4 → 3.0.5
  • restlet: 2.5.0 → 2.6.0-RC1
  • spring: 6.2.5 → 6.2.6

3.8.0

01 Apr 12:18
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New Features

  • ExtractorYoutubeDL processArguments: New option for overriding the default yt-dlp process arguments. #644

Fixes

  • Slow tests: Fixed ObjectIdentityBdbManualCacheTest so it no longer fails when running tests with -DrunSlowTests=true. #643
  • Test stability: Disabled FetchHTTPTest.testHostHeaderDefaultPort due to sporadic test failures.
  • Code cleanup: Fixed some compiler and IDE warnings. Removed unused utility classes (JavaLiterals, LogUtils). #645

Dependency Upgrades

  • amqp-client: 5.24.0 → 5.25.0
  • beanshell: 2.0b5 → 2.0b6
  • commons-codec: 1.17.2 → 1.18.0
  • dnsjava: 3.6.2 → 3.6.3
  • groovy: 4.0.24 → 4.0.26
  • gson: 2.11.0 → 2.12.1
  • jsch: 0.2.22 → 0.2.24
  • pdfbox: 3.0.3 → 3.0.4
  • slf4j: 2.0.16 → 2.0.17
  • spring: 6.1.16 → 6.2.5

3.7.0

03 Feb 05:26
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New Features

  • Groovy crawl configs (experimental): Groovy Bean Definition DSL can now be used as an experimental alternative to Spring XML. This enables more terse and human-readable job configuration with inline scripting capabilities. There is no user interface for it in this release. For now, you must manually create a crawler-beans.groovy file in your job directory. #632

  • ExtractorHTML obeyRelNofollow: This option skips extraction of links marked rel=nofollow. This is useful for avoiding crawler traps on some sites. #638
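
The obeyRelNofollow check can be pictured roughly as follows. This is a sketch, not the ExtractorHTML code: the class and method names are hypothetical, and treating `rel` as a case-insensitive, whitespace-separated token list is an assumption based on how HTML defines the attribute.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only, not the ExtractorHTML code. */
final class RelNofollowSketch {

    /** Returns true if the rel attribute value contains the nofollow token. */
    static boolean hasNofollow(String rel) {
        if (rel == null) {
            return false;
        }
        // rel holds space-separated tokens, e.g. "noopener nofollow".
        for (String token : rel.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.equals("nofollow")) {
                return true;
            }
        }
        return false;
    }
}
```

An extractor with the option enabled would then skip any link for which this check returns true.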

Fixes

  • Cookie rejected warning: The slf4j change in 3.6.0 inadvertently caused a previously hidden warning to be logged to job.log when a server sends a Set-Cookie header with a disallowed domain value. This warning is now suppressed since it occurs frequently and does not require any action from the crawl operator. #640

Changes

  • Removed fastutil: A small number of usages of fastutil were replaced with standard library equivalents in webarchive-commons and Heritrix. This reduced the Heritrix distribution size from 51 MB to 34 MB. iipc/webarchive-commons#101

Dependency Upgrades

  • amqp-client 5.24.0
  • commons-codec 1.17.2
  • ftpserver-core 1.2.1
  • freemarker 2.3.34
  • jetty 9.4.57.v20241219
  • jsch 0.2.22
  • restlet 2.5.0
  • spring 6.1.16
  • webarchive-commons 1.3.0

3.6.0

29 Nov 12:08
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

Java Compatibility Notice

This release of Heritrix requires Java 17 or later.

New Features

  • Automatic Checkpoints on Shutdown: Added checkpointOnShutdown option to CheckpointService to enable automatic checkpoints if Heritrix is gracefully terminated. #626
  • Command-Line Checkpoint Selection: The --checkpoint command-line option restarts from a named checkpoint when using the --run-job option. #626
  • ConfigurableExtractorJS forceStrictIfUrlMatchingRegexList: URLs matching the regular expressions on this list will be processed in strict mode, with only absolute URLs extracted, not relative ones. #624
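
The forceStrictIfUrlMatchingRegexList behaviour can be sketched as a two-step decision: pick strict mode when the page URL matches any listed expression, then keep only absolute URLs while in strict mode. The class and method names are hypothetical, and both the use of a full match (rather than a partial one) and the scheme-based test for "absolute" are assumptions of this example.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only, not the ConfigurableExtractorJS code. */
final class StrictModeSketch {

    // Assumption: "absolute" means the candidate starts with scheme://
    private static final Pattern ABSOLUTE =
            Pattern.compile("[A-Za-z][A-Za-z0-9+.-]*://");

    /** Strict mode applies if it is the default or the URL matches any listed regex. */
    static boolean useStrict(boolean defaultStrict, List<Pattern> forceStrict, String url) {
        if (defaultStrict) {
            return true;
        }
        for (Pattern pattern : forceStrict) {
            if (pattern.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    /** In strict mode only candidates for which this returns true would be kept. */
    static boolean isAbsolute(String candidate) {
        return ABSOLUTE.matcher(candidate).lookingAt();
    }
}
```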

Changes

  • Upgraded to Spring Framework 6.1: The Spring @Required annotation has been removed, so it was replaced with a custom implementation to maintain backward compatibility with existing crawl configurations. Spring 6 requires Java 17, so Heritrix now does too. #625

Fixes

  • Manifest Hop Priority: Links from sitemaps are now given the same priority as normal navigation links. They were incorrectly being prioritized as transitive hops (embeds). #623
  • SLF4J Logging: Heritrix now includes slf4j-jdk14 to eliminate a startup warning message and fix logging for dependencies (such as crawler-commons) that use SLF4J. Heritrix doesn't use SLF4J itself. #628

Dependency Upgrades

  • amqp-client 5.23.0
  • commons-cli 1.9.0
  • commons-codec 1.17.1
  • commons-io 2.18.0
  • commons-net 3.11.1
  • crawler-commons 1.4
  • dnsjava 3.6.2
  • easymock 5.5.0
  • freemarker 2.3.33
  • groovy 4.0.24
  • gson 2.11.0
  • httpcomponents 4.5.14
  • java-socks-proxy-server 4.1.2
  • java-websocket removed
  • jaxb-runtime 4.0.5
  • jsch switched to mwiede fork 0.2.21
  • junit 4.13.2
  • kafka-clients 3.9.0
  • kryo 5.6.2
  • pdfbox 3.0.3
  • slf4j 2.0.16
  • spring-framework 6.1.15
  • webarchive-commons 1.2.0

3.5.0

29 Oct 06:58
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

End of interim releases

This release drops the term "interim release", which distinguished releases made temporarily by the community in the absence of releases made by Internet Archive. The community releases have effectively become the official releases.

In conjunction with this, the version numbers, which were paused at 3.4.0 for the interim releases, have resumed incrementing under the scheme major.minor.patch, with the minor number incremented when features are added or removed.

Java compatibility notice

This will likely be the last release of Heritrix compatible with Java 8. The next release is expected to require Java 17 or later.

Changes in this release

Removals

  • Removed HBase modules from contrib. #621

Fixes

  • ConfigurableExtractorJS: Set default value (false) for strict property. #612
  • ExtractorHTML: Treat cite attribute as a navlink instead of embed. #608
  • Building no longer requires the builds.archive.org or Cloudera repositories. #614
  • Updated to new URL of the restlet repository.

Dependency Upgrades

  • Removed hbase, joda-time, log4j
  • commons-io 2.14.0
  • kafka-clients 3.8.0
  • ftpserver-core 1.2.0
  • jetty 9.4.56.v20240826
  • webarchive-commons 1.1.10