3.10.0

ato released this 12 Jun 13:22

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • BrowserProcessor: Loads fetched pages in a local browser (Firefox/ChromeDriver), records all browser requests,
    and runs pluggable behaviors (e.g. scrolling, link extraction). #653

    • Uses the WebDriver BiDi protocol for browser automation.
    • The recording proxy is built on Jetty's ProxyHandler and the FetchHTTP2 module.
    • Status: Working for small crawls but needs more robust error handling (browser crashes, resource limits).
  • Basic web auth: You can now switch the web interface from Digest authentication to Basic authentication with the --web-auth basic command-line option. This is useful when running Heritrix behind a reverse proxy that adds external authentication. #654

  • Robots.txt wildcards: The * and $ wildcard rules from RFC 9309 are now supported. #656

  • FetchHTTP2: Added HTTP proxy support. #657
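
As an illustration of the robots.txt wildcard rules mentioned above, here is a minimal sketch of RFC 9309 style path matching, where * matches any run of characters and a trailing $ anchors the rule to the end of the path. This is not Heritrix's implementation; the class and method names are hypothetical, and treating a $ that is not the last character as a literal is an assumption of this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix: shows the matching semantics only.
final class RobotsWildcardSketch {

    /** Returns true if the URL path is matched by the robots.txt rule path. */
    static boolean ruleMatches(String rule, String path) {
        // A trailing '$' means the rule must match the whole path.
        boolean anchored = rule.endsWith("$");
        String body = anchored ? rule.substring(0, rule.length() - 1) : rule;

        // Translate the rule into a regex: '*' becomes ".*", everything else is literal.
        StringBuilder regex = new StringBuilder();
        for (char c : body.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        Pattern p = Pattern.compile(regex.toString(), Pattern.DOTALL);

        // Without '$', rules are prefix matches, so only the start of the path must match.
        return anchored ? p.matcher(path).matches() : p.matcher(path).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(ruleMatches("/fish", "/fishing"));                   // true
        System.out.println(ruleMatches("/fish$", "/fishing"));                  // false
        System.out.println(ruleMatches("/private/*.pdf$", "/private/a/b.pdf")); // true
        System.out.println(ruleMatches("/private/*.pdf$", "/private/b.pdf.x")); // false
        System.out.println(ruleMatches("/private/*.pdf", "/private/b.pdf.x"));  // true
    }
}
```

A full robots.txt evaluator would also need to pick the most specific matching rule among the Allow and Disallow lines, which this sketch leaves out.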

Fixes

  • Code editor: The configuration editor and script console were upgraded to CodeMirror 6. This resolves some browser incompatibilities, allowing CodeMirror’s own find function to be re-enabled for reliable text search of content far outside the viewport. #651

  • BDB shutdown interrupt handling: The thread’s interrupted flag is now cleared before some BDB interactions to reduce the likelihood of environment invalidation when requestCrawlStop() is called repeatedly. #659

  • FetchHTTP2: Fixed gzip alert log messages by configuring HttpClient not to decode gzip-encoded responses.
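
The interrupt handling fix above relies on the fact that Thread.interrupted() both reads and clears the calling thread's interrupted flag. The following is a minimal sketch of that pattern, not the actual Heritrix change: the class and method names are hypothetical, the Supplier stands in for a BDB call, and restoring the flag afterwards is an assumption of this sketch that the release notes do not state.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Heritrix or BDB.
final class InterruptClearSketch {

    /** Runs the action with the interrupted flag cleared, then restores the flag. */
    static <T> T callWithFlagCleared(Supplier<T> action) {
        // Thread.interrupted() returns the current flag value and resets it to false.
        boolean wasInterrupted = Thread.interrupted();
        try {
            return action.get(); // stand-in for an interrupt-sensitive storage call
        } finally {
            if (wasInterrupted) {
                // Re-assert the flag so callers further up still see the interrupt.
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // simulate a pending stop request
        boolean seenInside =
                callWithFlagCleared(() -> Thread.currentThread().isInterrupted());
        System.out.println(seenInside);           // false
        System.out.println(Thread.interrupted()); // true
    }
}
```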

Removals

  • Removed Apache HttpClient 3: If you have custom Heritrix modules, you may need to update the following
    class references in your code:

    Removed                                      Replacement
    org.apache.commons.httpclient.URIException   org.archive.url.URIException
    org.apache.commons.httpclient.Header         org.archive.format.http.HttpHeader

    Note that Apache HttpClient 4 (org.apache.http) was not removed. #652

Dependency Upgrades

  • codemirror: 2.23 → 6
  • easymock: 5.5.0 → removed
  • groovy: 4.0.26 → 4.0.27
  • junit: 5.12.2 → 5.13.1
  • kafka-clients: 3.9.0 → 3.9.1
  • spring: 6.2.6 → 6.2.7
  • webarchive-commons: 1.3.0 → 2.0.1