Releases: internetarchive/heritrix3
3.12.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New features
- ConfigurableExtractorJS: Regex rules to skip extracting <script> tags when their attributes match. #672
Bug fixes
- Docs: Switch bean docs generation to an annotation processor, fixing the bean reference broken by Java language changes. #683
- StatisticsTracker: Don’t restore crawlEndTime when resuming from a checkpoint. #669
- ExtractorJS: Fix overriding the strict setting in sheets. #670
- Berkeley DB: Handle more shutdown interrupts gracefully. #671
Dependency upgrades
- amqp-client: 5.26.0 → 5.27.0
- groovy: 4.0.28 → 5.0.2
- jaxb-runtime: 4.0.5 → 4.0.6
- jetty: 12.0.27 → 12.0.29
- jsch: 2.27.3 → 2.27.4
- junit-jupiter: 5.13.4 → 6.0.0
- kafka-clients: 3.9.1 → 4.1.0
- pdfbox: 3.0.5 → 3.0.6
- rethinkdb-driver: 2.3.3 → 2.4.4
- spring: 6.2.11 → 6.2.12
- webarchive-commons: 3.0.0 → 3.0.1
- webjars-locator-lite: 1.1.0 → 1.1.2
3.11.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New features
- KnowledgableExtractorJS now extends ConfigurableExtractorJS for its additional options. #668
Bug fixes
- Invalid characters are now stripped from the XML REST API output. Log file truncation after an unclean shutdown can sometimes introduce such characters. #667
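
As an illustration of the XML fix above, the sketch below removes code points that the XML 1.0 `Char` production does not allow (for example the NUL bytes that a truncated log file can contain). It is a minimal sketch, not Heritrix's actual implementation; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of Heritrix: strips characters that are
// illegal in XML 1.0 documents before the text is written to XML output.
final class XmlSanitizer {
    private XmlSanitizer() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every disallowed code point removed. */
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
                .filter(XmlSanitizer::isValidXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Working on code points rather than `char` values means unpaired surrogates are dropped while valid supplementary characters are kept.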
Dependency upgrades
- codemirror@language: 6.11.2 → 6.11.3
- jakarta.xml.bind-api: 4.0.2 → 4.0.4
- jetty: 12.0.25 → 12.0.27
- jsch: 2.27.2 → 2.27.3
- gson: 2.13.1 → 2.13.2
- spring: 6.2.10 → 6.2.11
3.10.2
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
Bug fixes
- AMQPPublishProcessor: The User-Agent string is now included in the metadata so Umbra can use it in its own requests. #663
- FetchDNS: DNS lookups returning 0.0.0.0 are now treated as resolution failure. #665
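
The FetchDNS change can be pictured with the standard `InetAddress.isAnyLocalAddress()` check, which is true for the wildcard address 0.0.0.0. This is a minimal sketch under the assumption that a lookup result is available as an `InetAddress`; the class and method names are hypothetical and do not reflect how FetchDNS is actually written.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Heritrix.
final class DnsResultCheck {
    private DnsResultCheck() {}

    /** Treats a missing result or the wildcard address (0.0.0.0) as a failed lookup. */
    static boolean isUsableAddress(InetAddress address) {
        return address != null && !address.isAnyLocalAddress();
    }

    /** Builds an IPv4 address from four octets without performing a DNS lookup. */
    static InetAddress ipv4(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(
                    new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            // getByAddress only throws for arrays of illegal length
            throw new IllegalStateException(e);
        }
    }
}
```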
Dependency upgrades
- amqp-client: 5.25.0 → 5.26.0
- codemirror@language: 6.11.1 → 6.11.2
- codemirror@legacy-modes: 6.5.0 → 6.5.1
- codemirror@view: 6.37.2 → 6.38.1
- commons-cli: 1.9.0 → 1.10.0
- commons-codec: 1.18.0 → 1.19.0
- commons-net: 3.11.1 → 3.12.0
- jetty: 12.0.22 → 12.0.25
- junit-jupiter: 5.13.3 → 5.13.4
- groovy: 4.0.27 → 4.0.28
- spring-framework: 6.2.9 → 6.2.10
3.10.1
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
Bug fixes
- FetchHTTP2
  - HTTP/1.1 is now used on servers that don't support ALPN. Fixes IOException: frame_size_error/invalid_frame_length
  - Fixed NullPointerException when the server's IP address isn't available.
- Seeds report: Redirect URIs are now recorded from the Location header for HTTP status codes 303 See Other, 307 Temporary Redirect and 308 Permanent Redirect. Previously this was only done for 301 Moved Permanently and 302 Found.
- Public suffixes list: A resource naming conflict between webarchive-commons and crawler-commons for effective_tld_names.dat was resolved and the list was updated to the latest version.
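
To make the seeds report change concrete, the sketch below resolves a Location header against the request URI for the five redirect status codes listed above. It is an illustrative sketch only; the class and method names are hypothetical, and Heritrix's own URI handling is not shown.

```java
import java.net.URI;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of Heritrix.
final class RedirectStatus {
    private RedirectStatus() {}

    // 301 and 302 were already handled; 303, 307 and 308 are the newly covered codes.
    private static final Set<Integer> LOCATION_REDIRECTS = Set.of(301, 302, 303, 307, 308);

    static boolean isLocationRedirect(int status) {
        return LOCATION_REDIRECTS.contains(status);
    }

    /**
     * Returns the redirect target, resolved against the request URI, or an empty
     * Optional if the status is not a Location-based redirect or no header was sent.
     * A malformed Location value makes URI.resolve throw IllegalArgumentException.
     */
    static Optional<URI> redirectTarget(URI requestUri, int status, String locationHeader) {
        if (!isLocationRedirect(status) || locationHeader == null) {
            return Optional.empty();
        }
        return Optional.of(requestUri.resolve(locationHeader.trim()));
    }
}
```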
Dependency upgrades
- codemirror@state: 6.4.0 → 6.5.11
- codemirror@view: 6.37.1 → 6.37.2
- commons-lang: 2.6 → 3.18.0
- commons-io: 2.19.0 → 2.20.0
- crawler-commons: 1.4 → 1.5
- jetty: 12.0.17 → 12.0.22
- jsch: 2.27.0 → 2.27.2
- junit-jupiter: 5.13.2 → 5.13.3
- restlet: 2.6.0-rc1 → 2.6.0
- spring: 6.2.7 → 6.2.9
- webarchive-commons: 2.0.1 → 3.0.0
3.10.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New features
- BrowserProcessor: Loads fetched pages in a local browser (Firefox/ChromeDriver), records all browser requests, and runs pluggable behaviors (e.g. scrolling, link extraction). #653
  - Uses the WebDriver BiDi protocol for browser automation.
  - The recording proxy is built on Jetty's ProxyHandler and the FetchHTTP2 module.
  - Status: Working for small crawls but needs more robust error handling (browser crashes, resource limits).
- Basic web auth: You can now switch the web interface from Digest authentication to Basic authentication with the --web-auth basic command-line option. This is useful when running Heritrix behind a reverse proxy that adds external authentication. #654
- Robots.txt wildcards: The * and $ wildcard rules from RFC 9309 are now supported. #656
- FetchHTTP2: Added HTTP proxy support. #657
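
The robots.txt wildcard rules can be sketched as follows: `*` matches any run of characters, a trailing `$` anchors the rule to the end of the path, and rules without `$` match as prefixes. This is a minimal sketch of RFC 9309 style matching, not the code Heritrix uses; the class name is hypothetical, and percent-encoding normalization and longest-match rule selection are deliberately left out.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Heritrix.
final class RobotsPathPattern {
    private final Pattern regex;

    RobotsPathPattern(String rule) {
        // Assumption: only a trailing '$' is treated as the end-of-path anchor.
        boolean anchored = rule.endsWith("$");
        String body = anchored ? rule.substring(0, rule.length() - 1) : rule;

        // Quote the literal pieces and join them with ".*" in place of each '*'.
        String joined = Arrays.stream(body.split("\\*", -1))
                .map(Pattern::quote)
                .collect(Collectors.joining(".*"));

        // Rules without '$' are prefix matches, so anything may follow.
        this.regex = Pattern.compile(joined + (anchored ? "" : ".*"), Pattern.DOTALL);
    }

    /** Returns true if the rule applies to the given URL path (including any query string). */
    boolean matches(String path) {
        return regex.matcher(path).matches();
    }
}
```

For example, the rule `/*.php$` applies to `/index.php` but not to `/index.php?x=1`, while `/fish*.php` applies to both `/fish.php` and `/fish/index.php?x=1`.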
Fixes
- Code editor: The configuration editor and script console were upgraded to CodeMirror 6. This resolves some browser incompatibilities, allowing CodeMirror’s own find function to be re-enabled for reliable text search of content far outside the viewport. #651
- BDB shutdown interrupt handling: The thread’s interrupted flag is now cleared before some BDB interactions to reduce the likelihood of environment invalidation when requestCrawlStop() is called repeatedly. #659
- FetchHTTP2: Fixed gzip alert log messages by configuring HttpClient to not decode gzip encoding from response.
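
The interrupt handling fix relies on a standard Java idiom: `Thread.interrupted()` returns the current thread's interrupted flag and clears it. The sketch below shows that idiom around an arbitrary action. The wrapper class is hypothetical, and restoring the flag afterwards is an assumption of this sketch rather than something the release note states.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Heritrix.
final class InterruptShield {
    private InterruptShield() {}

    /** Runs the action with the interrupted flag cleared so the action does not observe a pending interrupt. */
    static <T> T callUninterrupted(Supplier<T> action) {
        boolean wasInterrupted = Thread.interrupted(); // reads and clears the flag
        try {
            return action.get();
        } finally {
            if (wasInterrupted) {
                // Assumption: re-set the flag so callers can still see the interrupt.
                Thread.currentThread().interrupt();
            }
        }
    }
}
```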
Removals
- Removed Apache HttpClient 3: If you have custom Heritrix modules you may need to update the following class references in your code:

  | Removed | Replacement |
  | --- | --- |
  | org.apache.commons.httpclient.URIException | org.archive.url.URIException |
  | org.apache.commons.httpclient.Header | org.archive.format.http.HttpHeader |

  Note that Apache HttpClient 4 (org.apache.http) was not removed. #652
Dependency Upgrades
- codemirror: 2.23 → 6
- easymock: 5.5.0 → removed
- groovy: 4.0.26 → 4.0.27
- junit: 5.12.2 → 5.13.1
- kafka-clients: 3.9.0 → 3.9.1
- spring: 6.2.6 → 6.2.7
- webarchive-commons: 1.3.0 → 2.0.1
3.9.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New features
- FetchHTTP2: Added a new fetch module supporting HTTP/2 and HTTP/3. #649
Fixes
- Fixed HighestUriPrecedenceProvider: Added Histotable serializer and Kryo autoregistration. #647
Changes
- JUnit 5: Upgraded all JUnit 3 and 4 style tests to JUnit 5. #650
Dependency Upgrades
- commons-io: 2.18.0 → 2.19.0
- gson: 2.12.1 → 2.13.1
- jetty: 9.4.19.v20190610 → 12.0.17
- jsch: 0.2.24 → 2.27.0
- junit: 4.13.2 → 5.12.2
- pdfbox: 3.0.4 → 3.0.5
- restlet: 2.5.0 → 2.6.0-RC1
- spring: 6.2.5 → 6.2.6
3.8.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New Features
- ExtractorYoutubeDL processArguments: New option for overriding the default yt-dlp process arguments. #644
Fixes
- Slow tests: Fixed ObjectIdentityBdbManualCacheTest so it no longer fails when running tests with -DrunSlowTests=true. #643
- Test stability: Disabled FetchHTTPTest.testHostHeaderDefaultPort due to sporadic test failures.
- Code cleanup: Fixed some compiler and IDE warnings. Removed unused utility classes (JavaLiterals, LogUtils). #645
Dependency Upgrades
- amqp-client: 5.24.0 → 5.25.0
- beanshell: 2.0b5 → 2.0b6
- commons-codec: 1.17.2 → 1.18.0
- dnsjava: 3.6.2 → 3.6.3
- groovy: 4.0.24 → 4.0.26
- gson: 2.11.0 → 2.12.1
- jsch: 0.2.22 → 0.2.24
- pdfbox: 3.0.3 → 3.0.4
- slf4j: 2.0.16 → 2.0.17
- spring: 6.1.16 → 6.2.5
3.7.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New Features
- Groovy crawl configs (experimental): Groovy Bean Definition DSL can now be used as an experimental alternative to Spring XML. This enables more terse and human-readable job configuration with inline scripting capabilities. There is no user interface for it in this release. For now, you must manually create a crawler-beans.groovy file in your job directory. #632
- ExtractorHTML obeyRelNofollow: This option skips extraction of links marked rel=nofollow. This is useful for avoiding crawler traps on some sites. #638
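
The check behind an option like obeyRelNofollow can be sketched as below. In HTML, `rel` holds a whitespace-separated list of case-insensitive tokens, so a link counts as nofollow if any token equals `nofollow`. This is an illustrative sketch only; the class and method names are hypothetical, and ExtractorHTML's actual parsing is not shown.

```java
// Hypothetical helper, not part of Heritrix.
final class RelNofollow {
    private RelNofollow() {}

    /** Returns true if the rel attribute value contains the nofollow token. */
    static boolean hasNofollow(String relValue) {
        if (relValue == null) {
            return false;
        }
        for (String token : relValue.trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }
}
```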
Fixes
- Cookie rejected warning: The slf4j change in 3.6.0 inadvertently caused a previously hidden warning to be logged to job.log when a server sends a Set-Cookie header with a disallowed domain value. This warning is now suppressed since it occurs frequently and does not require any action from the crawl operator. #640
Changes
- Removed fastutil: A small number of usages of fastutil were replaced with standard library equivalents in webarchive-commons and Heritrix. This reduced the Heritrix distribution size from 51 MB to 34 MB. iipc/webarchive-commons#101
Dependency Upgrades
- amqp-client 5.24.0
- commons-codec 1.17.2
- ftpserver-core 1.2.1
- freemarker 2.3.34
- jetty 9.4.57.v20241219
- jsch 0.2.22
- restlet 2.5.0
- spring 6.1.16
- webarchive-commons 1.3.0
3.6.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
Java Compatibility Notice
This release of Heritrix requires Java 17 or later.
New Features
- Automatic Checkpoints on Shutdown: Added checkpointOnShutdown option to CheckpointService to enable automatic checkpoints if Heritrix is gracefully terminated. #626
- Command-Line Checkpoint Selection: The --checkpoint command-line option restarts from a named checkpoint when using the --run-job option. #626
- ConfigurableExtractorJS forceStrictIfUrlMatchingRegexList: URLs matching the regular expressions on this list will be processed in strict mode, with only absolute URLs extracted, not relative ones. #624
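
The forceStrictIfUrlMatchingRegexList behavior can be pictured with the sketch below: a page URL is checked against a list of regular expressions, and in strict mode only candidates that look like absolute URLs are kept. All names are hypothetical. Whether Heritrix uses full matches or partial matches, and how it recognizes absolute URLs, are assumptions of this sketch.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Heritrix.
final class StrictModeRules {
    private final List<Pattern> forceStrictPatterns;

    StrictModeRules(List<String> regexes) {
        this.forceStrictPatterns = regexes.stream()
                .map(Pattern::compile)
                .collect(Collectors.toList());
    }

    /** Assumption: a page is strict if its whole URL matches any configured regex. */
    boolean isStrictFor(String pageUrl) {
        return forceStrictPatterns.stream()
                .anyMatch(p -> p.matcher(pageUrl).matches());
    }

    /** Assumption: "absolute" means the candidate starts with an http or https scheme. */
    static boolean looksAbsolute(String candidate) {
        String lower = candidate.toLowerCase();
        return lower.startsWith("http://") || lower.startsWith("https://");
    }

    /** In strict mode only absolute candidates are accepted; otherwise every candidate is. */
    boolean accepts(String pageUrl, String candidate) {
        return !isStrictFor(pageUrl) || looksAbsolute(candidate);
    }
}
```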
Changes
- Upgraded to Spring Framework 6.1: The Spring @Required annotation has been removed, so it was replaced with a custom implementation to maintain backward compatibility with existing crawl configurations. Spring 6 requires Java 17 so Heritrix does now too. #625
Fixes
- Manifest Hop Priority: Links from sitemaps are now given the same priority as normal navigation links. They were incorrectly being prioritized as transitive hops (embeds). #623
- SLF4J Logging: Heritrix now includes slf4j-jdk14 to eliminate a startup warning message and fix logging for dependencies (such as crawler-commons) that use SLF4J. Heritrix doesn't use SLF4J itself. #628
Dependency Upgrades
- amqp-client 5.23.0
- commons-cli 1.9.0
- commons-codec 1.17.1
- commons-io 2.18.0
- commons-net 3.11.1
- crawler-commons 1.4
- dnsjava 3.6.2
- easymock 5.5.0
- freemarker 2.3.33
- groovy 4.0.24
- gson 2.11.0
- httpcomponents 4.5.14
- java-socks-proxy-server 4.1.2
- java-websocket removed
- jaxb-runtime 4.0.5
- jsch switched to mwiede fork 0.2.21
- junit 4.13.2
- kafka-clients 3.9.0
- kryo 5.6.2
- pdfbox 3.0.3
- slf4j 2.0.16
- spring-framework 6.1.15
- webarchive-commons 1.2.0
3.5.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
End of interim releases
This release drops the term "interim release" which distinguished releases made temporarily by the community in the absence of releases made by Internet Archive. The community releases have effectively become the official releases.
In conjunction with this, the version numbers, which were paused at 3.4.0 for the interim releases, have now resumed incrementing following the major.minor.patch scheme, with the minor release number incremented when features are added or removed.
Java compatibility notice
This will likely be the last release of Heritrix compatible with Java 8. The next release is expected to require Java 17 or later.
Changes in this release
Removals
- Removed HBase modules from contrib. #621
Fixes
- ConfigurableExtractorJS: Set default value (false) for strict property. #612
- ExtractorHTML: Treat cite attribute as a navlink instead of embed. #608
- Building no longer requires the builds.archive.org or Cloudera repositories. #614
- Updated to new URL of the restlet repository.
Dependency Upgrades
- Removed hbase, joda-time, log4j
- commons-io 2.14.0
- kafka-clients 3.8.0
- ftpserver-core 1.2.0
- jetty 9.4.56.v20240826
- webarchive-commons 1.1.10