Hi!
I'm crawling with a slightly less polite configuration than the default, and I frequently get the following warning (1971 times in the 12 hours I've been crawling):
Mar 27, 2023 10:43:53 AM org.archive.modules.CrawlURI getPolitenessDelay
WARNING: politessDelay unset, returning default 5000 for https://www.unidavi.edu.br/fiqueAtento/2023/3/pedidos-vagas-1-2023-fora-do-prazo-07 (in thread 'ToeThread #163: https://www.unidavi.edu.br/fiqueAtento/2023/3/pedidos-vagas-1-2023-fora-do-prazo-07')
Is this expected? The politeness-related settings I've changed are listed below, with the defaults left in comments:
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <!-- <property name="timeoutSeconds" value="1200" /> -->
  <property name="timeoutSeconds" value="300" /> <!-- 5 min -->
</bean>
<bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor">
  <!-- <property name="delayFactor" value="5.0" /> -->
  <property name="delayFactor" value="2.0" />
  <!-- <property name="minDelayMs" value="3000" /> -->
  <property name="minDelayMs" value="1000" /> <!-- 1 sec -->
  <!-- <property name="respectCrawlDelayUpToSeconds" value="300" /> -->
  <property name="respectCrawlDelayUpToSeconds" value="100" />
  <!-- <property name="maxDelayMs" value="30000" /> -->
  <property name="maxDelayMs" value="10000" /> <!-- 10 sec -->
</bean>
<bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
  <!-- <property name="snoozeLongMs" value="300000" /> -->
  <property name="snoozeLongMs" value="250000" /> <!-- 250 sec, about 4.2 min -->
  <!-- <property name="retryDelaySeconds" value="900" /> -->
  <property name="retryDelaySeconds" value="300" /> <!-- 5 min -->
  <!-- <property name="maxRetries" value="30" /> -->
  <property name="maxRetries" value="3" /> <!-- Should be increased for long-running crawls (e.g. months) -->
</bean>

Thank you!
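
For reference, this is roughly how I understand the `disposition` settings above to combine into a per-URI delay. It is only a minimal sketch under my own assumptions, not code copied from Heritrix: the class `PolitenessDelaySketch` and the method `politenessDelayMs` are hypothetical names, and the assumed rule is "scale the last fetch duration by `delayFactor`, clamp it between `minDelayMs` and `maxDelayMs`, then extend it to a robots.txt Crawl-delay capped at `respectCrawlDelayUpToSeconds`". It does not try to explain the warning itself.

```java
public class PolitenessDelaySketch {
    // Values taken from the "disposition" bean in this issue.
    static final double DELAY_FACTOR = 2.0;
    static final long MIN_DELAY_MS = 1000;
    static final long MAX_DELAY_MS = 10000;
    static final long RESPECT_CRAWL_DELAY_UP_TO_SECONDS = 100;

    /**
     * Hypothetical helper (an assumption about the rule, not the Heritrix API).
     *
     * @param fetchDurationMs    how long the previous fetch took, in ms
     * @param robotsCrawlDelayMs Crawl-delay from robots.txt in ms, or 0 if none
     * @return estimated wait before the next fetch from the same queue, in ms
     */
    public static long politenessDelayMs(long fetchDurationMs, long robotsCrawlDelayMs) {
        // Scale the fetch duration, then clamp to [min, max].
        long wait = (long) (DELAY_FACTOR * fetchDurationMs);
        wait = Math.max(wait, MIN_DELAY_MS);
        wait = Math.min(wait, MAX_DELAY_MS);

        // Honour the robots.txt Crawl-delay, but only up to the configured cap.
        long capMs = RESPECT_CRAWL_DELAY_UP_TO_SECONDS * 1000;
        long respected = Math.min(robotsCrawlDelayMs, capMs);
        return Math.max(wait, respected);
    }

    public static void main(String[] args) {
        System.out.println(politenessDelayMs(200, 0));        // fast fetch: raised to minDelayMs
        System.out.println(politenessDelayMs(3000, 0));       // 2.0 * 3000 ms, within bounds
        System.out.println(politenessDelayMs(8000, 0));       // slow fetch: capped at maxDelayMs
        System.out.println(politenessDelayMs(200, 300_000));  // large Crawl-delay: capped at 100 s
    }
}
```

If that reading is right, none of my changes should leave the delay unset, which is why the warning and its fallback of 5000 surprised me.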