Commit 47a29da

Fix #1763 - Documentation fixes from #1714 (#1765)
1 parent 2a6930a commit 47a29da

File tree: 4 files changed, +12 -12 lines changed

docs/src/main/asciidoc/architecture.adoc

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@ You may not use this file except in compliance with the License.
You may obtain a copy of the License at:
https://www.apache.org/licenses/LICENSE-2.0
////
-
+[[architecture]]
== Understanding StormCrawler's Architecture

=== Architecture Overview

docs/src/main/asciidoc/configuration.adoc

Lines changed: 1 addition & 1 deletion

@@ -53,7 +53,7 @@ This is what the configuration `http.robots.agents` allows you to do. It is a co

=== Proxy

-StormCrawler's proxy system is built on top of the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SCProxy.java[SCProxy] class and the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/ProxyManager.java[ProxyManager] interface. Every proxy used in the system is formatted as a **SCProxy**. The **ProxyManager** implementations handle the management and delegation of their internal proxies. At the call of link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/protocol/Protocol.html#getProtocolOutput-java.lang.String-org.apache.stormcrawler.Metadata-[HTTPProtocol.getProtocolOutput()], the link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/proxy/ProxyManager.html#getProxy[ProxyManager.getProxy()] is called to retrieve a proxy for the individual request.
+StormCrawler's proxy system is built on top of the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SCProxy.java[SCProxy] class and the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/ProxyManager.java[ProxyManager] interface. Every proxy used in the system is formatted as a **SCProxy**. The **ProxyManager** implementations handle the management and delegation of their internal proxies. At the call of `Protocol#getProtocolOutput()`, the `ProxyManager.getProxy()` is called to retrieve a proxy for the individual request.

The **ProxyManager** interface can be implemented in a custom class to create custom logic for proxy management and load balancing. The default **ProxyManager** implementation is **SingleProxyManager**. This ensures backwards compatibility for prior StormCrawler releases. To use **MultiProxyManager** or custom implementations, pass the class path and name via the config parameter `http.proxy.manager`:
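As a minimal sketch, the `http.proxy.manager` parameter described above could be set in the crawler configuration like this; the fully-qualified class name is an assumption inferred from the package of the proxy classes linked above, not taken from the source:

```yaml
# crawler-conf.yaml (sketch) -- switch from the default SingleProxyManager
# to a multi-proxy setup; the class name below is an assumption.
http.proxy.manager: "org.apache.stormcrawler.proxy.MultiProxyManager"
```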

docs/src/main/asciidoc/internals.adoc

Lines changed: 9 additions & 9 deletions

@@ -10,9 +10,9 @@ https://www.apache.org/licenses/LICENSE-2.0

The Apache StormCrawler components rely on two Apache Storm streams: the _default_ one and another one called _status_.

-The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., ElasticSearch, HBase, etc...), which is then used by a Spout to send new URLs down the topology.
+The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., OpenSearch, HBase, etc...), which is then used by a Spout to send new URLs down the topology.

-This is critical for building recursive crawls (i.e., you discover new URLs and not just process known ones). The _default_ stream is used for the URL being processed and is generally used at the end of the pipeline by an indexing bolt (which could also be ElasticSearch, HBase, etc...), regardless of whether the crawler is recursive or not.
+This is critical for building recursive crawls (i.e., you discover new URLs and not just process known ones). The _default_ stream is used for the URL being processed and is generally used at the end of the pipeline by an indexing bolt (which could also be OpenSearch, HBase, etc...), regardless of whether the crawler is recursive or not.

Tuples are emitted on the _status_ stream by the parsing bolts for handling outlinks but also to notify that there has been a problem with a URL (e.g., unparsable content). It is also used by the fetching bolts to handle redirections, exceptions, and unsuccessful fetch status (e.g., HTTP code 400).
@@ -29,7 +29,7 @@ As you can see for instance in link:https://github.com/apache/stormcrawler/blob/

The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/Status.java[Status] enum has the following values:

-* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/spout/package-summary.html[spouts] or "injected" into the storage. The URLs can be already known in the storage.
+* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the spouts or "injected" into the storage. The URLs can be already known in the storage.
* REDIRECTION:: set by the fetcher bolts.
* FETCH_ERROR:: set by the fetcher bolts.
* ERROR:: used by either the fetcher, parser, or indexer bolts.
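The values shown in this hunk can be mirrored in a plain-Java sketch; note that the real enum lives in `org.apache.stormcrawler.persistence.Status` and may contain further values that fall outside this hunk:

```java
// Illustrative sketch of the Status values listed in the diff above.
// Not the project's actual class; the hunk only shows these four entries.
enum Status {
    DISCOVERED,  // outlinks found by parsers, or seed/injected URLs
    REDIRECTION, // set by the fetcher bolts
    FETCH_ERROR, // set by the fetcher bolts
    ERROR        // set by the fetcher, parser, or indexer bolts
}
```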
@@ -41,7 +41,7 @@ The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org

The class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt] can be extended to handle status updates for a specific backend. It has an internal cache of URLs with a `discovered` status so that they don't get added to the backend if they already exist, which is a simple but efficient optimisation. It also uses link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/DefaultScheduler.java[DefaultScheduler] to compute a next fetch date and calls MetadataTransfer to filter the metadata that will be stored in the backend.

-In most cases, the extending classes will just need to implement the method `store(String URL, Status status, Metadata metadata, Date nextFetch)` and handle their own initialisation in `prepare()`. You can find an example of a class which extends it in the link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt] for Elasticsearch.
+In most cases, the extending classes will just need to implement the method `store(String URL, Status status, Metadata metadata, Date nextFetch)` and handle their own initialisation in `prepare()`. You can find an example of a class which extends it in the link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt] for OpenSearch.


=== Bolts
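The discovered-URL cache optimisation described in the hunk above can be sketched in plain Java, with no Storm or StormCrawler dependency; the class and method names here are hypothetical, not the project's API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the AbstractStatusUpdaterBolt optimisation:
// DISCOVERED URLs seen recently are skipped instead of being written to
// the backend again. A bounded, access-ordered LRU keeps memory constant.
class DiscoveredCache {
    private final Map<String, Boolean> seen;

    DiscoveredCache(final int maxSize) {
        // access-order LinkedHashMap evicting the eldest entry past maxSize
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                return size() > maxSize;
            }
        };
    }

    /** Returns true if the URL was not cached and should go to the backend. */
    boolean shouldStore(String url) {
        // putIfAbsent returns null only when the key was not present yet
        return seen.putIfAbsent(url, Boolean.TRUE) == null;
    }
}
```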
@@ -71,14 +71,14 @@ The **FetcherBolt** has an internal set of queues where the incoming URLs are pl

Incoming tuples spend very little time in the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java#L768[execute] method of the **FetcherBolt** as they are put in the FetchQueues, which is why you'll find that the value of **Execute latency** in the Storm UI is pretty low. They get acked later on, after they've been fetched. The metric to watch for in the Storm UI is **Process latency**.

-The **SimpleFetcherBolt** does not do any of this, hence its name. It just fetches incoming tuples in its `execute` method and does not do multi-threading. It does enforce politeness by checking when a URL can be fetched and will wait until it is the case. It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of **SimpleFetcherBolt**, often with the help of the link:https:/
+The **SimpleFetcherBolt** does not do any of this, hence its name. It just fetches incoming tuples in its `execute` method and does not do multi-threading. It does enforce politeness by checking when a URL can be fetched and will wait until it is the case. It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of **SimpleFetcherBolt**, often with the help of the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/util/URLPartitioner.java[URLPartitioner].

=== Indexer Bolts
The purpose of crawlers is often to index web pages to make them searchable. The project contains resources for indexing with popular search solutions such as:

-* link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/com/digitalpebble/stormcrawler/solr/bolt/IndexerBolt.java[Apache SOLR]
-* link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java[Elasticsearch]
-* link:https://github.com/apache/stormcrawler/blob/main/external/aws/src/main/java/com/digitalpebble/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java[AWS CloudSearch]
+* link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/bolt/IndexerBolt.java[Apache SOLR]
+* link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java[OpenSearch]
+* link:https://github.com/apache/stormcrawler/blob/main/external/aws/src/main/java/org/apache/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java[AWS CloudSearch]

All of these extend the class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/AbstractIndexerBolt.java[AbstractIndexerBolt].
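The per-host URL distribution that the URLPartitioner mentioned above provides for **SimpleFetcherBolt** instances can be sketched in standalone Java; this is a hypothetical illustration assuming partitioning by hostname (the real class also supports other modes):

```java
import java.net.URI;

// Hypothetical sketch of host-based URL partitioning, in the spirit of
// URLPartitioner: every URL from the same host maps to the same fetcher
// instance, so per-host politeness can be enforced locally.
class HostPartitioner {
    static int partition(String url, int numInstances) {
        String host = URI.create(url).getHost();
        if (host == null) {
            host = ""; // fall back for opaque or relative URLs
        }
        // floorMod keeps the index non-negative for negative hash codes
        return Math.floorMod(host.hashCode(), numInstances);
    }
}
```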
@@ -104,7 +104,7 @@ You can easily build your own custom indexer to integrate with other storage sys
=== Parser Bolts
==== JSoupParserBolt

-The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java[JSoupParserBolt] can be used to parse HTML documents and extract the outlinks, text, and metadata it contains. If you want to parse non-HTML documents, use the link:https://github.com/apache/stormcrawler/tree/main/external/src/main/java/com/digitalpebble/storm/crawler/tika[Tika-based ParserBolt] from the external modules.
+The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java[JSoupParserBolt] can be used to parse HTML documents and extract the outlinks, text, and metadata it contains. If you want to parse non-HTML documents, use the link:https://github.com/apache/stormcrawler/tree/main/external/tika/src/main/java/org/apache/stormcrawler/tika[Tika-based ParserBolt] from the external modules.

This parser calls the xref:urlfilters[URLFilters] and xref:parsefilters[ParseFilters] defined in the configuration. Please note that it calls xref:metadatatransfer[MetadataTransfer] prior to calling the xref:parsefilters[ParseFilters]. If you create new Outlinks in your [[ParseFilters]], you'll need to make sure that you use MetadataTransfer there to inherit the Metadata from the parent document.
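To make "extracting the outlinks" concrete, here is a standalone illustration. It deliberately avoids the JSoup library the real JSoupParserBolt uses, pulling `href` values with a regex and resolving them against the page URL; a regex is not a robust HTML parser, so treat this purely as a sketch:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the real JSoupParserBolt uses JSoup's DOM parser.
// This sketch shows the essence of outlink extraction and resolution.
class OutlinkSketch {
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    static List<String> outlinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // resolve relative links against the URL of the page itself
            links.add(URI.create(pageUrl).resolve(m.group(1)).toString());
        }
        return links;
    }
}
```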

docs/src/main/asciidoc/quick-start.adoc

Lines changed: 1 addition & 1 deletion

@@ -63,7 +63,7 @@ The archetype will generate a fully-structured project including:

After generation, navigate into the newly created directory (named after the `artifactId` you specified).

-TIP: You can learn more about the architecture and how each component works together if you look into link:architecture.adoc[the architecture documentation].
+TIP: You can learn more about the architecture and how each component works together if you look into xref:architecture[the architecture documentation].
By exploring that part of the documentation, you can gain a better understanding of how StormCrawler performs crawling and how bolts, spouts, as well as parse and URL filters, collaborate in the process.

==== Docker Compose Setup
