ℹ️ Open Crawler uses both JRuby and Java. We recommend using version managers for both; when developing Open Crawler we use rbenv and jenv. See the rbenv and jenv documentation for instructions on setting up these env managers.

1. Clone the repository:

   ```shell
   git clone https://github.com/elastic/crawler.git
   ```

2. Go to the root of the Open Crawler directory and check that the expected Java and Ruby versions are being used:

   ```shell
   # should output the same version as `.ruby-version`
   ruby --version

   # should output the same version as `.java-version`
   java -version
   ```

3. If the versions are correct, install dependencies:

   ```shell
   make install
   ```

   You can also set the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed and that the correct versions are running for both. Doing this requires that you use both `rbenv` and `jenv` in your local setup:

   ```shell
   CRAWLER_MANAGE_ENV=true make install
   ```

4. Now you should be able to run Crawler locally:

   ```shell
   bin/crawler crawl path/to/config.yml
   ```
The crawler details need to be provided in a crawler configuration file. You can specify Elasticsearch instance configuration within that file, or optionally in a separate configuration file. This allows multiple crawlers to share a single Elasticsearch configuration.
For more details check out the following documentation.
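As a rough sketch of what such a configuration might look like (the exact field names below are assumptions based on the description above, so treat the linked documentation as authoritative):

```yaml
# Hypothetical crawler config (field names are illustrative).
domains:
  - url: https://example.com
    seed_urls:
      - https://example.com/blog

output_sink: elasticsearch

# Elasticsearch settings can live here, or in a separate
# configuration file shared by multiple crawlers.
elasticsearch:
  host: http://localhost
  port: 9200
  api_key: <your-api-key>
```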
Starting with the endpoints specified in seed_urls in the config, the coordinator creates a crawl task and adds it to a queue.
These tasks are then executed by the HTTP executor, producing a crawl result that contains further links to follow.
The coordinator will then send the crawl result to the output sink, and create more crawl tasks for the links it found.
The output sink will then format the doc using the document mapper before outputting the result.
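The coordinator loop described above can be sketched roughly as follows. This is a simplified illustration, not Open Crawler's actual API; the class and method names are made up for the example.

```ruby
require 'set'

# Illustrative stand-ins for Open Crawler's internal types.
CrawlTask = Struct.new(:url)
CrawlResult = Struct.new(:url, :content, :links)

# Seed the queue, execute tasks via the HTTP executor, send each
# result to the output sink, and enqueue newly discovered links.
def coordinate(seed_urls, executor, sink)
  queue = seed_urls.map { |u| CrawlTask.new(u) }
  seen = Set.new(seed_urls)

  until queue.empty?
    task = queue.shift
    result = executor.call(task)   # HTTP executor fetches the page
    sink.call(result)              # output sink formats and emits the doc
    result.links.each do |link|    # Set#add? returns nil for duplicates
      queue << CrawlTask.new(link) if seen.add?(link)
    end
  end
end
```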
If the output sink is console or file, it simply outputs the crawl result as soon as it is crawled.
If the output sink is elasticsearch, it adds crawl results to a bulk queue for processing.
Crawl results accumulate in the bulk queue until a threshold is met (either the number of queued items or the queue size in bytes).
The sink then flushes the queue, which triggers a _bulk API request to the configured Elasticsearch instance.
The _bulk API settings can be configured in the config file.
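The thresholded flush behavior can be sketched like this. It is a simplified illustration: the class name and thresholds are assumptions, and in the real sink the flush handler issues the _bulk API request.

```ruby
# Simplified sketch of a bulk queue that flushes when either the
# item count or the accumulated byte size crosses a threshold.
class BulkQueue
  def initialize(max_items: 100, max_bytes: 1_000_000, &flush_handler)
    @max_items = max_items
    @max_bytes = max_bytes
    @flush_handler = flush_handler  # e.g. sends a _bulk request
    @items = []
    @bytes = 0
  end

  def add(doc)
    payload = doc.to_s
    @items << payload
    @bytes += payload.bytesize
    flush if @items.size >= @max_items || @bytes >= @max_bytes
  end

  def flush
    return if @items.empty?
    @flush_handler.call(@items)
    @items = []
    @bytes = 0
  end
end
```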
Unit tests are found under the spec directory. We require unit tests to be added or updated for every contribution.
The makefile provides test commands that wrap a typical `bundle exec rspec` invocation.
You can run all tests in the repo, all tests in a single file, or a single spec within a file.
Target files are specified with the `file=/path/to/spec` argument.
```shell
# run all tests in elastic-crawler
make test

# run all unit tests in `crawl_spec.rb`
make test file=spec/lib/crawler/api/crawl_spec.rb

# run only the unit test on line 35 in `crawl_spec.rb`
make test file=spec/lib/crawler/api/crawl_spec.rb:35
```