ℹ️ Open Crawler uses both JRuby and Java. We recommend using version managers for both; when developing Open Crawler we use rbenv and jenv. See the rbenv and jenv documentation for instructions on setting up these env managers.

1. Clone the repository:

   ```shell
   git clone https://github.com/elastic/crawler.git
   ```

2. Go to the root of the Open Crawler directory and check that the expected Java and Ruby versions are being used:

   ```shell
   # should output the same version as `.ruby-version`
   ruby --version

   # should output the same version as `.java-version`
   java -version
   ```

3. If the versions are correct, install dependencies:

   ```shell
   make install
   ```

   You can also set the env variable `CRAWLER_MANAGE_ENV` to have the install script automatically check whether `rbenv` and `jenv` are installed and that the correct versions are running for both. Doing this requires that you use both `rbenv` and `jenv` in your local setup:

   ```shell
   CRAWLER_MANAGE_ENV=true make install
   ```

4. Now you should be able to run Crawler locally:

   ```shell
   bin/crawler crawl path/to/config.yml
   ```
The crawler details need to be provided in a crawler configuration file. You can specify Elasticsearch instance configuration within that file, or optionally in a separate configuration file. This allows multiple crawlers to share a single Elasticsearch configuration.
For more details check out the following documentation.
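As a rough sketch of what such a configuration might look like (the exact field names below are assumptions based on the description above, so treat the linked documentation as authoritative):

```yaml
# Hypothetical crawler config (field names are illustrative).
domains:
  - url: https://example.com
    seed_urls:
      - https://example.com/blog

output_sink: elasticsearch

# Elasticsearch settings can live here, or in a separate
# configuration file shared by multiple crawlers.
elasticsearch:
  host: http://localhost
  port: 9200
  api_key: <your-api-key>
```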
Starting with the endpoints specified in seed_urls in the config, the coordinator creates a crawl task and adds it to a queue.
These tasks are then executed by the HTTP executor, producing a crawl result that contains further links to follow.
The coordinator will then send the crawl result to the output sink, and create more crawl tasks for the links it found.
The output sink will then format the doc using the document mapper before outputting the result.
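The coordinator loop described above can be sketched roughly as follows. This is a simplified illustration, not Open Crawler's actual API; the class and method names are made up for the example.

```ruby
require 'set'

# Illustrative stand-ins for Open Crawler's internal types.
CrawlTask = Struct.new(:url)
CrawlResult = Struct.new(:url, :content, :links)

# Seed the queue, execute tasks via the HTTP executor, send each
# result to the output sink, and enqueue newly discovered links.
def coordinate(seed_urls, executor, sink)
  queue = seed_urls.map { |u| CrawlTask.new(u) }
  seen = Set.new(seed_urls)

  until queue.empty?
    task = queue.shift
    result = executor.call(task)   # HTTP executor fetches the page
    sink.call(result)              # output sink formats and emits the doc
    result.links.each do |link|    # Set#add? returns nil for duplicates
      queue << CrawlTask.new(link) if seen.add?(link)
    end
  end
end
```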
If the output sink is console or file, it simply outputs the crawl result as soon as it is crawled.
If the output sink is elasticsearch, it adds crawl results to a bulk queue for processing.
Crawl results accumulate in the bulk queue until a threshold is met (either the number of queued items or the queue size in bytes).
The sink then flushes the queue, which triggers a _bulk API request to the configured Elasticsearch instance.
The _bulk API settings can be configured in the config file.
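The thresholded flush behavior can be sketched like this. It is a simplified illustration: the class name and thresholds are assumptions, and in the real sink the flush handler issues the _bulk API request.

```ruby
# Simplified sketch of a bulk queue that flushes when either the
# item count or the accumulated byte size crosses a threshold.
class BulkQueue
  def initialize(max_items: 100, max_bytes: 1_000_000, &flush_handler)
    @max_items = max_items
    @max_bytes = max_bytes
    @flush_handler = flush_handler  # e.g. sends a _bulk request
    @items = []
    @bytes = 0
  end

  def add(doc)
    payload = doc.to_s
    @items << payload
    @bytes += payload.bytesize
    flush if @items.size >= @max_items || @bytes >= @max_bytes
  end

  def flush
    return if @items.empty?
    @flush_handler.call(@items)
    @items = []
    @bytes = 0
  end
end
```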
Unit tests are found under the spec directory. We require unit tests to be added or updated for every contribution.
The makefile provides test commands that wrap a typical `bundle exec rspec` invocation.
You can run all tests in the repo, all tests in a single file, or a single spec within a file.
Target files are specified with the `file=/path/to/spec` argument.
```shell
# run all tests in elastic-crawler
make test

# run all unit tests in `crawl_spec.rb`
make test file=spec/lib/crawler/api/crawl_spec.rb

# run only the unit test on line 35 in `crawl_spec.rb`
make test file=spec/lib/crawler/api/crawl_spec.rb:35
```