Introduction

polymath is an open-source web crawler designed to be highly modular: each of its processing phases can easily be customised or replaced. It also offers:

  • Supports PDF by default
  • Can be used via a CLI or Kafka*
  • Low CPU and RAM usage, thanks to Rust

* An HTTP API is planned, but it will not be enabled by default.

License

The polymath source code and documentation are released under the Mozilla Public License v2.0.

Installation

Pre-compiled binaries

We provide precompiled binaries on the GitHub Releases page. Download the binary for your platform (Windows, macOS or Linux) and extract the archive.

Build from source

You must install Rust and Cargo.

Once you have installed Rust, execute the following command to build and install polymath:

cargo install --profile cli --git https://github.com/Lubmminy/polymath polymath-cli

If you want to uninstall polymath-cli, execute cargo uninstall polymath-cli.

Uninstallation

Pre-compiled binaries

Simply delete the binary located in the folder where you installed it.

If you added the binary to your PATH, don’t forget to remove that entry as well.

Build from source

If you want to uninstall polymath-cli, execute cargo uninstall polymath-cli.

Alternatively, you can delete the binary manually from ~/.cargo/bin/.

Deployment

While the CLI is useful for testing the crawler, in production you should deploy a server instead. Since polymath is a collection of libraries, you can create your own server quite easily; alternatively, you can use the server we provide. The documentation below shows you how.

⚠️ To expose an API over HTTP or RPC, you need to use a customised server.

Docker

We recommend deploying the polymath crawler with Docker (or Podman). In this example, we’re going to deploy polymath together with its extension for saving pages to Apache Solr.

Create a docker-compose.yaml file with the following content:

services:
    solr:
        image: solr:9-slim
        ports:
            - 8983:8983
        volumes:
            - data:/var/solr
        command:
            - solr-precreate
            - gettingstarted

    zookeeper:
        image: wurstmeister/zookeeper
        ports:
            - 2181:2181

    kafka:
        image: wurstmeister/kafka
        depends_on:
            - zookeeper
        ports:
            - 9093:9093
        environment:
            KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9092,OUTSIDE://localhost:9093
            KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
            KAFKA_LISTENERS: INSIDE://0.0.0.0:9092,OUTSIDE://0.0.0.0:9093
            KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
            KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
            KAFKA_CREATE_TOPICS: "baeldung:1:1"


    polymath:
        image: ghcr.io/lubmminy/polymath
        depends_on:
            - solr
            - kafka

volumes:
    data:
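
Then start the stack in the background:

docker compose up -d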

Command Line Tool

The polymath command-line tool is used to crawl web pages. After you have installed polymath-cli, you can run the polymath-cli help command in your terminal to view the available commands.

The following sections provide in-depth information on the available commands.

crawl

The crawl command is used to fetch web pages.

polymath-cli crawl <URL>

You MUST specify a URL. This URL will be fetched, and the URLs found within it will then be crawled.

--depth

The --depth (-d) flag specifies the number of pages to be fetched.

  • Default: 1
  • Maximum value: 100 (no maximum value in production)
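
For example, to crawl with a depth of 10:

polymath-cli crawl https://example.com/ --depth 10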

--robots-txt

The --robots-txt flag lets you bypass /robots.txt. Doing so can lead to failures and rate limiting.

You must specify a boolean (true or false). For more details, see the robots.txt extension.
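
For example, assuming that passing false disables the robots.txt check:

polymath-cli crawl https://example.com/ --robots-txt false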

--path

The --path (-p) flag lets you specify a directory path where all fetched pages will be saved as plain text. Alternatively, use --solr-address to save pages to Apache Solr instead of to text files on disk.
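
For example, to save fetched pages under a hypothetical ./pages directory:

polymath-cli crawl https://example.com/ --path ./pages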

--solr-address

The --solr-address option lets you connect the crawler to a Solr collection.

Example: --solr-address http://localhost:8983/api/collections/websites

Robots.txt

This extension allows the crawler to fetch and respect the instructions in a website’s robots.txt file before starting the crawl.

A robots.txt file is a text file on a website that tells crawlers (bots) which parts of the site they may access. The extension supports two main directives from robots.txt, illustrated in the example after this list:

  • Crawl-delay: sets the number of seconds the crawler must wait between successive requests.
  • Disallow: prevents the crawler from accessing specific URLs or directories on the website.
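
For example, a robots.txt file that uses both directives might look like this:

User-agent: *
Crawl-delay: 10
Disallow: /private/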

The extension also checks for meta robots tags in web pages, which provide additional instructions for crawling alongside robots.txt. For more information on meta robots, see More information.

The extension can also be used to retrieve the sitemaps declared in robots.txt.

Example

use polymath_crawler::Crawler;
use robots::Extension;

#[tokio::main]
async fn main() {
    // Create custom crawler.
    let crawler = Crawler::new()
        .with_robots_txt(true)
        .with_name("Gravitaliabot");

    // Start crawling websites.
    // It will first check https://example.com/robots.txt before
    // crawling the site.
    crawler.fetch(vec!["https://example.com/"]).await;
}

Sitemap

This extension allows you to access and read the sitemaps provided by a site, either by looking directly at /sitemap.xml or by reading the entries in /robots.txt.

Example

use polymath_crawler::Crawler;
use sitemap::Extension;

#[tokio::main]
async fn main() {
    // Create custom crawler.
    let crawler = Crawler::new()
        .with_sitemap(true);

    // Start crawling websites.
    // It will first look for https://example.com/sitemap.xml (or the sitemaps
    // declared in robots.txt) before crawling the site.
    crawler.fetch(vec!["https://example.com/"]).await;
}

Solr

The Solr extension lets you save crawled pages in an Apache Solr collection.
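
For example, when using the CLI, crawled pages can be sent to a Solr collection through the --solr-address option described above:

polymath-cli crawl https://example.com/ --solr-address http://localhost:8983/api/collections/websites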