Introduction
polymath is a highly modular, open-source web crawler, designed so that each of its processing phases can easily be swapped or extended. It also offers:
- PDF support by default
- Can be used via a CLI or Kafka*
- Low CPU and RAM usage, thanks to Rust

* An HTTP API is planned, but it won't be enabled by default.
License
The polymath source code and documentation are released under the Mozilla Public License v2.0.
Installation
Pre-compiled binaries
We provide precompiled binaries on the GitHub Releases page. Download the binary for your platform (Windows, macOS or Linux) and extract the archive.
Build from source
You must have Rust and Cargo installed.
Once Rust is installed, run the following command to build and install polymath:

```bash
cargo install --profile cli --git https://github.com/Lubmminy/polymath polymath-cli
```

To uninstall `polymath-cli`, run `cargo uninstall polymath-cli`.
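Once the build has finished, you can check that the binary is available by listing the supported commands:

```bash
polymath-cli help
```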
Uninstallation
Pre-compiled binaries
Simply delete the binary from the folder where you installed it.
If you added it to your `PATH`, don't forget to remove that entry as well.
Build from source
To uninstall `polymath-cli`, run `cargo uninstall polymath-cli`.
You can also go to `~/.cargo/bin/` and delete the binary manually.
Deployment
While the CLI is useful for testing the crawler, in production you should deploy a server instead. Since polymath is a collection of libraries, you can build your own server quite easily. However, you can also use our server; the documentation below shows you how.
⚠️ To expose an API over HTTP or RPC, you need to use a customised server.
Docker
We recommend deploying the polymath crawler with Docker (or Podman). In this example, we're going to deploy polymath together with its extension that saves pages to Apache Solr.
Create a `docker-compose.yaml` file and write:
```yaml
services:
  solr:
    image: solr:9-slim
    ports:
      - 8983:8983
    volumes:
      - data:/var/solr
    command:
      - solr-precreate
      - gettingstarted
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - 2181:2181
  kafka:
    image: wurstmeister/kafka
    depends_on:
      - zookeeper
    ports:
      - 9092:9092
    environment:
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9092,OUTSIDE://localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9092,OUTSIDE://0.0.0.0:9093
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_CREATE_TOPICS: "baeldung:1:1"
  polymath:
    image: ghcr.io/lubmminy/polymath
    depends_on:
      - solr
      - kafka
volumes:
  data:
```
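You can then start the whole stack (Solr, ZooKeeper, Kafka and the crawler) in the background with the standard Compose command:

```bash
docker compose up -d
```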
Command Line Tool
The `polymath` command-line tool is used to crawl web pages.
After you have installed `polymath-cli`, you can run the `polymath-cli help` command in your terminal to view the available commands.
The following sections provide in-depth information on the available commands.
- `polymath-cli crawl <URL>`: crawl a single web page.
crawl
The `crawl` command is used to fetch web pages.

```bash
polymath-cli crawl <URL>
```

You MUST specify a URL. This URL will be fetched, and every URL found within it will then be crawled.
--depth
The `--depth` (`-d`) flag specifies the number of pages to be fetched.
- Default: 1
- Maximum value: 100 (no maximum value in production)
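For example, the following invocation (illustrative only) fetches up to 10 pages starting from a single URL:

```bash
polymath-cli crawl https://example.com/ --depth 10
```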
--robots-txt
The `--robots-txt` flag allows you to bypass `/robots.txt`.
Bypassing it can lead to failures and rate limiting.
You must specify a boolean (`true` or `false`). For more details, see the robots.txt extension.
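For example, to keep the `robots.txt` check explicitly enabled (assuming `true` enables the check, consistent with the `with_robots_txt(true)` extension example further down):

```bash
polymath-cli crawl https://example.com/ --robots-txt true
```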
--path
The `--path` (`-p`) flag lets you specify a directory path where all fetched pages will be saved as text content. Alternatively, use `--solr-address` to save pages to Apache Solr instead of plain-text files.
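For example, to save every fetched page under a local directory (the directory name is purely illustrative):

```bash
polymath-cli crawl https://example.com/ --path ./pages
```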
--solr-address
The `--solr-address` option connects the crawler to a Solr collection.
Example: `--solr-address http://localhost:8983/api/collections/websites`
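A complete invocation combining the option with the `crawl` command might look like this (reusing the collection URL from the example above):

```bash
polymath-cli crawl https://example.com/ --solr-address http://localhost:8983/api/collections/websites
```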
Robots.txt
This extension allows web crawlers to read and respect the instructions in a website's `robots.txt` file before starting the crawl.
A `robots.txt` file is a text file on a website that tells crawlers (such as search engine bots) which parts of the site they may access.
The extension supports two main directives from `robots.txt`:
- `Crawl-delay`: sets the number of seconds the robot must wait between successive requests.
- `Disallow`: prevents the crawler from accessing specific URLs or directories on the website.

The extension also checks for meta robots tags in web pages, which provide additional crawling instructions alongside `robots.txt`. For more information on meta robots, see More information.
The extension can also be used to retrieve sitemaps from `robots.txt`.
Example
```rust
use polymath_crawler::Crawler;
use robots::Extension;

#[tokio::main]
async fn main() {
    // Create a custom crawler.
    let crawler = Crawler::new()
        .with_robots_txt(true)
        .with_name("Gravitaliabot");

    // Start crawling websites.
    // It will first check https://example.com/robots.txt before
    // crawling the site.
    crawler.fetch(vec!["https://example.com/"]).await;
}
```
Sitemap
This extension allows you to access and read the sitemaps provided by a site, either by looking directly at `/sitemap.xml` or by reading them from `/robots.txt`.
Example
```rust
use polymath_crawler::Crawler;
use sitemap::Extension;

#[tokio::main]
async fn main() {
    // Create a custom crawler.
    let crawler = Crawler::new()
        .with_sitemap(true);

    // Start crawling websites.
    // It will first look up the site's sitemaps (via /sitemap.xml
    // or /robots.txt) before crawling the site.
    crawler.fetch(vec!["https://example.com/"]).await;
}
```
Solr
The Solr extension lets you save crawled pages in an Apache Solr collection.
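By analogy with the robots.txt and sitemap examples above, enabling the Solr extension from Rust might look roughly like the sketch below. The `solr::Extension` import and the `with_solr` builder method are assumptions based on that pattern rather than a confirmed API; check the crate documentation for the actual names.

```rust
use polymath_crawler::Crawler;
// Hypothetical import, mirroring the robots.txt and sitemap examples.
use solr::Extension;

#[tokio::main]
async fn main() {
    // Create a custom crawler and point it at a Solr collection.
    // `with_solr` is an assumed builder name; the real method may differ.
    let crawler = Crawler::new()
        .with_solr("http://localhost:8983/api/collections/websites");

    // Crawled pages would then be stored in the `websites` collection.
    crawler.fetch(vec!["https://example.com/"]).await;
}
```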