crawl

The crawl command

The crawl command fetches web pages.

polymath-cli crawl <URL>

You MUST specify a URL. This URL is fetched first, and the URLs found on that page are then crawled in turn.
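
For example, to crawl a page and the links it contains (the URL is illustrative):

polymath-cli crawl https://example.com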

--depth

The --depth (-d) flag sets the number of pages to fetch.

  • Default: 1
  • Maximum value: 100 (there is no maximum in production)
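
For instance, to fetch up to 25 pages starting from an illustrative URL:

polymath-cli crawl https://example.com --depth 25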

--robots-txt

The --robots-txt flag lets you bypass a site's /robots.txt rules. Bypassing them can lead to request failures and rate limiting.

It takes a boolean value (true or false). For more details, see the robots.txt extension.
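
As a sketch, assuming that passing false disables the robots.txt check:

polymath-cli crawl https://example.com --robots-txt false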

--path

The --path (-p) flag lets you specify a directory where all fetched pages are saved as plain-text files. Alternatively, use --solr-address to store pages in Apache Solr instead of on disk.
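
For example, to save fetched pages under a local directory (the path is illustrative):

polymath-cli crawl https://example.com --path ./crawled-pages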

--solr-address

The --solr-address option connects the crawler to an Apache Solr collection, so fetched pages are indexed there rather than written to disk.

Example: --solr-address http://localhost:8983/api/collections/websites
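
Putting the options together, a full invocation might look like this (the URL, depth, and collection address are illustrative):

polymath-cli crawl https://example.com --depth 10 --solr-address http://localhost:8983/api/collections/websites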