3 Tools for downloading data

Note: this page is under construction (there are no complete sections).

3.1 Command-line utilities

3.1.1 Installing and using Unix shells

3.1.2 wget / wget2

GNU wget and its successor wget2 are command-line download utilities that support a wide array of protocols and offer many helpful features, including recursive retrieval, WARC output, URL parameters and POST data, rate limiting, and automatic retries.

wget is designed to run non-interactively: once you have specified your arguments (retrieval options) and invoked the wget command, it runs unsupervised until all specified URLs (and their child URLs, if retrieving recursively) have been retrieved.

Installation instructions:

  • Windows: Through the Windows Subsystem for Linux.
  • macOS: Through Homebrew (brew install wget) or by compiling.
  • Linux: Generally included by default; if not, consult your distribution’s package repository.

Usage

Useful command-line arguments

For an exhaustive list, see the wget documentation or run man wget or wget --help.

Examples
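
As a minimal sketch of the unsupervised, recursive retrieval described above, the helper below wraps one wget invocation that combines recursion, rate limiting, and automatic retries. The function name mirror_site, the default output directory, and all numeric values are illustrative assumptions, not recommendations; the options themselves are standard GNU wget options and may behave differently under wget2.

```shell
# Hypothetical helper: recursively download everything below one directory URL.
# Usage: mirror_site URL [DEST_DIR]
mirror_site() {
    url=$1
    dest=${2:-downloads}   # assumed default output directory

    # --recursive         follow links found in retrieved pages
    # --no-parent         never ascend above the starting directory
    # --level=2           limit the recursion depth (illustrative value)
    # --wait=1            pause one second between requests
    # --limit-rate=500k   cap the download bandwidth (illustrative value)
    # --tries=5           attempt each download up to five times
    # --continue          resume partially downloaded files
    # --directory-prefix  store everything under DEST_DIR
    wget --recursive --no-parent --level=2 \
         --wait=1 --limit-rate=500k \
         --tries=5 --continue \
         --directory-prefix="$dest" \
         "$url"
}
```

It could be called as mirror_site https://example.org/data/ mirror, where the URL is a placeholder. Once started, wget needs no further input and can be left to run to completion.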

3.1.3 curl

curl is another common data transfer tool. Compared with wget, it is much better suited to interactive use (for example, to explore APIs, or in scripts that parse server responses) and supports a much wider array of protocols. Although it can also be used as a standalone download utility, it does not support recursive retrieval.

Installation instructions:

  • Windows: Through the Windows Subsystem for Linux.
  • macOS: Included by default.
  • Linux: Generally included by default; if not, consult your distribution’s package repository.

Usage

Useful command-line arguments

For an exhaustive list, see the curl man page or run man curl or curl --help all.

Examples
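
The two helpers below sketch the uses mentioned above: exploring an API interactively and downloading a single file. The function names api_get and download_file, the Accept header, and the retry count are assumptions made for illustration; the curl options are standard ones.

```shell
# Hypothetical helper: request an API endpoint, print the response body,
# then print the HTTP status code on its own line.
# Usage: api_get URL
api_get() {
    # --silent --show-error  hide the progress meter but still report errors
    # --location             follow redirects
    # --header               send a request header (here: ask for JSON)
    # --write-out            print the status code after the response body
    curl --silent --show-error --location \
         --header 'Accept: application/json' \
         --write-out '\nHTTP status: %{http_code}\n' \
         "$1"
}

# Hypothetical helper: save one file under its remote name.
# Usage: download_file URL
download_file() {
    # --fail         exit with an error on HTTP 4xx/5xx instead of saving the error page
    # --remote-name  name the local file after the last part of the URL
    # --retry 3      retry transient failures up to three times (illustrative value)
    curl --fail --location --remote-name --retry 3 "$1"
}
```

For example, api_get https://example.org/api/items (a placeholder URL) prints the response followed by its status code, which is convenient when probing how an API behaves before writing a script around it.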

3.1.4 aria2

aria2 is another command-line download tool similar to wget. It supports HTTP(S), FTP, SFTP, BitTorrent, and Metalink, and may be preferable for downloading a large number of files whose URLs are known in advance. Unlike wget, aria2 marks incomplete or failed downloads as such (allowing automatic retries on the next run and letting you check file integrity at a glance) and has better support for concurrency, with URL-based, domain-based, and chunk-based parallelism.

Installation instructions:

  • Windows: Binaries available from the aria2 website.
  • macOS: Through Homebrew (brew install aria2), or from the aria2 website.
  • Linux: Consult your distribution’s package repository.

Usage

Examples
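
A minimal sketch of a bulk download from a list of known URLs follows. The function name fetch_list, the default output directory, and the numeric values are illustrative assumptions. Mapping the three kinds of parallelism described above onto --max-concurrent-downloads, --max-connection-per-server, and --split is likewise an interpretation, not something the aria2 manual states in those terms.

```shell
# Hypothetical helper: download every URL listed in a text file (one URL per line).
# Usage: fetch_list URL_FILE [DEST_DIR]
fetch_list() {
    # --input-file                 read the URLs to download from URL_FILE
    # --dir                        directory in which to store the downloads
    # --max-concurrent-downloads   number of URLs downloaded in parallel
    # --max-connection-per-server  connections opened to one server per download
    # --split                      number of connections (chunks) used per file
    # --continue=true              resume partially downloaded files
    # --max-tries                  attempts per download before giving up
    aria2c --input-file="$1" --dir="${2:-downloads}" \
           --max-concurrent-downloads=4 \
           --max-connection-per-server=2 \
           --split=2 \
           --continue=true --max-tries=5
}
```

It could be called as fetch_list urls.txt data, where urls.txt is a file you have prepared. Running the same command again after an interruption should pick up the unfinished downloads, since aria2 keeps a control file next to each incomplete file.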

3.2 R

3.2.1 httr / httr2

httr and httr2 are R wrappers around the libcurl library used by curl.

Usage

Examples

3.3 Python

requests is a generic HTTP library for Python. Unlike httr and httr2 for R, requests does not build upon libcurl and instead uses urllib3.

Usage

Examples