1 About
2 Getting started
2.1 Where we are now
2.2 What we can do about it
3 Tools for downloading data
3.1 Command-line utilities
3.1.1 Installing and using Unix shells
3.1.2 wget / wget2
3.1.3 curl
3.1.4 aria2
3.2 R
3.2.1 httr / httr2
3.3 Python
4 Locating data on live websites
4.1 File path reverse engineering
4.1.1 Collecting URLs
4.1.2 Examples
4.2 Using APIs
4.3 Inspecting network requests
4.4 Files meant for web crawlers
5 Downloading from the Internet Archive
5.1 Using the Wayback Machine web interface
5.1.1 The search bar
5.1.2 The Calendar tab
5.1.3 Capture pages
5.1.4 The Site Map tab
5.1.5 The URLs tab
5.2 Using download scripts
5.2.1 Querying pages
5.2.2 Downloading pages
5.3 Using the CDX API
5.3.1 Querying the CDX API
5.3.2 Reconstructing the Wayback Machine URL from CDX API output
5.3.3 Examples
5.4 Common pitfalls
5.4.1 Rate limits
5.4.2 Non-nested websites and externally-linked resources
5.5 Worked examples
6 Triage
6.1 Low concern: statically-linked files
6.2 Medium concern: links from dynamic webpages
6.3 High concern: data behind portals
6.4 Highest concern: data served only by APIs
6.5 Summary
7 Storage and dissemination
Preserving Public Health Data
2 Getting started

Note: this page is under construction (there are no complete sections).

2.1 Where we are now

2.2 What we can do about it