diff --git a/README.md b/README.md
index 44dd7096..5ded344a 100644
--- a/README.md
+++ b/README.md
@@ -125,7 +125,7 @@ curl -sSL 'https://get.archivebox.io' | sh
 ## 🤝 Professional Integration
-ArchiveBox is free for everyone to self-host, but we also provide support, security review, and custom integrations to help NGOs and other organizations [run ArchiveBox professionally](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102):
+ArchiveBox is free for everyone to self-host, but we also provide support, security review, and custom integrations to help NGOs, governments, and other organizations [run ArchiveBox professionally](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102):
 - 🗞️ **Journalists:** `crawling and collecting research`, `preserving quoted material`, `fact-checking and review`
@@ -161,7 +161,7 @@ ArchiveBox is free for everyone to self-host, but we also provide support, secur
 #### ✳️  Easy Setup
-
+
Docker docker-compose (macOS/Linux/Windows)   👈  recommended   (click to expand)
 👍 Docker Compose is recommended for the easiest install/update UX + best security + all the extras out-of-the-box.
@@ -466,7 +466,7 @@ docker compose run archivebox help
 - `archivebox` `schedule` to pull in fresh URLs regularly from [bookmarks/history/Pocket/Pinboard/RSS/etc.](#input-formats)
-
+
curl sh automatic setup script CLI Usage Examples (non-Docker)

@@ -520,7 +520,7 @@ ls ./archive/*/index.html  # or inspect snapshot data directly on the filesystem
 
 
-
+
🖥  Web UI Usage

 # Start the server on bare metal (pip/apt/brew/etc):
@@ -555,8 +555,8 @@ docker compose run archivebox config --set ...
 
 > [!TIP]
-> Whether in Docker or not, ArchiveBox commands all work the same way, and can be used in tandem to access the same data directory.
-> For example, you can run the Web UI in Docker Compose, and run one-off commands on host with `pip`-installed ArchiveBox or in Docker interchangeably.
+> Whether in Docker or not, ArchiveBox commands work the same way, and can be used to access the same data on-disk.
+> For example, you could run the Web UI in Docker Compose, and run one-off commands with `pip`-installed ArchiveBox.
Expand to show comparison...
@@ -641,29 +641,36 @@ It also includes a built-in scheduled import feature with `archivebox schedule`
 ## Output Formats: What ArchiveBox saves for each URL
-
-Inside each Snapshot folder, ArchiveBox saves these different types of extractor outputs as plain files:
-
-`./archive/{Snapshot.id}/`
-- **Index:** `index.html` & `index.json` HTML and JSON index files containing metadata and details
-- **Title**, **Favicon**, **Headers** Response headers, site favicon, and parsed site title
-- **SingleFile:** `singlefile.html` HTML snapshot rendered with headless Chrome using SingleFile
-- **Wget Clone:** `example.com/page-name.html` wget clone of the site with `warc/TIMESTAMP.gz`
-- Chrome Headless
-  - **PDF:** `output.pdf` Printed PDF of site using headless chrome
-  - **Screenshot:** `screenshot.png` 1440x900 screenshot of site using headless chrome
-  - **DOM Dump:** `output.html` DOM Dump of the HTML after rendering using headless chrome
-- **Article Text:** `article.html/json` Article text extraction using Readability & Mercury
-- **Archive.org Permalink:** `archive.org.txt` A link to the saved site on archive.org
-- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl (or yt-dlp)
-- **Source Code:** `git/` clone of any repository found on GitHub, Bitbucket, or GitLab links
-- _More coming soon! See the [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap)..._
+For each web page added, ArchiveBox creates a Snapshot folder and preserves its content as ordinary files inside the folder (e.g. HTML, PDF, PNG, JSON, etc.).
-It does everything out-of-the-box by default, but you can disable or tweak [individual archive methods](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) via environment variables / config.
+It uses all available methods out-of-the-box, but you can disable extractors and fine-tune the [configuration](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) as-needed.
+
+
+Expand to see the full list of ways ArchiveBox saves each page... + + +./archive/{Snapshot.id}/
+
+- Index: `index.html` & `index.json` HTML and JSON index files containing metadata and details
+- Title, Favicon, Headers Response headers, site favicon, and parsed site title
+- SingleFile: `singlefile.html` HTML snapshot rendered with headless Chrome using SingleFile
+- Wget Clone: `example.com/page-name.html` wget clone of the site with `warc/TIMESTAMP.gz`
+- Chrome Headless
+  - PDF: `output.pdf` Printed PDF of site using headless chrome
+  - Screenshot: `screenshot.png` 1440x900 screenshot of site using headless chrome
+  - DOM Dump: `output.html` DOM Dump of the HTML after rendering using headless chrome
+- Article Text: `article.html/json` Article text extraction using Readability & Mercury
+- Archive.org Permalink: `archive.org.txt` A link to the saved site on archive.org
+- Audio & Video: `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl (or yt-dlp)
+- Source Code: `git/` clone of any repository found on GitHub, Bitbucket, or GitLab links
+- More coming soon! See the Roadmap...
+
+
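As a rough illustration of the layout listed above, the following shell sketch reports which of those output files exist in a single Snapshot folder. The function name `list_snapshot_outputs` is hypothetical and not part of ArchiveBox; the file names are taken from the list above, and the wget clone and `article.html/json` outputs are left out because their exact paths vary per site or are not spelled out here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT an ArchiveBox command): report which of the
# extractor outputs named in the list above are present in one Snapshot folder.
list_snapshot_outputs() {
    local snapshot_dir="$1"
    local name
    # File and folder names copied from the list above.
    for name in index.html index.json singlefile.html output.pdf \
                screenshot.png output.html archive.org.txt media git; do
        if [ -e "$snapshot_dir/$name" ]; then
            echo "found:   $name"
        else
            echo "missing: $name"
        fi
    done
}
```

It could be called as `list_snapshot_outputs "./archive/$SNAPSHOT_ID"`, where `SNAPSHOT_ID` is an assumed variable holding one Snapshot's folder name. Extractors that were disabled in the config would simply show up as missing.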

 ## Configuration
@@ -671,19 +678,20 @@ It does everything out-of-the-box by default, but you can disable or tweak [indi
 ArchiveBox can be configured via environment variables, by using the `archivebox config` CLI, or by editing `./ArchiveBox.conf` directly.
-
-```bash
-archivebox config # view the entire config
+
+
+Expand to see examples... +
archivebox config                               # view the entire config
 archivebox config --get CHROME_BINARY           # view a specific value
-
+
 archivebox config --set CHROME_BINARY=chromium  # persist a config using CLI
 # OR
 echo CHROME_BINARY=chromium >> ArchiveBox.conf  # persist a config using file
 # OR
 env CHROME_BINARY=chromium archivebox ...       # run with a one-off config
-```
-
-These methods also work the same way when run inside Docker, see the Docker Configuration wiki page for details.
+
+These methods also work the same way when run inside Docker, see the Docker Configuration wiki page for details. +

 The configuration is documented here: **[Configuration Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration)**, and loaded here: [`archivebox/config.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/config.py).
@@ -767,13 +775,12 @@ Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not offici
 ## Archive Layout
 All of ArchiveBox's state (SQLite DB, archived assets, config, logs, etc.) is stored in a single folder called the "ArchiveBox Data Folder".
-Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in our examples), and you can create as many data folders as you want to hold different collections.
Expand to learn more about the layout of Archivebox's data on-disk...
-
+Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in our examples), and you can create as many data folders as you want to hold different collections.
 All archivebox CLI commands are designed to be run from inside an ArchiveBox data folder, starting with archivebox init to initialize a new collection inside an empty directory.
mkdir ~/archivebox && cd ~/archivebox   # just an example, can be anywhere
@@ -851,8 +858,6 @@ The paths in the static exports are relative, make sure to keep them next to you
 
---- -
security graphic
@@ -1075,10 +1080,6 @@ If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to
- ---- - -
paisley graphic @@ -1127,7 +1128,7 @@ A variety of open and closed-source archiving projects exist, but few provide a
-Click to read more...
+Click to read about how we differ from other centralized archiving services and open source tools...
 ArchiveBox tries to be a robust, set-and-forget archiving solution suitable for archiving RSS feeds, bookmarks, or your entire browsing history (beware, it may be too big to store), including private/authenticated content that you wouldn't otherwise share with a centralized service.
@@ -1156,33 +1157,21 @@ ArchiveBox is neither the highest fidelity nor the simplest tool available for s
-
-
-dependencies graphic -
+
 ## Internet Archiving Ecosystem
-
-Our Community Wiki page serves as an index of the broader web archiving community.
-
-
-- See where archivists hang out online
-- Explore other open-source tools for your web archiving needs
-- Learn which organizations are the big players in the web archiving space
-
-
-Explore our index of web archiving software, blogs, and communities around the world...
+Our Community Wiki strives to be a comprehensive index of the broader web archiving community...
- [Community Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community) - [Web Archiving Software](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#web-archiving-projects) _List of ArchiveBox alternatives and open source projects in the internet archiving space._ - - [The Master Lists](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#the-master-lists) - _Community-maintained indexes of archiving tools and institutions._ + - [Awesome-Web-Archiving Lists](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#the-master-lists) + _Community-maintained indexes of archiving tools and institutions like `iipc/awesome-web-archiving`._ - [Reading List](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#reading-list) _Articles, posts, and blogs relevant to ArchiveBox and web archiving in general._ - [Communities](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#communities) @@ -1201,7 +1190,6 @@ Our Community Wiki page serves as an index of the broader web archiving communit
----
documentation graphic @@ -1376,28 +1364,19 @@ archivebox init --setup
-#### Run the linters
+#### Run the linters / tests
 Click to expand...
 ```bash
 ./bin/lint.sh
-```
-(uses `flake8` and `mypy`)
-
-
-
-#### Run the integration tests
-
-Click to expand...
-
-```bash
 ./bin/test.sh
 ```
-(uses `pytest -s`)
+(uses `flake8`, `mypy`, and `pytest -s`)
+ #### Make migrations or enter a django shell
Click to expand... @@ -1492,47 +1471,31 @@ Extractors take the URL of a page to archive, write their output to the filesyst ## Further Reading -- Home: [ArchiveBox.io](https://archivebox.io) -- Demo: [Demo.ArchiveBox.io](https://demo.archivebox.io) -- Docs: [Docs.ArchiveBox.io](https://docs.archivebox.io) -- Releases: [Github.com/ArchiveBox/ArchiveBox/releases](https://github.com/ArchiveBox/ArchiveBox/releases) -- Wiki: [Github.com/ArchiveBox/ArchiveBox/wiki](https://github.com/ArchiveBox/ArchiveBox/wiki) -- Issues: [Github.com/ArchiveBox/ArchiveBox/issues](https://github.com/ArchiveBox/ArchiveBox/issues) -- Discussions: [Github.com/ArchiveBox/ArchiveBox/discussions](https://github.com/ArchiveBox/ArchiveBox/discussions) -- Community Chat: [Zulip Chat (preferred)](https://zulip.archivebox.io) or [Matrix Chat (old)](https://app.element.io/#/room/#archivebox:matrix.org) + + +- [ArchiveBox.io Homepage](https://archivebox.io) / [Source Code (Github)](https://github.com/ArchiveBox/ArchiveBox) / [Demo Server](https://demo.archivebox.io) +- [Documentation Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki) / [API Reference Docs](https://docs.archivebox.io) / [Changelog](https://github.com/ArchiveBox/ArchiveBox/releases) +- [Bug Tracker](https://github.com/ArchiveBox/ArchiveBox/issues) / [Discussions](https://github.com/ArchiveBox/ArchiveBox/discussions) / [Community Chat Forum (Zulip)](https://zulip.archivebox.io) - Social Media: [Twitter](https://twitter.com/ArchiveBoxApp), [LinkedIn](https://www.linkedin.com/company/archivebox/), [YouTube](https://www.youtube.com/@ArchiveBoxApp), [Alternative.to](https://alternativeto.net/software/archivebox/about/), [Reddit](https://www.reddit.com/r/ArchiveBox/) -- Donations: [Github.com/ArchiveBox/ArchiveBox/wiki/Donations](https://github.com/ArchiveBox/ArchiveBox/wiki/Donations) --- +
+🏛️ Contact us for professional support 💬


- -
- -This project is maintained mostly in my spare time with the help from generous contributors. - - -

- -**🏛️ [Contact us for professional support](https://docs.sweeting.me/s/archivebox-consulting-services) 💬** - -
-     - - -
-ArchiveBox operates as a US 501(c)(3) nonprofit, donations are tax-deductible.
(fiscally sponsored by HCB EIN: 81-2908499)

- -(网站存档 / 爬虫) - - - - -
-
-✨ Have spare CPU/disk/bandwidth and want to help the world?
Check out our Good Karma Kit...
+   +   +
+ArchiveBox operates as a US 501(c)(3) nonprofit (sponsored by HCB), donations are tax-deductible. +

+  +  +
+ArchiveBox was started by Nick Sweeting in 2017, and has grown steadily with help from our amazing contributors. +
+✨ Have spare CPU/disk/bandwidth after all your 网站存档爬 (website archiving and crawling) and want to help the world?
Check out our Good Karma Kit...