diff --git a/README.md b/README.md
index 5ded344a..75208349 100644
--- a/README.md
+++ b/README.md
@@ -25,23 +25,25 @@ Without active preservation effort, everything on the internet eventually dissap
*ArchiveBox is an open source tool that helps organizations and individuals archive web content and retain control over their data: save copies of browser bookmarks, preserve evidence for legal cases, backup photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more...*

-> ➡️ *Use ArchiveBox on [Linux](#quickstart)/[macOS](#quickstart)/[Windows](#quickstart)/[Docker](#quickstart) as a [CLI tool](#usage), [self-hosted Web App](https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive), [`pip` library](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#python-shell-usage), or [one-off command](#static-archive-exporting).*
+> ➡️ *ArchiveBox is available on [Linux](#quickstart)/[macOS](#quickstart)/[Windows](#quickstart)/[Docker](#quickstart) as a [CLI tool](#usage), [self-hosted Web App](https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive), [`pip` library](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#python-shell-usage), or [one-off command](#static-archive-exporting).*
./archive/{Snapshot.id}/
./archive/TIMESTAMP/
includes a static
+
-> *NOTE: These exports are not paginated, exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the `archivebox list` command to export specific Snapshots or ranges.*
-
-```bash|
-# do a one-off single URL archive wihout needing a data dir initialized
+NOTE: These exports are not paginated, exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the `archivebox list` command to export specific Snapshots or ranges.
# do a one-off single URL archive without needing a data dir initialized
archivebox oneshot 'https://example.com'
# archivebox list --help
@@ -843,16 +852,17 @@ archivebox list --csv=timestamp,url,title > index.csv # export to csv spreadshe
# (if using Docker Compose, add the -T flag when piping)
# docker compose run -T archivebox list --html 'https://example.com' > index.html
-```
+
The paths in the static exports are relative, so keep the exported files next to your `./archive` folder when backing them up or viewing them.
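Because the export only links into `./archive` by relative path, the two need to travel together. The sketch below is one possible way to bundle them, not part of ArchiveBox itself: `backup_static_export` is a hypothetical helper name, and it assumes the export was written as `index.html` in the same directory that contains `archive/`.

```bash
# Hypothetical helper (not an ArchiveBox command): bundle a static export
# together with the ./archive folder that its relative links point into.
# Assumes index.html sits directly next to archive/ inside the data dir.
backup_static_export() {
    data_dir="$1"   # directory containing index.html and archive/
    out_file="$2"   # absolute path of the tarball to create

    # Refuse to create a backup that would contain dangling links.
    [ -f "$data_dir/index.html" ] || { echo "missing $data_dir/index.html" >&2; return 1; }
    [ -d "$data_dir/archive" ] || { echo "missing $data_dir/archive/" >&2; return 1; }

    # -C stores both entries relative to the data dir, so they remain
    # siblings when the tarball is unpacked somewhere else.
    tar -czf "$out_file" -C "$data_dir" index.html archive
}

# Example call (paths are placeholders):
# backup_static_export "$HOME/archivebox/data" "$HOME/archivebox-export.tar.gz"
```

Unpacking the tarball anywhere recreates `index.html` beside `archive/`, so the relative links keep resolving.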
-#### Learn More
-
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive#2-export-and-host-it-as-static-html
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#publishing
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#public_index--public_snapshots--public_add_view
+# don't save private content to ArchiveBox, e.g.:
archivebox add 'https://docs.google.com/document/d/12345somePrivateDocument'
archivebox add 'https://vimeo.com/somePrivateVideo'
@@ -893,19 +902,22 @@ archivebox manage createsuperuser
# if extra paranoid or anti-Google:
archivebox config --set SAVE_FAVICON=False # disable favicon fetching (it calls a Google API passing the URL's domain part only)
archivebox config --set CHROME_BINARY=chromium # ensure it's using Chromium instead of Chrome
-```
+
-> *CAUTION: Assume anyone *viewing* your archives will be able to see any cookies, session tokens, or private URLs passed to ArchiveBox during archiving.*
-> *Make sure to secure your ArchiveBox data and don't share snapshots with others without stripping out sensitive headers and content first.*
+
+
-#### Learn More
-
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_user_data_dir
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#cookies_file
+CAUTION: Assume anyone viewing your archives will be able to see any cookies, session tokens, or private URLs passed to ArchiveBox during archiving.
+Make sure to secure your ArchiveBox data and don't share snapshots with others without stripping out sensitive headers and content first.
+
# visiting an archived page with malicious JS:
https://127.0.0.1:8000/archive/1602401954/example.com/index.html
# example.com/index.js can now make a request to read everything from:
https://127.0.0.1:8000/index.html
https://127.0.0.1:8000/archive/*
# then example.com/index.js can send it off to some evil server
-```
+
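The URLs above share the scheme, host, and port `https://127.0.0.1:8000`, which is why a script replayed from one snapshot can read the index and other snapshots. The sketch below makes that comparison explicit. It is a simplified string comparison and not a full implementation of browser origin rules (for example, it does not normalize default ports); `origin_of` and `same_origin` are hypothetical helper names.

```bash
# Hypothetical helpers: compare URLs by origin (scheme://host[:port]).
# Simplified on purpose; browsers apply additional normalization rules.
origin_of() {
    # Keep everything up to (not including) the first "/" after "://".
    printf '%s\n' "$1" | sed -E 's#^([A-Za-z][A-Za-z0-9+.-]*://[^/]+).*$#\1#'
}

same_origin() {
    # Succeeds (exit 0) when both URLs reduce to the same origin string.
    [ "$(origin_of "$1")" = "$(origin_of "$2")" ]
}

# Example: the archived page and the main index compare as same-origin.
# same_origin 'https://127.0.0.1:8000/archive/1602401954/example.com/index.html' \
#             'https://127.0.0.1:8000/index.html' && echo "same origin"
```

Serving archived content from a different port or hostname would make this comparison fail, which is the separation that keeps replayed JS away from the rest of the archive.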
-The admin UI is also served from the same origin as replayed JS, so malicious pages could also potentially use your ArchiveBox login cookies to perform admin actions (e.g. adding/removing links, running extractors, etc.). We are planning to fix this security shortcoming in a future version by using separate ports/origins to serve the Admin UI and archived content (see [Issue #239](https://github.com/ArchiveBox/ArchiveBox/issues/239)).
-
-> *NOTE: Only the `wget` & `dom` extractor methods execute archived JS when viewing snapshots, all other archive methods produce static output that does not execute JS on viewing.*
-> *If you are worried about these issues ^ you should disable these extractors using `archivebox config --set SAVE_WGET=False SAVE_DOM=False`.*
-
-#### Learn More
-
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview
-- https://github.com/ArchiveBox/ArchiveBox/issues/239
-- https://github.com/ArchiveBox/ArchiveBox/security/advisories/GHSA-cr45-98w9-gwqx (`CVE-2023-45815`)
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#publishing
+
+
+NOTE: Only the `wget` & `dom` extractor methods execute archived JS when viewing snapshots, all other archive methods produce static output that does not execute JS on viewing.
+If you are worried about these issues ^ you should disable these extractors using `archivebox config --set SAVE_WGET=False SAVE_DOM=False`.
`CHROME_USER_AGENT`, `WGET_USER_AGENT`, `CURL_USER_AGENT` to impersonate a real browser (instead of an ArchiveBox bot)
`CHROME_DATA_DIR` & `COOKIES_FILE`
`reddit.com/some/url` -> `teddit.net/some/url`: https://github.com/mendel5/alternative-front-ends
archivebox add 'https://example.com#2020-10-24'
...
archivebox add 'https://example.com#2020-10-25'
-```
+
The