ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2025-05-16 16:14:28 -04:00

Ross Williams 310b4d1242 Add htmltotext extractor. Saves HTML text nodes and selected element attributes in `htmltotext.txt` for each Snapshot. Primarily intended to be used for search indexing.	2023-10-23 21:42:32 -04:00
__init__.py	add proper support for URL_WHITELIST instead of using negation regexes	2021-07-06 23:42:00 -04:00
csv.py	split up utils into separate files	2019-04-30 23:13:04 -04:00
html.py	Add htmltotext extractor	2023-10-23 21:42:32 -04:00
json.py	fix extra arg	2021-04-13 02:21:51 -04:00
schema.py	Add htmltotext extractor	2023-10-23 21:42:32 -04:00
sql.py	rename TAG_SEPARATORS to TAG_SEPARATOR_PATTERN	2022-01-06 14:14:41 +00:00