ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2025-05-24 19:54:25 -04:00

Author	SHA1	Message	Date
Ross Williams	310b4d1242	Add htmltotext extractor. Saves HTML text nodes and selected element attributes in `htmltotext.txt` for each Snapshot. Primarily intended to be used for search indexing.	2023-10-23 21:42:32 -04:00
Ross Williams	6555719489	Add space after tags when extracting text. Add space after any close tag to ensure that tokens that would be rendered separate in HTML get extracted as separate tokens in text. Example: `<p>First</p><p>Second</p>` --> `First Second`, NOT `FirstSecond`.	2023-10-16 09:59:08 -04:00
Ross Williams	d8aa84ac98	Make extracting text for indexing optional. Add a configuration option to enable/disable HTML text extraction for indexing.	2023-10-12 13:14:39 -04:00
Ross Williams	b6a20c962a	Extract text from singlefile.html when indexing. singlefile.html contains a lot of large strings in the form of `data:` URLs, which can be unnecessarily stored in full-text indices. Also, large chunks of JavaScript shouldn't be indexed, either, as they pollute search results for searches about JS functions, etc. This commit takes a blanket approach of parsing singlefile.html as it is read and only outputting text and selected textual attributes (like `alt`) for indexing.	2023-10-12 13:06:35 -04:00
Nick Sweeting	f67a5a215a	fix readability indexing process and implement a max total character length on indexed content	2021-04-06 02:01:38 -04:00
Nick Sweeting	bd6d9c165b	enforce utf8 on literally all file operations because windows sucks	2021-03-27 01:16:29 -04:00
Nick Sweeting	24e24934f7	add headers.json and fix relative singlefile path resolving for sonic	2021-01-30 21:59:34 -05:00
JDC	db9c2edccc	Add log print for url indexing	2020-12-06 01:14:38 +02:00
JDC	caf4660ac8	Add indexing to update command and utilities	2020-12-06 01:14:37 +02:00

9 commits
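
The text-extraction behaviour described in commits b6a20c962a and 6555719489 can be illustrated with a minimal sketch. This is not the code from those commits: it assumes Python's standard `html.parser`, and the names `TextExtractor`, `html_to_text`, `SKIPPED_TAGS`, and `TEXT_ATTRS` (including which tags and attributes they list) are hypothetical choices made for illustration. It emits only text nodes and a few textual attributes, drops `<script>` and `<style>` content and `data:` URLs held in other attributes, and adds a space after every close tag so adjacent elements stay separate tokens.

```python
from html.parser import HTMLParser

# Hypothetical choices for this sketch; the real extractor may use other lists.
SKIPPED_TAGS = {"script", "style"}   # content that should never be indexed
TEXT_ATTRS = {"alt", "title"}        # attributes whose values read as text


class TextExtractor(HTMLParser):
    """Collects text nodes and selected attribute values from HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        # Keep only textual attributes; src/href (and any data: URLs) are dropped.
        for name, value in attrs:
            if name in TEXT_ATTRS and value:
                self.chunks.append(value + " ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
            return
        if not self._skip_depth:
            # Space after every close tag keeps neighbouring tokens apart.
            self.chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return whitespace-normalised text suitable for a search index."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    print(html_to_text("<p>First</p><p>Second</p>"))
```

Collapsing whitespace at the end is a design choice of this sketch; it keeps the extra spaces added after close tags from accumulating in the indexed text.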