Commit graph

9 commits

Author SHA1 Message Date
Ross Williams
310b4d1242 Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
Ross Williams
6555719489 Add space after tags when extracting text
Add space after any close tag to ensure that
tokens that would be rendered separate in HTML
get extracted as separate tokens in text.

Example:

`<p>First</p><p>Second</p>` --> `First Second`
NOT `FirstSecond`
2023-10-16 09:59:08 -04:00
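
The behaviour described in 6555719489 can be sketched with Python's standard `html.parser`. This is a minimal illustration, not the code from the commit; the names `SpacedTextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class SpacedTextExtractor(HTMLParser):
    """Collect text nodes, emitting a space after every closing tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        # A space after each close tag keeps adjacent elements from fusing.
        self.parts.append(" ")

    def text(self):
        # Collapse runs of whitespace and trim the trailing space.
        return " ".join("".join(self.parts).split())


def extract_text(html):
    parser = SpacedTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


print(extract_text("<p>First</p><p>Second</p>"))  # First Second
```

One consequence of adding the space after any close tag is that inline markup also splits tokens: `<b>a</b>b` comes out as `a b` in this sketch.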
Ross Williams
d8aa84ac98 Make extracting text for indexing optional
Add a configuration option to enable/disable HTML text extraction
for indexing
2023-10-12 13:14:39 -04:00
Ross Williams
b6a20c962a Extract text from singlefile.html when indexing
singlefile.html contains a lot of large strings in the form of `data:`
URLs, which can be unnecessarily stored in full-text indices. Also,
large chunks of JavaScript shouldn't be indexed, either, as they pollute
search results for searches about JS functions, etc.

This commit takes a blanket approach of parsing singlefile.html as it is
read and only outputting text and selected textual attributes (like
`alt`) for indexing.
2023-10-12 13:06:35 -04:00
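
A rough sketch of the "text and selected attributes only" approach from b6a20c962a, again using the standard `html.parser`. The skipped tags and the attribute allowlist are assumptions for illustration (the commit message names only `alt`), and `IndexTextExtractor` and `index_text` are hypothetical names.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style"}   # assumption: elements whose content is dropped
TEXT_ATTRIBUTES = {"alt", "title"}   # assumption: the commit message names only `alt`


class IndexTextExtractor(HTMLParser):
    """Keep text nodes and allowlisted attribute values; drop everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        for name, value in attrs:
            # Attributes such as `src` or `href` (where `data:` URLs live)
            # are not in the allowlist, so they never reach the index.
            if name in TEXT_ATTRIBUTES and value:
                self.parts.append(value)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore text while inside a skipped element such as <script>.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def index_text(html):
    parser = IndexTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Because only allowlisted attributes are emitted, `data:` URLs are excluded without any special-case check for them.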
Nick Sweeting
f67a5a215a fix readability indexing process and implement a max total character length on indexed content
2021-04-06 02:01:38 -04:00
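
The commit message for f67a5a215a does not say how the character limit is applied. One plausible reading, sketched here purely as an assumption, is a shared character budget across all texts collected for a snapshot; `limit_total_chars` and `max_chars` are hypothetical names.

```python
def limit_total_chars(texts, max_chars):
    """Return the texts in order, truncated so their combined length
    never exceeds max_chars."""
    remaining = max_chars
    limited = []
    for text in texts:
        if remaining <= 0:
            break
        piece = text[:remaining]  # truncate the text that crosses the budget
        limited.append(piece)
        remaining -= len(piece)
    return limited


print(limit_total_chars(["abcd", "efgh", "ijkl"], 6))  # ['abcd', 'ef']
```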
Nick Sweeting
bd6d9c165b enforce utf8 on literally all file operations because windows sucks
2021-03-27 01:16:29 -04:00
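
For context on bd6d9c165b: when no encoding is given, Python's text-mode file operations fall back to a locale-dependent default, which on Windows is often not UTF-8. A minimal sketch of passing the encoding explicitly follows; the file name and helper names are made up for illustration.

```python
import tempfile
from pathlib import Path


def write_utf8(path, text):
    # Explicit encoding, so the result does not depend on the OS locale.
    Path(path).write_text(text, encoding="utf-8")


def read_utf8(path):
    return Path(path).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.txt"  # hypothetical file name
    write_utf8(target, "naïve café ✓")
    round_tripped = read_utf8(target)
```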
Nick Sweeting
24e24934f7 add headers.json and fix relative singlefile path resolving for sonic
2021-01-30 21:59:34 -05:00
JDC
db9c2edccc Add log print for url indexing
2020-12-06 01:14:38 +02:00
JDC
caf4660ac8 Add indexing to update command and utilities
2020-12-06 01:14:37 +02:00