mirror of https://github.com/ArchiveBox/ArchiveBox.git, synced 2025-05-13 14:44:29 -04:00
Fix HTML title parsing bugs.
This slightly modifies HTML_TITLE_REGEX to fix two parsing errors. The first occurred when the title tag was empty (e.g. "<title></title>"), which was parsed as "</title". The second occurred when the title was a single character (e.g. "<title>A</title>"), which the regex did not match, so the title fell back to link.base_url. Now empty tags fall back to link.base_url, and single-character titles are parsed correctly. The regex is still a bit wonky for some edge cases. I couldn't find any cases of incorrect behavior, but it might still be worth reworking more completely for robustness.
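The two cases described above can be reproduced with a minimal sketch. The two patterns and the flags are taken from the diff in this commit; the `extract_title` helper and the `FALLBACK` value are hypothetical stand-ins for the surrounding ArchiveBox code and for link.base_url, which are not shown here.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL | re.UNICODE

# Pattern before this commit: the leading "." always consumes one extra character.
OLD_TITLE_REGEX = re.compile(r'<title.*?>' r'(.[^<>]+)', FLAGS)
# Pattern after this commit: the capture group starts directly after the tag.
NEW_TITLE_REGEX = re.compile(r'<title.*?>' r'([^<>]+)', FLAGS)

# Hypothetical stand-in for link.base_url (an assumption, not from the source).
FALLBACK = "example.com"


def extract_title(regex, html, fallback=FALLBACK):
    """Hypothetical helper: return the captured title, or the fallback if no match."""
    match = regex.search(html)
    return match.group(1) if match else fallback


EMPTY = "<title></title>"
SINGLE = "<title>A</title>"
NORMAL = "<title>Hello World</title>"

old_results = [extract_title(OLD_TITLE_REGEX, s) for s in (EMPTY, SINGLE, NORMAL)]
new_results = [extract_title(NEW_TITLE_REGEX, s) for s in (EMPTY, SINGLE, NORMAL)]

print("old:", old_results)
print("new:", new_results)
```

With the old pattern, the empty tag yields "</title" and the single-character title falls through to the fallback. With the new pattern, the empty tag falls through to the fallback and "A" is captured. These inputs contain nothing after the closing tag, so they do not exercise the remaining edge cases the message mentions.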
parent 4950cee3b6
commit 77917e9b55
1 changed file with 1 addition and 1 deletion
@@ -26,7 +26,7 @@ from ..logging_util import TimedProgress
 
 HTML_TITLE_REGEX = re.compile(
     r'<title.*?>'                      # start matching text after <title> tag
-    r'(.[^<>]+)',                      # get everything up to these symbols
+    r'([^<>]+)',                       # get everything up to these symbols
     re.IGNORECASE | re.MULTILINE | re.DOTALL | re.UNICODE,
 )
 