fix(chat): keep balanced trailing ')' when extracting URLs (#3406)

extract_urls() stripped any trailing ')' unconditionally via
`re.sub(r'[.,;:!?\)]+$', '', url)`. That corrupts URLs that legitimately
end in a parenthesis, most commonly Wikipedia article links with a
parenthetical disambiguator such as
https://en.wikipedia.org/wiki/Python_(programming_language), which became
...Python_(programming_language and then returned a 404 when fetched by the
web/research tools.

Strip trailing sentence punctuation as before, but only drop a ')' when it is
unbalanced (more ')' than '('), so a prose-glued "(see https://example.com)"
still loses its closing paren while balanced URLs keep theirs.

Added tests/test_extract_urls.py covering balanced, unbalanced, nested, and
trailing-punctuation cases.
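
For illustration only, here is a minimal sketch of the cleanup rule applied to one input from each category. The helper name `clean_url` and the sample inputs are hypothetical: the real logic lives inline in extract_urls(), and the contents of tests/test_extract_urls.py are not shown on this page.

```python
import re


def clean_url(url: str) -> str:
    """Hypothetical standalone copy of the per-URL cleanup from this commit."""
    # Drop trailing sentence punctuation first.
    url = re.sub(r'[.,;:!?]+$', '', url)
    # Drop a trailing ')' only while closing parens outnumber opening ones,
    # re-stripping any punctuation exposed by each removal.
    while url.endswith(')') and url.count(')') > url.count('('):
        url = re.sub(r'[.,;:!?]+$', '', url[:-1])
    return url


wiki = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Illustrative (input, expected) pairs, one or more per category.
cases = [
    (wiki, wiki),                                                      # balanced
    ("https://example.com)", "https://example.com"),                   # unbalanced
    ("https://example.com/a_(b_(c))", "https://example.com/a_(b_(c))"),  # nested
    ("https://example.com/page.", "https://example.com/page"),         # trailing punctuation
    (wiki + ").", wiki),                                               # prose-glued ')' plus '.'
    ("https://example.com.)", "https://example.com"),                  # '.' hidden behind ')'
]

for raw, expected in cases:
    assert clean_url(raw) == expected, (raw, clean_url(raw))
```

The loop matters for inputs like "https://example.com.)": the first punctuation pass sees the ')' at the end and strips nothing, so the '.' is only removed after the unbalanced ')' has been dropped.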
Mazen Tamer Salah
2026-06-08 22:33:29 +03:00
committed by GitHub
parent 932b7f2446
commit 8e494cc1c4
2 changed files with 46 additions and 1 deletions
+8 -1
@@ -24,7 +24,14 @@ def extract_urls(text: str) -> List[str]:
     urls = re.findall(url_pattern, text)
     cleaned_urls = []
     for url in urls:
-        url = re.sub(r'[.,;:!?\)]+$', '', url)
+        # Strip trailing sentence punctuation, but keep a balanced ')' so URLs
+        # that legitimately end in one are preserved, e.g. the Wikipedia link
+        # ".../Python_(programming_language)". A ')' is only dropped when it is
+        # unbalanced (more ')' than '('), which is the prose-glued case such as
+        # "(see https://example.com)".
+        url = re.sub(r'[.,;:!?]+$', '', url)
+        while url.endswith(')') and url.count(')') > url.count('('):
+            url = re.sub(r'[.,;:!?]+$', '', url[:-1])
         cleaned_urls.append(url)
     return cleaned_urls