mirror of https://github.com/pewdiepie-archdaemon/odysseus.git
synced 2026-06-15 17:25:26 -04:00
Add optional markitdown extraction for Office/EPUB documents (#766)
Office documents were dropped server-side: .docx fell through to "[Attached document file]", .xlsx/.pptx weren't recognized at all, and the personal-docs RAG index only covered txt/md/json/pdf.

Wire the optional markitdown dependency (MIT, Microsoft) into both the chat-attachment path (build_user_content) and the RAG indexer (personal_docs), converting .docx/.xlsx/.pptx/.xls/.epub to Markdown. It is lazy-imported with graceful fallback (mirrors src/pdf_runtime.py): without it those formats show an "install to extract" banner and the MIT core is unaffected. pypdf stays the default PDF path.

- src/markitdown_runtime.py: optional-dep loader + convert_to_markdown
- upload_handler: recognize Office/EPUB extensions + MIME types
- document_processor: extract Office docs in the chat else-branch
- personal_docs: index Office docs (DEFAULT_EXTENSIONS + dispatch)
- requirements-optional.txt + ACKNOWLEDGMENTS.md: pinned markitdown 0.1.5
- tests: markitdown_runtime + office index coverage

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
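
The lazy-import-with-fallback behaviour described above could look roughly like the sketch below. It is not the contents of src/markitdown_runtime.py: the helper name `load_optional`, the `module_name` parameter, and the banner wording are assumptions made for illustration. Only `MarkItDown().convert(...)` and its `text_content` attribute come from markitdown's public API.

```python
import importlib
from functools import lru_cache
from pathlib import Path
from typing import Any, Optional

# Hypothetical banner text; the real wording in the repository may differ.
INSTALL_BANNER = "[{name}: install markitdown to extract this document]"


@lru_cache(maxsize=None)
def load_optional(module_name: str = "markitdown") -> Optional[Any]:
    """Import the module on first use; return None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def convert_to_markdown(path: str, module_name: str = "markitdown") -> str:
    """Return Markdown for the file, or a banner when extraction is unavailable."""
    module = load_optional(module_name)
    if module is None:
        # Graceful fallback: the caller still gets a string to show the user.
        return INSTALL_BANNER.format(name=Path(path).name)
    try:
        result = module.MarkItDown().convert(path)
        return result.text_content
    except Exception as exc:  # a broken file should not crash the caller
        return f"[{Path(path).name}: extraction failed: {exc}]"
```

Because the import happens inside the function rather than at module load, code paths that never touch an Office file pay no import cost, and a missing package degrades to a message instead of an `ImportError`. One consequence of caching the result is that installing the package later only takes effect after a restart.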
committed by GitHub
parent 610968f91e
commit f58fbc8b85
@@ -23,3 +23,14 @@ duckduckgo-search
 # network-served app — see ACKNOWLEDGMENTS.md. The MIT core (PDF *text*
 # extraction via pypdf) works without it; this only unlocks form-filling.
 PyMuPDF
+
+# Office / EPUB document text extraction (chat attachments + the personal-docs
+# RAG index). markitdown (MIT, Microsoft) converts .docx/.xlsx/.pptx/.xls/.epub
+# to Markdown — more token-efficient and model-legible than a raw dump. Optional
+# and lazy-imported via src/markitdown_runtime.py; without it those formats fall
+# back to a friendly "install to extract" banner and the core stays pure-MIT.
+# Extras pull mammoth/lxml/python-pptx/pandas/openpyxl/xlrd; the base also pulls
+# magika (onnxruntime), already a core dep via fastembed. We avoid the
+# [all]/Azure/audio extras (cloud + heavy). Pinned to a release >30 days old per
+# the dependency-age discussion in issue #485.
+markitdown[docx,pptx,xlsx,xls]==0.1.5
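
The comment block above names which formats go to markitdown, while the commit message says pypdf stays the default PDF path. A minimal sketch of that routing by file extension follows. `DEFAULT_EXTENSIONS` is named in the commit message, but its contents here, along with `TEXT_EXTENSIONS`, `OFFICE_EXTENSIONS`, and `pick_extractor`, are assumptions for illustration rather than the repository's actual code.

```python
from pathlib import Path

# Formats the commit message says were already indexed as plain text.
TEXT_EXTENSIONS = {".txt", ".md", ".json"}
# Formats the commit routes through the optional markitdown dependency.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".xls", ".epub"}
# Assumed shape of the indexer's allow-list after this change.
DEFAULT_EXTENSIONS = TEXT_EXTENSIONS | {".pdf"} | OFFICE_EXTENSIONS


def pick_extractor(path: str) -> str:
    """Name the extraction backend for a file, based on its extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive: "Report.DOCX" matches
    if suffix in TEXT_EXTENSIONS:
        return "text"
    if suffix == ".pdf":
        return "pypdf"
    if suffix in OFFICE_EXTENSIONS:
        return "markitdown"
    return "unsupported"
```

Keeping the extension sets as data means supporting another format is a one-line change to a set, and the same sets can back both the upload recognizer and the indexer.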