Add optional markitdown extraction for Office/EPUB documents (#766)

Office documents were dropped server-side: .docx fell through to
"[Attached document file]", .xlsx/.pptx weren't recognized at all, and
the personal-docs RAG index only covered txt/md/json/pdf.

Wire the optional markitdown dependency (MIT, Microsoft) into both the
chat-attachment path (build_user_content) and the RAG indexer
(personal_docs), converting .docx/.xlsx/.pptx/.xls/.epub to Markdown.
markitdown is lazy-imported with a graceful fallback (mirroring
src/pdf_runtime.py): when it is not installed, those formats show an
"install to extract" banner and the MIT core is unaffected. pypdf stays
the default PDF path.
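
A minimal sketch of the lazy-import-with-fallback pattern described above,
not the actual contents of src/markitdown_runtime.py. The helper names
(load_markitdown, is_office_document, extract_or_banner), the signature of
convert_to_markdown, and the banner wording are assumptions for illustration.
Only MarkItDown().convert(path).text_content is taken from markitdown's
documented API.

```python
import importlib
from pathlib import Path
from typing import Optional

# Extensions named in this commit message.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".xls", ".epub"}


def load_markitdown():
    """Return the MarkItDown class, or None when the optional package is absent.

    The import happens only when this function is called, so the core
    application starts normally without markitdown installed.
    """
    try:
        module = importlib.import_module("markitdown")
    except ImportError:
        return None
    return getattr(module, "MarkItDown", None)


def is_office_document(path: str) -> bool:
    """Hypothetical helper: case-insensitive extension check."""
    return Path(path).suffix.lower() in OFFICE_EXTENSIONS


def convert_to_markdown(path: str) -> Optional[str]:
    """Return Markdown text, or None if the format is unsupported,
    markitdown is not installed, or the conversion fails."""
    if not is_office_document(path):
        return None
    converter_cls = load_markitdown()
    if converter_cls is None:
        return None
    try:
        return converter_cls().convert(path).text_content
    except Exception:
        # Graceful fallback: a broken or missing file must not break the caller.
        return None


def extract_or_banner(path: str) -> str:
    """Hypothetical caller: extracted text, or a placeholder banner."""
    text = convert_to_markdown(path)
    if text is not None:
        return text
    # Assumed wording; the real banner text may differ.
    return f"[Attached document file: {Path(path).name} (install markitdown to extract text)]"
```

In this sketch a caller such as build_user_content would call
extract_or_banner(path) and always get a string back, whether or not the
optional dependency is present.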

- src/markitdown_runtime.py: optional-dep loader + convert_to_markdown
- upload_handler: recognize Office/EPUB extensions + MIME types
- document_processor: extract Office docs in the chat else-branch
- personal_docs: index Office docs (DEFAULT_EXTENSIONS + dispatch)
- requirements-optional.txt + ACKNOWLEDGMENTS.md: pinned markitdown 0.1.5
- tests: markitdown_runtime + office index coverage

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Marius Oppedal Ringsby
2026-06-02 04:28:52 +02:00
committed by GitHub
parent 610968f91e
commit f58fbc8b85
8 changed files with 241 additions and 4 deletions
@@ -118,6 +118,7 @@ Core (`requirements.txt`) and optional (`requirements-optional.txt`):
| croniter | MIT |
| pytest / pytest-asyncio | MIT / Apache-2.0 |
| duckduckgo-search (optional) | MIT |
| markitdown (optional — Office/EPUB text extraction) | MIT |
| **PyMuPDF** *(optional — form-filling only)* | **AGPL-3.0** — see note below |
## Companion services (interoperated with, not bundled)
@@ -152,6 +153,9 @@ concerns from earlier are resolved:
deployment (Artifex also sells a commercial PyMuPDF license that lifts this).
- **`caldav`** (Python lib) is **dual-licensed GPL-3.0-or-later OR Apache-2.0**.
Odysseus uses it under **Apache-2.0**, which is permissive and MIT-compatible.
- **`markitdown`** (Microsoft) is **MIT** and used only as an *optional* dependency for Office/EPUB text
extraction (`src/markitdown_runtime.py`), lazy-imported with graceful fallback — the MIT core runs without
it. The cloud `az-doc-intel` extra is deliberately **not** installed, keeping extraction fully local.
---