mirror of
https://github.com/pewdiepie-archdaemon/odysseus.git
synced 2026-06-15 17:25:26 -04:00
Add optional markitdown extraction for Office/EPUB documents (#766)
Office documents were dropped server-side: .docx fell through to "[Attached document file]", .xlsx/.pptx weren't recognized at all, and the personal-docs RAG index only covered txt/md/json/pdf.

Wire the optional markitdown dependency (MIT, Microsoft) into both the chat-attachment path (build_user_content) and the RAG indexer (personal_docs), converting .docx/.xlsx/.pptx/.xls/.epub to Markdown. It is lazy-imported with graceful fallback (mirrors src/pdf_runtime.py): without it those formats show an "install to extract" banner and the MIT core is unaffected. pypdf stays the default PDF path.

- src/markitdown_runtime.py: optional-dep loader + convert_to_markdown
- upload_handler: recognize Office/EPUB extensions + MIME types
- document_processor: extract Office docs in the chat else-branch
- personal_docs: index Office docs (DEFAULT_EXTENSIONS + dispatch)
- requirements-optional.txt + ACKNOWLEDGMENTS.md: pinned markitdown 0.1.5
- tests: markitdown_runtime + office index coverage

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
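The lazy-import-with-fallback pattern described above can be sketched roughly as follows. This is not the contents of src/markitdown_runtime.py: only the name convert_to_markdown comes from the commit message, while OFFICE_EXTENSIONS, extract_office_text, and the banner wording are hypothetical names chosen for illustration. The only markitdown calls used are `MarkItDown()` and `.convert(path).text_content`.

```python
from pathlib import Path
from typing import Optional

# Hypothetical constant: the formats the commit message says are routed to markitdown.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".xls", ".epub"}

_converter = None          # cached MarkItDown instance, created on first use
_load_attempted = False    # ensures the import is tried only once


def _load_converter():
    """Import markitdown on first use; return None when it is not installed."""
    global _converter, _load_attempted
    if not _load_attempted:
        _load_attempted = True
        try:
            from markitdown import MarkItDown  # optional dependency
            _converter = MarkItDown()
        except ImportError:
            _converter = None
    return _converter


def convert_to_markdown(path: str) -> Optional[str]:
    """Return the document as Markdown, or None if conversion is not possible."""
    converter = _load_converter()
    if converter is None:
        return None
    try:
        return converter.convert(path).text_content
    except Exception:
        # Unreadable or missing file, or a missing format-specific extra.
        return None


def extract_office_text(path: str) -> Optional[str]:
    """Hypothetical caller: Markdown text, a bracketed banner, or None for other formats."""
    name = Path(path).name
    if Path(path).suffix.lower() not in OFFICE_EXTENSIONS:
        return None  # not an Office/EPUB file; other extractors handle it
    if _load_converter() is None:
        return f"[{name}: install markitdown to extract text]"
    text = convert_to_markdown(path)
    return text if text is not None else f"[{name}: text extraction failed]"
```

Because the import happens inside `_load_converter`, a process that never touches an Office file never attempts to load markitdown, and a missing install degrades to a banner instead of an error.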
committed by
GitHub
parent
610968f91e
commit
f58fbc8b85
@@ -118,6 +118,7 @@ Core (`requirements.txt`) and optional (`requirements-optional.txt`):
 | croniter | MIT |
 | pytest / pytest-asyncio | MIT / Apache-2.0 |
 | duckduckgo-search (optional) | MIT |
+| markitdown (optional — Office/EPUB text extraction) | MIT |
 | **PyMuPDF** *(optional — form-filling only)* | **AGPL-3.0** — see note below |
 
 ## Companion services (interoperated with, not bundled)
@@ -152,6 +153,9 @@ concerns from earlier are resolved:
 deployment (Artifex also sells a commercial PyMuPDF license that lifts this).
 - **`caldav`** (Python lib) is **dual-licensed GPL-3.0-or-later OR Apache-2.0**.
 Odysseus uses it under **Apache-2.0**, which is permissive and MIT-compatible.
+- **`markitdown`** (Microsoft) is **MIT** and used only as an *optional* dependency for Office/EPUB text
+extraction (`src/markitdown_runtime.py`), lazy-imported with graceful fallback — the MIT core runs without
+it. The cloud `az-doc-intel` extra is deliberately **not** installed, keeping extraction fully local.
 
 ---
 