mirror of
https://github.com/pewdiepie-archdaemon/odysseus.git
synced 2026-06-16 09:45:24 -04:00
Removing one RAG directory destroyed the whole shared ChromaDB collection
(all owners + base index) instead of just that directory's chunks. Shared
root cause: PersonalDocsManager.remove_directory called rebuild_index()
(delete_collection + recreate) then re-indexed only the remaining tracked
dirs (ownerless, never personal_dir). The targeted VectorRAG.remove_directory
that should have been used was itself broken (where={"source":{"$contains":dir}}
selects nothing on scalar metadata and would over-delete siblings), and the
dead do_manage_rag path fired a second unconditional rebuild.
- VectorRAG.remove_directory: select chunks in Python by a path-boundary match
on the stored absolute `source` (dir or dir+os.sep), abspath-normalized.
Keys on `source` (always written), never `owner` -- no migration.
- PersonalDocsManager.remove_directory: call the targeted remove instead of
rebuild_index() + partial reindex.
- do_manage_rag (dead code): drop the second rebuild_index() (hygiene).
- rag_server.py add path: abspath so indexed `source` matches the remove.
No schema change. Prevents future wipes (does not recover already-wiped
vectors). Adds hermetic regression tests at three layers.
Fixes #1660
Co-authored-by: Ethan <23321960+0xLeathery@users.noreply.github.com>
This commit is contained in:
+25
-9
@@ -380,20 +380,36 @@ class VectorRAG:
|
||||
return {'success': False, 'indexed_count': indexed, 'failed_count': failed, 'message': str(e)}
|
||||
|
||||
def remove_directory(self, directory: str) -> Dict[str, Any]:
|
||||
"""Remove all chunks from a directory. O(1) per chunk via ChromaDB."""
|
||||
"""Remove all chunks under ``directory`` (recursively), and nothing else.
|
||||
|
||||
Selection is a Python-side path-boundary match on each chunk's stored
|
||||
``source`` full path, NOT a Chroma metadata ``where`` filter. No Chroma
|
||||
metadata operator selects a scalar string by path prefix (``$contains``
|
||||
targets document content / list membership, not a ``source`` substring),
|
||||
and a plain substring would over-delete siblings — removing ``/docs``
|
||||
must not touch ``/docs2`` or ``/docs_personal``. We therefore match
|
||||
``source == directory`` or ``source`` startswith ``directory + os.sep``,
|
||||
the same boundary rule add_directory uses for exclusions. ``directory``
|
||||
is abspath-normalized so it matches the absolute ``source`` that indexing
|
||||
always stores, regardless of how the caller passed it in.
|
||||
"""
|
||||
if not self.healthy:
|
||||
return {"success": False, "message": "Collection not initialized"}
|
||||
directory = os.path.abspath(directory)
|
||||
try:
|
||||
# Use ChromaDB where filter to find all docs from this directory
|
||||
results = self._collection.get(
|
||||
where={"source": {"$contains": directory}} if "/" in directory else {"directory": directory},
|
||||
include=["metadatas"],
|
||||
)
|
||||
if not results['ids']:
|
||||
results = self._collection.get(include=["metadatas"])
|
||||
ids = [
|
||||
results["ids"][i]
|
||||
for i, m in enumerate(results["metadatas"])
|
||||
if isinstance(m, dict)
|
||||
and isinstance(m.get("source"), str)
|
||||
and (m["source"] == directory or m["source"].startswith(directory + os.sep))
|
||||
]
|
||||
if not ids:
|
||||
return {"success": True, "removed_count": 0, "message": "No docs found"}
|
||||
|
||||
self._collection.delete(ids=results['ids'])
|
||||
n = len(results['ids'])
|
||||
self._collection.delete(ids=ids)
|
||||
n = len(ids)
|
||||
logger.info(f"Removed {n} chunks from {directory}")
|
||||
return {"success": True, "removed_count": n, "message": f"Removed {n} chunks"}
|
||||
except Exception as e:
|
||||
|
||||
Reference in New Issue
Block a user