docs: add agent migration manifest helper (#3028)

* docs: add agent migration manifest helper * fix: use stat+streamed hash for metadata-only archive scans When include_content is false, skip reading full file content and only stat+stream-hash for size and sha256. Avoids spurious skipped- content warnings and keeps large-export previews fast and clean. Closes review feedback on PR #3028. * fix: skip symlinked migration inputs * fix: stream archive traversal warnings * feat: stage conversation threads in agent migration manifests
2026-06-15 09:15:29 -04:00 · 2026-06-15 16:57:33 +10:00
parent 955455b797
commit f23e2e6ffb
3 changed files with 1169 additions and 0 deletions
@@ -0,0 +1,194 @@
+# Agent migration manifests
+
+Odysseus should be able to learn from another agent without blindly trusting
+that agent's whole state. The safe migration path is:
+
+```text
+source agent export -> source adapter -> agent-migration.v1 manifest -> preview -> apply
+```
+
+The manifest is intentionally source-neutral. OpenClaw, Hermes, a folder of
+Markdown notes, or any other agent can have its own adapter, but Odysseus only
+needs to understand the normalized manifest.
+
+## Why not import everything as memory?
+
+Durable memory should stay compact and useful. Long notes, logs, session
+transcripts, and project archives are useful context, but they are not all
+memories. A good migration keeps two layers separate:
+
+- **Archive documents** preserve source material for search, reading, and later
+  extraction.
+- **Memory candidates** are short facts or preferences that can be reviewed
+  before being saved into Odysseus memory.
+
+This keeps Odysseus' existing memory-review flow intact while giving it better
+source material to review.
+
+## Manifest shape
+
+`agent-migration.v1` is a JSON object:
+
+```json
+{
+  "schema_version": "agent-migration.v1",
+  "generated_at": "2026-06-06T00:00:00Z",
+  "source": {
+    "name": "example-agent",
+    "kind": "generic"
+  },
+  "summary": {
+    "item_count": 3,
+    "counts_by_kind": {
+      "memory": 1,
+      "skill": 1,
+      "conversation_thread": 1,
+      "archive_document": 1
+    },
+    "warning_count": 0
+  },
+  "items": [],
+  "warnings": []
+}
+```
+
+Each item has a stable `id`, a `kind`, source metadata, and enough content for a
+future importer to preview it before applying.
+
+Supported item kinds in the first pass:
+
+- `memory` — a candidate memory with `text`, `category`, `source`, and
+  provenance metadata.
+- `skill` — a `SKILL.md` file with content and parsed frontmatter metadata.
+- `conversation_thread` — a normalized transcript thread from an exported chat
+  history. Message content is optional; adapters can preserve only thread
+  metadata, message counts, timestamps, and hashes when a manifest should stay
+  small or avoid embedding private transcript text.
+- `archive_document` — long-form source material. Content is optional; adapters
+  can preserve only path/hash/size metadata when a manifest should stay small.
+
+## Build a manifest
+
+Use the read-only helper:
+
+```bash
+python3 scripts/agent_migration_manifest.py \
+  --source-name old-agent \
+  --source-kind generic \
+  --memory-json /path/to/memories.json \
+  --skills-dir /path/to/skills \
+  --conversation-json /path/to/conversations.json \
+  --archive /path/to/notes \
+  --output /tmp/agent-migration.json
+```
+
+The helper does not write to `data/`, call an LLM, import Odysseus modules, or
+modify the source. It only writes JSON.
+
+Memory JSON may be:
+
+```json
+[
+  "A plain memory string",
+  {
+    "text": "A categorized memory",
+    "category": "preference",
+    "source": "old-agent"
+  }
+]
+```
+
+or an object containing a list under `memories`, `memory`, `items`, or `data`.
+
+Skills are scanned recursively for `SKILL.md`:
+
+```bash
+python3 scripts/agent_migration_manifest.py \
+  --source-name hermes \
+  --source-kind hermes \
+  --skills-dir ~/.hermes/skills \
+  --output /tmp/hermes-skills-manifest.json
+```
+
+Archive documents are metadata-only by default. To embed text content:
+
+```bash
+python3 scripts/agent_migration_manifest.py \
+  --source-name notes-export \
+  --archive /path/to/markdown-notes \
+  --include-archive-content \
+  --output /tmp/notes-manifest.json
+```
+
+Conversation exports are also metadata-only by default:
+
+```bash
+python3 scripts/agent_migration_manifest.py \
+  --source-name chatgpt-export \
+  --source-kind chatgpt \
+  --conversation-json /path/to/conversations.json \
+  --output /tmp/chatgpt-conversations-manifest.json
+```
+
+The first pass supports generic conversation JSON such as:
+
+```json
+[
+  {
+    "id": "thread-1",
+    "title": "Project plan",
+    "messages": [
+      {"role": "user", "content": "Can we design this?"},
+      {"role": "assistant", "content": "Yes, start with a narrow slice."}
+    ]
+  }
+]
+```
+
+It also recognizes ChatGPT-style `mapping` exports from `conversations.json`.
+To embed normalized messages:
+
+```bash
+python3 scripts/agent_migration_manifest.py \
+  --source-name chatgpt-export \
+  --source-kind chatgpt \
+  --conversation-json /path/to/conversations.json \
+  --include-conversation-content \
+  --max-conversation-messages 2000 \
+  --output /tmp/chatgpt-conversations-with-content.json
+```
+
+Content embedding is explicit because exported chat histories can be huge and
+private. A future source-specific adapter can add ZIP traversal, attachment
+metadata, and provider-specific project/workspace fields while still emitting
+the same `conversation_thread` manifest item.
+
+## Recommended apply behavior
+
+A future Odysseus importer should treat the manifest as untrusted user-provided
+data and apply it in stages:
+
+1. Show a dry-run summary with counts, warnings, duplicates, and sample items.
+2. Back up current `data/` state before writing anything.
+3. Import archive documents as documents or another searchable source, not as
+   memory.
+4. Import conversation threads as searchable archived context first, with
+   citations back to the source thread. Do not turn whole transcripts into
+   memory.
+5. Show memory candidates for review before saving through the normal memory
+   path.
+6. Import skills only after name/category conflict checks.
+7. Skip secrets by default. Credentials need explicit, provider-specific flows.
+
+## What belongs in source adapters?
+
+Adapters can be source-specific. The core manifest should not be.
+
+For example, an OpenClaw adapter may know about OpenClaw's workspace files. A
+Hermes adapter may know about `~/.hermes/config.yaml` and `~/.hermes/skills`.
+A ChatGPT adapter may know about `conversations.json`, uploaded-file metadata,
+and image attachment directories. A Claude adapter may know about Claude's
+export shape and project boundaries. A generic adapter may only know about
+memory JSON, conversation JSON, `SKILL.md`, and Markdown folders.
+
+Nonstandard folders should be adapter details, not required Odysseus concepts.
@@ -0,0 +1,635 @@
+#!/usr/bin/env python3
+"""Build a neutral agent migration manifest.
+
+This helper is intentionally read-only. It does not import the Odysseus
+application package, write to data/, call an LLM, or apply anything. It turns
+common agent export shapes into a portable JSON manifest that Odysseus can
+preview or import later.
+"""
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import mimetypes
+import sys
+from dataclasses import dataclass
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any, Iterable
+
+
+SCHEMA_VERSION = "agent-migration.v1"
+TEXT_EXTENSIONS = {
+    ".cfg",
+    ".conf",
+    ".csv",
+    ".json",
+    ".log",
+    ".md",
+    ".markdown",
+    ".py",
+    ".rst",
+    ".toml",
+    ".txt",
+    ".yaml",
+    ".yml",
+}
+
+
+@dataclass(frozen=True)
+class InputWarning:
+    path: str
+    message: str
+
+
+def utc_now_iso() -> str:
+    return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
+
+
+def sha256_text(text: str) -> str:
+    return hashlib.sha256(text.encode("utf-8")).hexdigest()
+
+
+def sha256_bytes(data: bytes) -> str:
+    return hashlib.sha256(data).hexdigest()
+
+
+def sha256_path(path: Path) -> str:
+    h = hashlib.sha256()
+    with path.open("rb") as f:
+        for chunk in iter(lambda: f.read(65536), b""):
+            h.update(chunk)
+    return h.hexdigest()
+
+
+def stable_id(kind: str, source_name: str, *parts: Any) -> str:
+    raw = "\x1f".join([kind, source_name, *[str(part) for part in parts]])
+    return f"{kind}:{hashlib.sha256(raw.encode('utf-8')).hexdigest()[:16]}"
+
+
+def read_json(path: Path) -> Any:
+    with path.open("r", encoding="utf-8") as handle:
+        return json.load(handle)
+
+
+def normalize_category(value: Any) -> str:
+    category = str(value or "fact").strip().lower()
+    return category or "fact"
+
+
+def normalize_memory_text(item: Any) -> str:
+    if isinstance(item, str):
+        return item.strip()
+    if isinstance(item, dict):
+        for key in ("text", "content", "memory", "value"):
+            value = item.get(key)
+            if isinstance(value, str) and value.strip():
+                return value.strip()
+    return ""
+
+
+def memory_metadata(item: Any, source_path: Path, index: int) -> dict[str, Any]:
+    metadata: dict[str, Any] = {
+        "source_path": str(source_path),
+        "source_index": index,
+    }
+    if isinstance(item, dict):
+        for key in ("id", "timestamp", "created_at", "updated_at", "source", "tags", "pinned"):
+            if key in item:
+                metadata[f"source_{key}"] = item.get(key)
+    return metadata
+
+
+def payload_items(payload: Any, keys: tuple[str, ...]) -> Any:
+    if isinstance(payload, dict):
+        for key in keys:
+            if isinstance(payload.get(key), list):
+                return payload[key]
+    return payload
+
+
+def collect_memory_json(path: Path, source_name: str) -> tuple[list[dict[str, Any]], list[InputWarning]]:
+    warnings: list[InputWarning] = []
+    try:
+        payload = read_json(path)
+    except Exception as exc:
+        return [], [InputWarning(str(path), f"could not read JSON: {exc}")]
+
+    payload = payload_items(payload, ("memories", "memory", "items", "data"))
+
+    if not isinstance(payload, list):
+        return [], [InputWarning(str(path), "expected a JSON list or an object containing a memory list")]
+
+    items: list[dict[str, Any]] = []
+    seen: set[str] = set()
+    for index, item in enumerate(payload):
+        text = normalize_memory_text(item)
+        if not text:
+            warnings.append(InputWarning(str(path), f"skipped memory at index {index}: missing text"))
+            continue
+        digest = sha256_text(text.strip().lower())
+        if digest in seen:
+            warnings.append(InputWarning(str(path), f"skipped duplicate memory at index {index}"))
+            continue
+        seen.add(digest)
+        category = normalize_category(item.get("category") if isinstance(item, dict) else "fact")
+        source = str(item.get("source") or source_name) if isinstance(item, dict) else source_name
+        items.append(
+            {
+                "id": stable_id("memory", source_name, path, index, digest),
+                "kind": "memory",
+                "text": text,
+                "category": category,
+                "source": source,
+                "metadata": memory_metadata(item, path, index),
+            }
+        )
+    return items, warnings
+
+
+def normalize_timestamp(value: Any) -> str | None:
+    if value is None or value == "":
+        return None
+    if isinstance(value, (int, float)):
+        try:
+            return (
+                datetime.fromtimestamp(float(value), timezone.utc)
+                .replace(microsecond=0)
+                .isoformat()
+                .replace("+00:00", "Z")
+            )
+        except (OverflowError, OSError, ValueError):
+            return str(value)
+    return str(value)
+
+
+def normalize_role(value: Any) -> str:
+    role = str(value or "unknown").strip().lower()
+    if role in {"human", "user"}:
+        return "user"
+    if role in {"assistant", "ai", "bot", "model"}:
+        return "assistant"
+    if role in {"system", "tool"}:
+        return role
+    return role or "unknown"
+
+
+def content_part_text(part: Any) -> str:
+    if isinstance(part, str):
+        return part
+    if isinstance(part, dict):
+        for key in ("text", "content", "value"):
+            value = part.get(key)
+            if isinstance(value, str):
+                return value
+        if part.get("type") == "text" and isinstance(part.get("text"), str):
+            return part["text"]
+    return ""
+
+
+def normalize_message_text(message: dict[str, Any]) -> str:
+    content = message.get("content")
+    if isinstance(content, str):
+        return content
+    if isinstance(content, list):
+        return "\n".join(text for text in (content_part_text(part).strip() for part in content) if text)
+    if isinstance(content, dict):
+        parts = content.get("parts")
+        if isinstance(parts, list):
+            return "\n".join(text for text in (content_part_text(part).strip() for part in parts) if text)
+        for key in ("text", "content", "value"):
+            value = content.get(key)
+            if isinstance(value, str):
+                return value
+    for key in ("text", "body", "message"):
+        value = message.get(key)
+        if isinstance(value, str):
+            return value
+    return ""
+
+
+def normalize_message(message: dict[str, Any]) -> dict[str, Any] | None:
+    author = message.get("author") if isinstance(message.get("author"), dict) else {}
+    role = (
+        message.get("role")
+        or message.get("sender")
+        or message.get("speaker")
+        or author.get("role")
+        or author.get("name")
+    )
+    text = normalize_message_text(message).strip()
+    if not text:
+        return None
+    normalized: dict[str, Any] = {
+        "role": normalize_role(role),
+        "text": text,
+    }
+    timestamp = normalize_timestamp(message.get("created_at") or message.get("create_time") or message.get("timestamp"))
+    if timestamp:
+        normalized["created_at"] = timestamp
+    message_id = message.get("id")
+    if message_id is not None:
+        normalized["source_id"] = str(message_id)
+    return normalized
+
+
+def chatgpt_mapping_messages(conversation: dict[str, Any]) -> list[dict[str, Any]]:
+    mapping = conversation.get("mapping")
+    if not isinstance(mapping, dict):
+        return []
+    rows: list[tuple[float, int, dict[str, Any]]] = []
+    for index, node in enumerate(mapping.values()):
+        if not isinstance(node, dict) or not isinstance(node.get("message"), dict):
+            continue
+        message = node["message"]
+        sort_value = message.get("create_time")
+        try:
+            sort_key = float(sort_value)
+        except (TypeError, ValueError):
+            sort_key = float(index)
+        normalized = normalize_message(message)
+        if normalized:
+            rows.append((sort_key, index, normalized))
+    return [row[2] for row in sorted(rows, key=lambda row: (row[0], row[1]))]
+
+
+def conversation_messages(conversation: dict[str, Any]) -> tuple[list[dict[str, Any]], str]:
+    mapped = chatgpt_mapping_messages(conversation)
+    if mapped:
+        return mapped, "chatgpt_mapping"
+    for key in ("messages", "chat_messages", "turns"):
+        raw_messages = conversation.get(key)
+        if isinstance(raw_messages, list):
+            messages = [
+                normalized
+                for raw in raw_messages
+                if isinstance(raw, dict)
+                for normalized in [normalize_message(raw)]
+                if normalized
+            ]
+            return messages, key
+    return [], "unknown"
+
+
+def conversation_title(conversation: dict[str, Any], index: int) -> str:
+    for key in ("title", "name", "summary"):
+        value = conversation.get(key)
+        if isinstance(value, str) and value.strip():
+            return value.strip()
+    return f"Conversation {index + 1}"
+
+
+def collect_conversation_json(
+    path: Path,
+    source_name: str,
+    *,
+    include_content: bool = False,
+    max_messages: int = 2000,
+) -> tuple[list[dict[str, Any]], list[InputWarning]]:
+    warnings: list[InputWarning] = []
+    try:
+        payload = read_json(path)
+    except Exception as exc:
+        return [], [InputWarning(str(path), f"could not read JSON: {exc}")]
+
+    payload = payload_items(payload, ("conversations", "conversation", "items", "data"))
+    if isinstance(payload, dict):
+        payload = [payload]
+    if not isinstance(payload, list):
+        return [], [InputWarning(str(path), "expected a JSON list or an object containing a conversation list")]
+
+    items: list[dict[str, Any]] = []
+    for index, conversation in enumerate(payload):
+        if not isinstance(conversation, dict):
+            warnings.append(InputWarning(str(path), f"skipped conversation at index {index}: expected object"))
+            continue
+        messages, format_hint = conversation_messages(conversation)
+        if not messages:
+            warnings.append(InputWarning(str(path), f"skipped conversation at index {index}: no text messages found"))
+            continue
+        title = conversation_title(conversation, index)
+        source_id = conversation.get("id") or conversation.get("uuid") or conversation.get("conversation_id")
+        text_digest = sha256_text("\n".join(f"{msg['role']}:{msg['text']}" for msg in messages))
+        metadata: dict[str, Any] = {
+            "source_path": str(path),
+            "source_index": index,
+            "source_format": format_hint,
+            "message_count": len(messages),
+            "text_sha256": text_digest,
+            "content_included": False,
+        }
+        if source_id is not None:
+            metadata["source_id"] = str(source_id)
+        for key in ("create_time", "created_at", "update_time", "updated_at"):
+            timestamp = normalize_timestamp(conversation.get(key))
+            if timestamp:
+                metadata[f"source_{key}"] = timestamp
+        item: dict[str, Any] = {
+            "id": stable_id("conversation", source_name, path, source_id or index, text_digest),
+            "kind": "conversation_thread",
+            "title": title,
+            "source": source_name,
+            "metadata": metadata,
+        }
+        if include_content:
+            if len(messages) > max_messages:
+                warnings.append(
+                    InputWarning(
+                        str(path),
+                        f"skipped conversation content at index {index}: over {max_messages} messages",
+                    )
+                )
+            else:
+                item["messages"] = messages
+                item["metadata"]["content_included"] = True
+        items.append(item)
+    return items, warnings
+
+
+def parse_skill_frontmatter(text: str) -> dict[str, Any]:
+    if not text.startswith("---"):
+        return {}
+    end = text.find("\n---", 3)
+    if end < 0:
+        return {}
+    frontmatter: dict[str, Any] = {}
+    for line in text[3:end].strip().splitlines():
+        if not line.strip() or line.lstrip().startswith("#") or ":" not in line:
+            continue
+        key, value = line.split(":", 1)
+        key = key.strip()
+        value = value.strip().strip('"').strip("'")
+        if key:
+            frontmatter[key] = value
+    return frontmatter
+
+
+def collect_skill_dir(path: Path, source_name: str) -> tuple[list[dict[str, Any]], list[InputWarning]]:
+    warnings: list[InputWarning] = []
+    if path.is_symlink():
+        return [], [InputWarning(str(path), "skills path is a symlink; skipped")]
+    if not path.exists():
+        return [], [InputWarning(str(path), "skills directory does not exist")]
+    if not path.is_dir():
+        return [], [InputWarning(str(path), "skills path is not a directory")]
+
+    items: list[dict[str, Any]] = []
+    for skill_path in sorted(path.rglob("SKILL.md")):
+        if skill_path.is_symlink():
+            warnings.append(InputWarning(str(skill_path), "skipped symlinked skill file"))
+            continue
+        try:
+            text = skill_path.read_text(encoding="utf-8")
+        except Exception as exc:
+            warnings.append(InputWarning(str(skill_path), f"could not read skill: {exc}"))
+            continue
+        frontmatter = parse_skill_frontmatter(text)
+        name = str(frontmatter.get("name") or skill_path.parent.name).strip() or skill_path.parent.name
+        items.append(
+            {
+                "id": stable_id("skill", source_name, skill_path, sha256_text(text)),
+                "kind": "skill",
+                "name": name,
+                "category": str(frontmatter.get("category") or "general"),
+                "source": source_name,
+                "format": "SKILL.md",
+                "content": text,
+                "metadata": {
+                    "source_path": str(skill_path),
+                    "sha256": sha256_text(text),
+                    "frontmatter": frontmatter,
+                },
+            }
+        )
+    return items, warnings
+
+
+def looks_textual(path: Path) -> bool:
+    if path.suffix.lower() in TEXT_EXTENSIONS:
+        return True
+    guessed, _ = mimetypes.guess_type(str(path))
+    return bool(guessed and (guessed.startswith("text/") or guessed in {"application/json"}))
+
+
+def iter_archive_dir(path: Path) -> Iterable[Path | InputWarning]:
+    try:
+        children = sorted(path.iterdir())
+    except Exception as exc:
+        yield InputWarning(str(path), f"could not scan archive directory: {exc}")
+        return
+    for child in children:
+        if child.is_symlink():
+            yield InputWarning(str(child), "skipped symlinked archive path")
+            continue
+        if child.is_file():
+            yield child
+        elif child.is_dir():
+            yield from iter_archive_dir(child)
+
+
+def iter_archive_files(paths: Iterable[Path]) -> Iterable[Path | InputWarning]:
+    for path in paths:
+        if path.is_symlink():
+            yield InputWarning(str(path), "skipped symlinked archive path")
+            continue
+        if path.is_file():
+            yield path
+        elif path.is_dir():
+            yield from iter_archive_dir(path)
+
+
+def collect_archive_paths(
+    paths: list[Path],
+    source_name: str,
+    *,
+    include_content: bool = False,
+    max_bytes: int = 256_000,
+) -> tuple[list[dict[str, Any]], list[InputWarning]]:
+    warnings: list[InputWarning] = []
+    items: list[dict[str, Any]] = []
+    existing_paths: list[Path] = []
+    for path in paths:
+        if path.is_symlink():
+            warnings.append(InputWarning(str(path), "archive path is a symlink; skipped"))
+            continue
+        if not path.exists():
+            warnings.append(InputWarning(str(path), "archive path does not exist"))
+            continue
+        if not path.is_file() and not path.is_dir():
+            warnings.append(InputWarning(str(path), "archive path is not a file or directory"))
+            continue
+        existing_paths.append(path)
+
+    for entry in iter_archive_files(existing_paths):
+        if isinstance(entry, InputWarning):
+            warnings.append(entry)
+            continue
+        path = entry
+        if not looks_textual(path):
+            warnings.append(InputWarning(str(path), "skipped non-text archive file"))
+            continue
+        try:
+            st = path.stat()
+        except Exception as exc:
+            warnings.append(InputWarning(str(path), f"could not stat archive file: {exc}"))
+            continue
+        size = st.st_size
+        try:
+            file_hash = sha256_path(path)
+        except Exception as exc:
+            warnings.append(InputWarning(str(path), f"could not hash archive file: {exc}"))
+            continue
+        if include_content and size > max_bytes:
+            warnings.append(InputWarning(str(path), f"skipped archive content over {max_bytes} bytes"))
+        archive_item: dict[str, Any] = {
+            "id": stable_id("archive", source_name, path, file_hash),
+            "kind": "archive_document",
+            "title": path.name,
+            "source": source_name,
+            "metadata": {
+                "source_path": str(path),
+                "size_bytes": size,
+                "sha256": file_hash,
+            },
+        }
+        if include_content and size <= max_bytes:
+            try:
+                archive_item["content"] = path.read_text(encoding="utf-8")
+            except UnicodeDecodeError:
+                archive_item["content"] = path.read_text(encoding="utf-8", errors="replace")
+                archive_item["metadata"]["decoded_with_replacement"] = True
+        items.append(archive_item)
+    return items, warnings
+
+
+def build_manifest(args) -> dict[str, Any]:
+    warnings: list[InputWarning] = []
+    items: list[dict[str, Any]] = []
+
+    for path in args.memory_json:
+        collected, got_warnings = collect_memory_json(path, args.source_name)
+        items.extend(collected)
+        warnings.extend(got_warnings)
+
+    for path in args.skills_dir:
+        collected, got_warnings = collect_skill_dir(path, args.source_name)
+        items.extend(collected)
+        warnings.extend(got_warnings)
+
+    for path in args.conversation_json:
+        collected, got_warnings = collect_conversation_json(
+            path,
+            args.source_name,
+            include_content=args.include_conversation_content,
+            max_messages=args.max_conversation_messages,
+        )
+        items.extend(collected)
+        warnings.extend(got_warnings)
+
+    if args.archive:
+        collected, got_warnings = collect_archive_paths(
+            args.archive,
+            args.source_name,
+            include_content=args.include_archive_content,
+            max_bytes=args.max_archive_bytes,
+        )
+        items.extend(collected)
+        warnings.extend(got_warnings)
+
+    counts: dict[str, int] = {}
+    for item in items:
+        counts[item["kind"]] = counts.get(item["kind"], 0) + 1
+
+    return {
+        "schema_version": SCHEMA_VERSION,
+        "generated_at": utc_now_iso(),
+        "source": {
+            "name": args.source_name,
+            "kind": args.source_kind,
+        },
+        "summary": {
+            "item_count": len(items),
+            "counts_by_kind": counts,
+            "warning_count": len(warnings),
+        },
+        "items": items,
+        "warnings": [{"path": warning.path, "message": warning.message} for warning in warnings],
+    }
+
+
+def parse_args(argv: list[str] | None = None):
+    parser = argparse.ArgumentParser(description="Build a neutral Odysseus agent migration manifest.")
+    parser.add_argument("--source-name", default="agent-export", help="Human-readable source name.")
+    parser.add_argument("--source-kind", default="generic", help="Source adapter kind, e.g. generic, openclaw, hermes.")
+    parser.add_argument(
+        "--memory-json",
+        action="append",
+        type=Path,
+        default=[],
+        help="JSON memory export. May be a list, or an object containing memories/items/data.",
+    )
+    parser.add_argument(
+        "--skills-dir",
+        action="append",
+        type=Path,
+        default=[],
+        help="Directory containing SKILL.md files. Scanned recursively.",
+    )
+    parser.add_argument(
+        "--archive",
+        action="append",
+        type=Path,
+        default=[],
+        help="Text/Markdown/JSON file or directory to preserve as archive documents.",
+    )
+    parser.add_argument(
+        "--conversation-json",
+        action="append",
+        type=Path,
+        default=[],
+        help="Conversation export JSON. Supports generic message lists and ChatGPT-style conversations.json.",
+    )
+    parser.add_argument(
+        "--include-archive-content",
+        action="store_true",
+        help="Embed archive document content in the manifest. By default only metadata is included.",
+    )
+    parser.add_argument(
+        "--max-archive-bytes",
+        type=int,
+        default=256_000,
+        help="Maximum bytes to embed per archive file when --include-archive-content is used.",
+    )
+    parser.add_argument(
+        "--include-conversation-content",
+        action="store_true",
+        help="Embed normalized conversation messages. By default only thread metadata is included.",
+    )
+    parser.add_argument(
+        "--max-conversation-messages",
+        type=int,
+        default=2000,
+        help="Maximum messages to embed per conversation when --include-conversation-content is used.",
+    )
+    parser.add_argument("--output", type=Path, help="Write manifest JSON to this path instead of stdout.")
+    parser.add_argument("--compact", action="store_true", help="Write compact JSON without indentation.")
+    return parser.parse_args(argv)
+
+
+def main(argv: list[str] | None = None) -> int:
+    args = parse_args(argv)
+    manifest = build_manifest(args)
+    text = json.dumps(manifest, ensure_ascii=False, sort_keys=True, separators=(",", ":")) if args.compact else (
+        json.dumps(manifest, ensure_ascii=False, indent=2, sort_keys=True) + "\n"
+    )
+    if args.output:
+        args.output.parent.mkdir(parents=True, exist_ok=True)
+        args.output.write_text(text, encoding="utf-8")
+    else:
+        sys.stdout.write(text)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
@@ -0,0 +1,340 @@
+import importlib.util
+import json
+import sys
+from pathlib import Path
+
+
+ROOT = Path(__file__).resolve().parents[1]
+SCRIPT_PATH = ROOT / "scripts" / "agent_migration_manifest.py"
+
+
+def load_module():
+    spec = importlib.util.spec_from_file_location("agent_migration_manifest", SCRIPT_PATH)
+    module = importlib.util.module_from_spec(spec)
+    sys.modules[spec.name] = module
+    spec.loader.exec_module(module)
+    return module
+
+
+def test_collect_memory_json_accepts_strings_and_objects(tmp_path):
+    migration = load_module()
+    path = tmp_path / "memories.json"
+    path.write_text(
+        json.dumps(
+            [
+                "Pacey prefers GLM for routine coding.",
+                {"text": "Odysseus runs on a self-hosted machine.", "category": "project", "source": "manual"},
+                {"content": "Duplicate source keys still work.", "category": "fact"},
+            ]
+        ),
+        encoding="utf-8",
+    )
+
+    items, warnings = migration.collect_memory_json(path, "example-agent")
+
+    assert [item["kind"] for item in items] == ["memory", "memory", "memory"]
+    assert items[0]["category"] == "fact"
+    assert items[1]["category"] == "project"
+    assert items[1]["source"] == "manual"
+    assert warnings == []
+
+
+def test_collect_memory_json_deduplicates_exact_text(tmp_path):
+    migration = load_module()
+    path = tmp_path / "memories.json"
+    path.write_text(json.dumps(["Same memory", {"text": "Same memory"}]), encoding="utf-8")
+
+    items, warnings = migration.collect_memory_json(path, "example-agent")
+
+    assert len(items) == 1
+    assert warnings[0].message == "skipped duplicate memory at index 1"
+
+
+def test_collect_skill_dir_scans_skill_markdown(tmp_path):
+    migration = load_module()
+    skill_path = tmp_path / "skills" / "dev" / "git-helper" / "SKILL.md"
+    skill_path.parent.mkdir(parents=True)
+    skill_path.write_text(
+        """---
+name: git-helper
+category: dev
+---
+
+## When to Use
+Use for focused git checks.
+""",
+        encoding="utf-8",
+    )
+
+    items, warnings = migration.collect_skill_dir(tmp_path / "skills", "example-agent")
+
+    assert len(items) == 1
+    assert warnings == []
+    assert items[0]["kind"] == "skill"
+    assert items[0]["name"] == "git-helper"
+    assert items[0]["category"] == "dev"
+    assert items[0]["format"] == "SKILL.md"
+    assert "## When to Use" in items[0]["content"]
+
+
+def test_collect_skill_dir_skips_symlinked_skill_markdown(tmp_path):
+    migration = load_module()
+    outside = tmp_path / "outside.md"
+    outside.write_text("private skill content", encoding="utf-8")
+    skill_path = tmp_path / "skills" / "bad" / "SKILL.md"
+    skill_path.parent.mkdir(parents=True)
+    skill_path.symlink_to(outside)
+
+    items, warnings = migration.collect_skill_dir(tmp_path / "skills", "example-agent")
+
+    assert items == []
+    assert warnings[0].message == "skipped symlinked skill file"
+
+
+def test_collect_skill_dir_skips_symlinked_root(tmp_path):
+    migration = load_module()
+    real_skills = tmp_path / "real-skills"
+    real_skills.mkdir()
+    linked_skills = tmp_path / "skills"
+    linked_skills.symlink_to(real_skills, target_is_directory=True)
+
+    items, warnings = migration.collect_skill_dir(linked_skills, "example-agent")
+
+    assert items == []
+    assert warnings[0].message == "skills path is a symlink; skipped"
+
+
+def test_archive_content_is_optional(tmp_path):
+    migration = load_module()
+    archive = tmp_path / "notes.md"
+    archive.write_text("# Notes\n\nUseful context.", encoding="utf-8")
+
+    metadata_only, _ = migration.collect_archive_paths([archive], "example-agent")
+    with_content, _ = migration.collect_archive_paths([archive], "example-agent", include_content=True)
+
+    assert metadata_only[0]["kind"] == "archive_document"
+    assert "content" not in metadata_only[0]
+    assert with_content[0]["content"].startswith("# Notes")
+
+
+def test_archive_skips_symlinked_file(tmp_path):
+    migration = load_module()
+    outside = tmp_path / "outside.md"
+    outside.write_text("private archive content", encoding="utf-8")
+    archive_dir = tmp_path / "archive"
+    archive_dir.mkdir()
+    linked_file = archive_dir / "leak.md"
+    linked_file.symlink_to(outside)
+
+    items, warnings = migration.collect_archive_paths([archive_dir], "example-agent", include_content=True)
+
+    assert items == []
+    assert warnings[0].message == "skipped symlinked archive path"
+
+
+def test_archive_skips_symlinked_root(tmp_path):
+    migration = load_module()
+    archive = tmp_path / "notes.md"
+    archive.write_text("# Notes\n\nUseful context.", encoding="utf-8")
+    linked_archive = tmp_path / "linked-notes.md"
+    linked_archive.symlink_to(archive)
+
+    items, warnings = migration.collect_archive_paths([linked_archive], "example-agent", include_content=True)
+
+    assert items == []
+    assert warnings[0].message == "archive path is a symlink; skipped"
+
+
+def test_conversation_json_imports_generic_threads_metadata_only(tmp_path):
+    migration = load_module()
+    path = tmp_path / "conversations.json"
+    path.write_text(
+        json.dumps(
+            {
+                "conversations": [
+                    {
+                        "id": "thread-1",
+                        "title": "Project plan",
+                        "created_at": "2026-06-01T00:00:00Z",
+                        "messages": [
+                            {"role": "user", "content": "Can we design this?"},
+                            {"role": "assistant", "content": "Yes, start with a narrow slice."},
+                        ],
+                    }
+                ]
+            }
+        ),
+        encoding="utf-8",
+    )
+
+    items, warnings = migration.collect_conversation_json(path, "example-agent")
+
+    assert warnings == []
+    assert len(items) == 1
+    assert items[0]["kind"] == "conversation_thread"
+    assert items[0]["title"] == "Project plan"
+    assert items[0]["metadata"]["source_id"] == "thread-1"
+    assert items[0]["metadata"]["message_count"] == 2
+    assert items[0]["metadata"]["content_included"] is False
+    assert "messages" not in items[0]
+
+
+def test_conversation_json_can_embed_generic_thread_content(tmp_path):
+    migration = load_module()
+    path = tmp_path / "conversations.json"
+    path.write_text(
+        json.dumps(
+            [
+                {
+                    "title": "Preference",
+                    "messages": [
+                        {"sender": "human", "content": [{"type": "text", "text": "Use terse replies."}]},
+                        {"sender": "ai", "text": "Noted."},
+                    ],
+                }
+            ]
+        ),
+        encoding="utf-8",
+    )
+
+    items, warnings = migration.collect_conversation_json(path, "example-agent", include_content=True)
+
+    assert warnings == []
+    assert items[0]["metadata"]["content_included"] is True
+    assert items[0]["messages"] == [
+        {"role": "user", "text": "Use terse replies."},
+        {"role": "assistant", "text": "Noted."},
+    ]
+
+
+def test_conversation_json_imports_chatgpt_mapping_ordered_by_time(tmp_path):
+    migration = load_module()
+    path = tmp_path / "conversations.json"
+    path.write_text(
+        json.dumps(
+            [
+                {
+                    "id": "chatgpt-thread",
+                    "title": "ChatGPT export",
+                    "mapping": {
+                        "b": {
+                            "message": {
+                                "id": "m2",
+                                "create_time": 20,
+                                "author": {"role": "assistant"},
+                                "content": {"content_type": "text", "parts": ["Second"]},
+                            }
+                        },
+                        "a": {
+                            "message": {
+                                "id": "m1",
+                                "create_time": 10,
+                                "author": {"role": "user"},
+                                "content": {"content_type": "text", "parts": ["First"]},
+                            }
+                        },
+                    },
+                }
+            ]
+        ),
+        encoding="utf-8",
+    )
+
+    items, warnings = migration.collect_conversation_json(path, "chatgpt", include_content=True)
+
+    assert warnings == []
+    assert items[0]["metadata"]["source_format"] == "chatgpt_mapping"
+    assert items[0]["messages"] == [
+        {"role": "user", "text": "First", "created_at": "1970-01-01T00:00:10Z", "source_id": "m1"},
+        {"role": "assistant", "text": "Second", "created_at": "1970-01-01T00:00:20Z", "source_id": "m2"},
+    ]
+
+
+def test_conversation_content_respects_message_limit(tmp_path):
+    migration = load_module()
+    path = tmp_path / "conversations.json"
+    path.write_text(
+        json.dumps(
+            [
+                {
+                    "title": "Long thread",
+                    "messages": [
+                        {"role": "user", "content": "one"},
+                        {"role": "assistant", "content": "two"},
+                    ],
+                }
+            ]
+        ),
+        encoding="utf-8",
+    )
+
+    items, warnings = migration.collect_conversation_json(
+        path,
+        "example-agent",
+        include_content=True,
+        max_messages=1,
+    )
+
+    assert "messages" not in items[0]
+    assert items[0]["metadata"]["content_included"] is False
+    assert warnings[0].message == "skipped conversation content at index 0: over 1 messages"
+
+
+def test_archive_missing_path_warns(tmp_path):
+    migration = load_module()
+    missing = tmp_path / "missing"
+
+    items, warnings = migration.collect_archive_paths([missing], "example-agent")
+
+    assert items == []
+    assert warnings[0].message == "archive path does not exist"
+
+
+def test_main_writes_manifest_with_conversation_thread(tmp_path):
+    migration = load_module()
+    conversation_path = tmp_path / "conversations.json"
+    output_path = tmp_path / "manifest.json"
+    conversation_path.write_text(
+        json.dumps([{"title": "A thread", "messages": [{"role": "user", "content": "hello"}]}]),
+        encoding="utf-8",
+    )
+
+    exit_code = migration.main(
+        [
+            "--source-name",
+            "example-agent",
+            "--conversation-json",
+            str(conversation_path),
+            "--output",
+            str(output_path),
+        ]
+    )
+    manifest = json.loads(output_path.read_text(encoding="utf-8"))
+
+    assert exit_code == 0
+    assert manifest["summary"]["counts_by_kind"] == {"conversation_thread": 1}
+    assert manifest["items"][0]["title"] == "A thread"
+
+
+def test_main_writes_manifest(tmp_path):
+    migration = load_module()
+    memory_path = tmp_path / "memories.json"
+    output_path = tmp_path / "manifest.json"
+    memory_path.write_text(json.dumps([{"text": "A useful fact", "category": "fact"}]), encoding="utf-8")
+
+    exit_code = migration.main(
+        [
+            "--source-name",
+            "example-agent",
+            "--memory-json",
+            str(memory_path),
+            "--output",
+            str(output_path),
+        ]
+    )
+    manifest = json.loads(output_path.read_text(encoding="utf-8"))
+
+    assert exit_code == 0
+    assert manifest["schema_version"] == "agent-migration.v1"
+    assert manifest["summary"]["counts_by_kind"] == {"memory": 1}
+    assert manifest["items"][0]["text"] == "A useful fact"