# AnythingLLM ↔ Kompanion Memory Compatibility Evaluation
## Current Kompanion Memory Stack
- **Primary store**: Postgres 14+ with `pgvector` ≥ 0.6, accessed via the C++ `PgDal` implementation (`embeddings`, `memory_chunks`, `memory_items`, `namespaces` tables). Each embedding row keeps `id`, `chunk_id`, `model`, `dim`, `vector`, and a `normalized` flag.
- **Chunking & metadata**: Items are broken into chunks; embeddings attach to chunks via `chunk_id`. Item metadata lives as structured JSON on `memory_items` with tags, TTL, and revision controls.
- **Namespace model**: Logical scopes (e.g. `project:user:thread`) are first-class rows. Retrieval joins embeddings back to items to recover text + metadata.
- **Fallback mode**: Local-only path uses SQLite plus a FAISS sidecar (see `docs/MEMORY.md`) but the production design assumes Postgres.
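The columns listed above imply roughly the following shape for the `embeddings` table; the authoritative DDL lives in the C++ `PgDal` implementation, so the types and constraints here are illustrative assumptions only:

```sql
-- Approximate shape of Kompanion's embeddings table, inferred from the
-- column list above. Types, key names, and the FK target are assumptions.
CREATE TABLE IF NOT EXISTS embeddings (
  id         UUID PRIMARY KEY,
  chunk_id   UUID REFERENCES memory_chunks(id),  -- ties the vector to its chunk
  model      TEXT    NOT NULL,                   -- embedding model recorded per row
  dim        INTEGER NOT NULL,                   -- dimension recorded per row
  vector     vector,                             -- pgvector column
  normalized BOOLEAN DEFAULT FALSE
);
```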
## AnythingLLM Vector Stack (PGVector path)
- Supports multiple vector backends; the overlapping option is `pgvector` (`server/utils/vectorDbProviders/pgvector/index.js`).
- Expects a single table (default `anythingllm_vectors`) shaped as `{ id UUID, namespace TEXT, embedding vector(n), metadata JSONB, created_at TIMESTAMP }`.
- Metadata is stored inline as JSONB; namespace strings are arbitrary workspace slugs. The embed dimension is fixed per table at creation time.
- The Node.js runtime manages chunking, caching, and namespace hygiene, and assumes CRUD against that flat table.
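For reference, the flat table AnythingLLM expects looks roughly like the following. The provider normally creates this table itself, and `1536` is only an example dimension; the real value is fixed by whichever embedding model is configured:

```sql
-- Sketch of the default AnythingLLM pgvector table shape described above.
-- The dimension (1536) is an example; it is frozen at table creation.
CREATE TABLE IF NOT EXISTS anythingllm_vectors (
  id         UUID PRIMARY KEY,
  namespace  TEXT NOT NULL,       -- workspace slug
  embedding  vector(1536),
  metadata   JSONB,               -- document refs, source ids, model choice
  created_at TIMESTAMP DEFAULT NOW()
);
```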
## Key Differences
- **Schema shape**: Kompanion splits data across normalized tables with foreign keys; AnythingLLM uses a single wide table per vector store. Kompanion's `embeddings` table currently lacks a JSONB metadata column and instead relies on joins.
- **Identifiers**: Kompanion embeddings key off `chunk_id` (uuid/text) plus `model`; AnythingLLM expects a unique `id` per stored chunk and does not expose the underlying chunk relationship.
- **Metadata transport**: Kompanion keeps tags/TTL in `memory_items` (JSON) and chunk text in `memory_chunks`. AnythingLLM packs metadata (including document references and source identifiers) directly into the vector row's JSONB.
- **Lifecycle hooks**: Kompanion enforces sensitivity flags before embedding; AnythingLLM assumes documents are already filtered and will happily ingest any chunk. Deletion flows differ (Kompanion uses soft-delete semantics; AnythingLLM issues hard deletes by namespace/document).
- **Embeddings contract**: Kompanion records embedding model and dimension per row; AnythingLLM fixes dimension at table creation and stores model choice in JSON metadata.
## Compatibility Plan
1. **Agree on a shared pgvector table**
- Create (or reuse) a Postgres schema reachable by both systems.
- Define a composite view or materialized view that maps `embeddings` + `memory_chunks` + `memory_items` into the AnythingLLM layout (columns: `id`, `namespace`, `embedding`, `metadata`, `created_at`).
- Add a JSONB projection that captures Kompanion metadata (`chunk_id`, `item_id`, `tags`, `model`, `revision`, sensitivity flags). This becomes the `metadata` field for AnythingLLM.
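A first sketch of that projection view might look as follows. The join keys (`item_id`, `namespace_id`), the `metadata` column on `memory_items`, and the JSON field names are assumptions about the Kompanion schema, not verified column names:

```sql
-- Sketch of the projection view mapping Kompanion's normalized tables into
-- the flat AnythingLLM layout. Join keys and JSON paths are assumptions.
CREATE OR REPLACE VIEW anythingllm_vectors AS
SELECT
  e.id,
  n.name   AS namespace,                    -- e.g. 'project:user:thread'
  e.vector AS embedding,
  jsonb_build_object(
    'chunk_id',    e.chunk_id,
    'item_id',     c.item_id,
    'model',       e.model,
    'revision',    i.revision,
    'tags',        i.metadata -> 'tags',        -- assumed JSON key
    'sensitivity', i.metadata -> 'sensitivity', -- assumed JSON key
    'text',        c.text
  )        AS metadata,
  i.created_at
FROM embeddings    e
JOIN memory_chunks c ON c.id = e.chunk_id
JOIN memory_items  i ON i.id = c.item_id
JOIN namespaces    n ON n.id = i.namespace_id;
```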
2. **Write a synchronization job**
- Option A: database triggers on `embeddings` to insert/update a mirror row in `anythingllm_vectors`.
- Option B: periodic worker that scans for new/updated embeddings (`revision` or `updated_at`) and upserts into the shared table through SQL.
- Ensure deletions (soft or hard) propagate by expiring mirrored rows or respecting a `deleted_at` flag in metadata (AnythingLLM supports document purges via namespace filtering).
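The core of Option B can be sketched as a pure mapping from a joined Kompanion row to the flat mirror record; a real worker would wrap this in a psycopg upsert loop keyed on `revision`/`updated_at`. All field names here are assumptions based on the schema notes above:

```python
# Sketch of Option B's row transformation. Input keys mirror the assumed
# joined shape of embeddings + memory_chunks + memory_items; the soft-delete
# marker travels inside metadata so AnythingLLM-side filters can honor it.
import json


def to_mirror_row(kompanion_row: dict) -> dict:
    """Map a joined Kompanion row to the flat anythingllm_vectors shape."""
    metadata = {
        "chunk_id": kompanion_row["chunk_id"],
        "item_id": kompanion_row["item_id"],
        "model": kompanion_row["model"],
        "revision": kompanion_row["revision"],
        "tags": kompanion_row.get("tags", []),
        # Soft deletes propagate as a metadata flag rather than a hard delete.
        "deleted_at": kompanion_row.get("deleted_at"),
    }
    return {
        "id": kompanion_row["embedding_id"],
        "namespace": kompanion_row["namespace"],
        "embedding": kompanion_row["vector"],
        "metadata": json.dumps(metadata),
    }
```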
3. **Normalize namespace semantics**
- Reuse Kompanion's namespace string as the AnythingLLM workspace slug.
- Document mapping rules (e.g. replace `:` with `_` if AnythingLLM slugs disallow colons).
- Provide a compatibility map in metadata so both systems resolve back to Kompanion's canonical namespace identity.
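The mapping rule above can be sketched as a small helper. The colon-to-underscore substitution is the convention proposed here, not something either project mandates today, and the trailing cleanup is a defensive assumption about what AnythingLLM slugs accept:

```python
# Hypothetical slug normalization for Kompanion namespaces. The rules
# (lowercase, ':' -> '_', collapse other unsafe runs to '-') are assumptions.
import re


def to_workspace_slug(namespace: str) -> str:
    """Map a Kompanion namespace like 'project:user:thread' to a slug."""
    slug = namespace.lower().replace(":", "_")
    # Collapse any remaining slug-unsafe characters, defensively.
    return re.sub(r"[^a-z0-9_-]+", "-", slug)
```

The inverse mapping should not be computed from the slug; instead the canonical namespace string rides along in the mirrored metadata, as step 3 proposes.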
4. **Unify embedding models**
- Select a shared embedding model (e.g., `text-embedding-3-large` or local Nomic).
- Record the chosen model in the mirrored metadata and enforce dimension on the `anythingllm_vectors` table creation.
- Update Kompanion's embedding pipeline to fail fast if the produced dimension differs from the table's fixed size.
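The fail-fast check is a one-liner wherever embeddings leave the pipeline; `expected_dim` would come from shared configuration (the doc's actual pipeline is C++, so this is a language-agnostic sketch):

```python
# Guard against dimension drift between the embedding model and the fixed
# vector(n) column. expected_dim is assumed to come from shared config.
def ensure_dimension(vector: list[float], expected_dim: int) -> list[float]:
    """Raise before insert if the embedding does not fit the shared table."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has dim {len(vector)}, table expects {expected_dim}"
        )
    return vector
```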
5. **Expose retrieval APIs**
- For Kompanion → AnythingLLM: implement a thin adapter that reads from the shared table instead of internal joins when responding to AnythingLLM requests (or simply let AnythingLLM talk directly to Postgres).
- For AnythingLLM → Kompanion: ensure the metadata payload includes the necessary identifiers (`item_id`, `chunk_id`) so Kompanion can resolve back to full context.
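On the Kompanion side, resolving a returned row back to full context reduces to pulling those identifiers out of the metadata payload; this sketch assumes the JSONB keys proposed in step 1:

```python
# Hypothetical resolver for the AnythingLLM -> Kompanion direction. Assumes
# the mirrored metadata carries 'item_id' and 'chunk_id' as proposed above.
def resolve_kompanion_refs(metadata: dict) -> tuple[str, str]:
    """Extract (item_id, chunk_id) so Kompanion can rejoin full context."""
    try:
        return metadata["item_id"], metadata["chunk_id"]
    except KeyError as missing:
        raise ValueError(f"mirrored metadata lacks {missing}; cannot resolve")
```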
6. **Security & sensitivity handling**
- Extend the metadata JSON to include Kompanion's sensitivity/embeddable flags.
- Patch AnythingLLM ingestion to respect a `sensitivity` key (skip or mask secrets) before inserting into its table, or filter at the view level so secret rows never surface.
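The view-level variant could look like the following; the `sensitivity` key and its value set are assumptions that would need to match whatever flags Kompanion actually records:

```sql
-- One way to enforce filtering at the view level so sensitive rows never
-- surface to AnythingLLM. Key name and values ('secret', 'private') are
-- assumptions about Kompanion's sensitivity flags.
CREATE OR REPLACE VIEW anythingllm_vectors_safe AS
SELECT *
FROM anythingllm_vectors
WHERE COALESCE(metadata ->> 'sensitivity', 'none') NOT IN ('secret', 'private');
```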
7. **Validation & tooling**
- Add a migration checklist covering table creation, index alignment (`USING ivfflat`), and permission grants for the AnythingLLM service role.
- Create integration tests that:
1. Upsert an item in Kompanion.
2. Confirm mirrored row appears in `anythingllm_vectors`.
3. Query through AnythingLLM API and verify the same chunk text + metadata round-trips.
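The index-alignment and permissions items in the checklist above might be drafted roughly as follows; `vector_cosine_ops`, the `lists` count, and the service role name are all assumptions to be settled during migration:

```sql
-- Checklist sketch: ivfflat index aligned with cosine search, plus grants
-- for the AnythingLLM service role. Operator class, lists count, and the
-- role name 'anythingllm_service' are hypothetical choices.
CREATE INDEX IF NOT EXISTS anythingllm_vectors_embedding_idx
  ON anythingllm_vectors USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
GRANT SELECT, INSERT, UPDATE, DELETE ON anythingllm_vectors TO anythingllm_service;
```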
## Near-Term Tasks
1. Draft SQL for the projection view/materialized view, including JSONB assembly.
2. Prototype a synchronization worker (Python or C++) that mirrors embeddings into the AnythingLLM table.
3. Define namespace slug normalization rules and document them in both repos.
4. Coordinate on embedding model selection and update configuration in both stacks.
5. Add automated compatibility tests to CI pipelines of both projects.