doc: add anything-llm compatibility plan
parent db01fb0485
commit a84af5d464
# AnythingLLM ↔ Kompanion Memory Compatibility Evaluation

## Current Kompanion Memory Stack

- **Primary store**: Postgres 14+ with `pgvector` ≥ 0.6, accessed via the C++ `PgDal` implementation (`embeddings`, `memory_chunks`, `memory_items`, `namespaces` tables). Each embedding row keeps `id`, `chunk_id`, `model`, `dim`, `vector`, and a `normalized` flag.
- **Chunking & metadata**: Items are broken into chunks; embeddings attach to chunks via `chunk_id`. Item metadata lives as structured JSON on `memory_items` with tags, TTL, and revision controls.
- **Namespace model**: Logical scopes (e.g. `project:user:thread`) are first-class rows. Retrieval joins embeddings back to items to recover text + metadata.
- **Fallback mode**: The local-only path uses SQLite plus a FAISS sidecar (see `docs/MEMORY.md`), but the production design assumes Postgres.

## AnythingLLM Vector Stack (PGVector path)

- Supports multiple vector backends; the overlapping option is `pgvector` (`server/utils/vectorDbProviders/pgvector/index.js`).
- Expects a single table (default `anythingllm_vectors`) shaped as `{ id UUID, namespace TEXT, embedding vector(n), metadata JSONB, created_at TIMESTAMP }`.
- Metadata is stored inline as JSONB; namespace strings are arbitrary workspace slugs. The embedding dimension is fixed per table at creation time.
- The Node.js runtime manages chunking, caching, and namespace hygiene, and assumes CRUD against that flat table.
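
As a rough illustration of the flat row shape listed above, the following Python sketch models one row as a dataclass. Only the five column names come from the layout quoted above; the class name `VectorRow`, the defaults, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any
from uuid import UUID, uuid4


@dataclass
class VectorRow:
    """Hypothetical in-memory mirror of one `anythingllm_vectors` row."""

    namespace: str                                            # TEXT, workspace slug
    embedding: list[float]                                    # vector(n), n fixed per table
    metadata: dict[str, Any] = field(default_factory=dict)    # JSONB
    id: UUID = field(default_factory=uuid4)                   # UUID key
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)    # TIMESTAMP
    )


# Sample values are made up; they only show which field carries what.
row = VectorRow(
    namespace="project_user_thread",
    embedding=[0.1, 0.2, 0.3],
    metadata={"model": "example-model"},
)
```
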

## Key Differences

- **Schema shape**: Kompanion splits data across normalized tables with foreign keys; AnythingLLM uses a single wide table per vector store. Kompanion’s embeddings currently lack a JSONB metadata column and instead rely on joins.
- **Identifiers**: Kompanion embeddings key off `chunk_id` (uuid/text) plus `model`; AnythingLLM expects a unique `id` per stored chunk and does not expose the underlying chunk relationship.
- **Metadata transport**: Kompanion keeps tags/TTL in `memory_items` (JSON) and chunk text in `memory_chunks`. AnythingLLM packs metadata (including document references and source identifiers) directly into the vector row’s JSONB.
- **Lifecycle hooks**: Kompanion enforces sensitivity flags before embedding; AnythingLLM assumes documents are already filtered and will happily ingest any chunk. Deletion flows differ (Kompanion uses soft-delete semantics; AnythingLLM issues hard deletes by namespace/document).
- **Embeddings contract**: Kompanion records embedding model and dimension per row; AnythingLLM fixes dimension at table creation and stores model choice in JSON metadata.

## Compatibility Plan

1. **Agree on a shared pgvector table**
   - Create (or reuse) a Postgres schema reachable by both systems.
   - Define a composite view or materialized view that maps `embeddings` + `memory_chunks` + `memory_items` into the `anythingllm_vectors` layout (columns: `id`, `namespace`, `embedding`, `metadata`, `created_at`).
   - Add a JSONB projection that captures Kompanion metadata (`chunk_id`, `item_id`, `tags`, `model`, `revision`, sensitivity flags). This becomes the `metadata` field for AnythingLLM.
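
The projection in this step can be sketched in Python before committing to SQL. This is a minimal sketch, not the actual view definition: the function name `project_row` is hypothetical, and the keys `item_id` on chunks, `sensitivity`, `text`, and `created_at` are assumed column names that the source schema may spell differently.

```python
import json
from typing import Any


def project_row(
    embedding: dict[str, Any],
    chunk: dict[str, Any],
    item: dict[str, Any],
    namespace: str,
) -> dict[str, Any]:
    """Flatten one embedding plus its chunk and item into the mirror layout."""
    metadata = {
        "chunk_id": embedding["chunk_id"],
        "item_id": chunk["item_id"],              # assumed FK column on memory_chunks
        "tags": item.get("tags", []),
        "model": embedding["model"],
        "revision": item.get("revision"),
        "sensitivity": item.get("sensitivity"),   # assumed key name
        "text": chunk.get("text"),                # assumed key for the chunk text
    }
    return {
        "id": embedding["id"],
        "namespace": namespace,
        "embedding": embedding["vector"],
        "metadata": json.dumps(metadata, sort_keys=True),  # JSONB payload
        "created_at": item.get("created_at"),     # assumed timestamp column
    }


# Usage with made-up rows.
mirror = project_row(
    {"id": "e1", "chunk_id": "c1", "model": "m", "dim": 2, "vector": [0.0, 1.0]},
    {"id": "c1", "item_id": "i1", "text": "hello"},
    {"id": "i1", "tags": ["demo"], "revision": 3, "sensitivity": "public"},
    "project:user:thread",
)
```

Keeping `chunk_id` and `item_id` in the payload is what later lets Kompanion resolve a hit back to its own tables.
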

2. **Write a synchronization job**
   - Option A: database triggers on `embeddings` to insert/update a mirror row in `anythingllm_vectors`.
   - Option B: periodic worker that scans for new/updated embeddings (`revision` or `updated_at`) and upserts into the shared table through SQL.
   - Ensure deletions (soft or hard) propagate by expiring mirrored rows or respecting a `deleted_at` flag in metadata (AnythingLLM supports document purges via namespace filtering).
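
The decision logic of Option B can be sketched without a database. The sketch below assumes each source row carries `id`, `revision`, and an optional `deleted_at`, and that the worker tracks which revision it last mirrored per id. The function name `plan_sync` and that bookkeeping dictionary are hypothetical; executing the resulting upserts and deletes against Postgres is left out.

```python
from typing import Any


def plan_sync(
    source_rows: list[dict[str, Any]],
    mirrored_revisions: dict[str, int],
) -> tuple[list[dict[str, Any]], list[str]]:
    """Return (rows to upsert, ids to delete) for one worker pass."""
    upserts: list[dict[str, Any]] = []
    deletes: list[str] = []
    for row in source_rows:
        row_id = row["id"]
        if row.get("deleted_at") is not None:
            # Soft-deleted upstream: drop the mirror row if one exists.
            if row_id in mirrored_revisions:
                deletes.append(row_id)
        elif mirrored_revisions.get(row_id) != row["revision"]:
            # Never mirrored, or mirrored at a different revision.
            upserts.append(row)
    return upserts, deletes


# Usage with made-up rows: e1 changed, e2 is current, e3 was soft-deleted.
source = [
    {"id": "e1", "revision": 2, "deleted_at": None},
    {"id": "e2", "revision": 1, "deleted_at": None},
    {"id": "e3", "revision": 1, "deleted_at": "2024-01-01"},
]
to_upsert, to_delete = plan_sync(source, {"e1": 1, "e2": 1, "e3": 1})
```
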

3. **Normalize namespace semantics**
   - Reuse Kompanion’s namespace string as the AnythingLLM workspace slug.
   - Document mapping rules (e.g. replace `:` with `_` if AnythingLLM slugs disallow colons).
   - Provide a compatibility map in metadata so both systems resolve back to Kompanion’s canonical namespace identity.
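
A minimal sketch of the mapping rule in this step, assuming the only transformation needed is the `:` to `_` replacement given as an example. The function names and the metadata keys `kompanion_namespace` and `workspace_slug` are hypothetical. Because the replacement is lossy, the sketch also includes a collision check.

```python
from collections import defaultdict


def to_workspace_slug(namespace: str) -> str:
    """Map a Kompanion namespace to a colon-free slug (assumed rule)."""
    return namespace.replace(":", "_")


def namespace_metadata(namespace: str) -> dict[str, str]:
    """Keep the canonical namespace next to the derived slug."""
    return {
        "kompanion_namespace": namespace,
        "workspace_slug": to_workspace_slug(namespace),
    }


def find_slug_collisions(namespaces: list[str]) -> dict[str, list[str]]:
    """Report slugs that more than one distinct namespace maps onto."""
    by_slug: dict[str, list[str]] = defaultdict(list)
    for ns in sorted(set(namespaces)):
        by_slug[to_workspace_slug(ns)].append(ns)
    return {slug: names for slug, names in by_slug.items() if len(names) > 1}
```

For example, `a:b_c` and `a_b:c` both become `a_b_c`, which is why the canonical namespace is carried in metadata rather than reconstructed from the slug.
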

4. **Unify embedding models**
   - Select a shared embedding model (e.g., `text-embedding-3-large` or local Nomic).
   - Record the chosen model in the mirrored metadata and enforce its dimension when the `anythingllm_vectors` table is created.
   - Update Kompanion’s embedding pipeline to fail fast if the produced dimension differs from the table’s fixed size.
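
The fail-fast check in this step might look like the sketch below. The exception class `DimensionMismatch` and the function `check_dimension` are hypothetical names, and `table_dim` is assumed to come from configuration or from inspecting the table.

```python
from typing import Sequence


class DimensionMismatch(ValueError):
    """Raised when an embedding does not match the table's fixed dimension."""


def check_dimension(vector: Sequence[float], table_dim: int, model: str) -> None:
    """Raise before any write if the vector cannot fit the shared table."""
    if len(vector) != table_dim:
        raise DimensionMismatch(
            f"model {model!r} produced {len(vector)} dimensions, "
            f"but the table expects {table_dim}"
        )
```

One point worth verifying against the pgvector documentation when choosing the model: to our knowledge, indexes on the `vector` type are limited to 2,000 dimensions, while `text-embedding-3-large` returns 3,072 by default. That combination would need a reduced output dimension or a different column type.
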

5. **Expose retrieval APIs**
   - For Kompanion → AnythingLLM: implement a thin adapter that reads from the shared table instead of internal joins when responding to AnythingLLM requests (or simply let AnythingLLM talk directly to Postgres).
   - For AnythingLLM → Kompanion: ensure the metadata payload includes the necessary identifiers (`item_id`, `chunk_id`) so Kompanion can resolve back to full context.

6. **Security & sensitivity handling**
   - Extend the metadata JSON to include Kompanion’s sensitivity/embeddable flags.
   - Patch AnythingLLM ingestion to respect a `sensitivity` key (skip or mask secrets) before inserting into its table, or filter at the view level so secret rows never surface.
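
A sketch of the filtering rule in this step, written as a predicate that a sync worker or ingestion patch could apply. The level name `secret`, the `embeddable` key, and the fail-closed default are all assumptions; the actual flag vocabulary lives in Kompanion.

```python
from typing import Any

# Hypothetical set of sensitivity levels that must never be mirrored.
BLOCKED_LEVELS = frozenset({"secret"})


def is_mirrorable(metadata: dict[str, Any]) -> bool:
    """Fail closed: unlabeled or blocked rows are never mirrored."""
    level = metadata.get("sensitivity")
    if level is None or level in BLOCKED_LEVELS:
        return False
    return bool(metadata.get("embeddable", False))


def filter_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only rows whose metadata passes the sensitivity check."""
    return [row for row in rows if is_mirrorable(row["metadata"])]
```

Treating a missing `sensitivity` key as blocked means a schema drift on either side hides rows instead of leaking them.
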

7. **Validation & tooling**
   - Add a migration checklist covering table creation, index alignment (`USING ivfflat`), and permission grants for the AnythingLLM service role.
   - Create integration tests that:
     1. Upsert an item in Kompanion.
     2. Confirm the mirrored row appears in `anythingllm_vectors`.
     3. Query through the AnythingLLM API and verify that the same chunk text and metadata round-trip.

## Near-Term Tasks

1. Draft SQL for the projection view/materialized view, including JSONB assembly.
2. Prototype a synchronization worker (Python or C++) that mirrors embeddings into the AnythingLLM table.
3. Define namespace slug normalization rules and document them in both repos.
4. Coordinate on embedding model selection and update configuration in both stacks.
5. Add automated compatibility tests to CI pipelines of both projects.