Practical Guide: Building a Memory-Enabled AI Writing Partner (ikun) – Retrieval System (Vector Search, Hybrid Search & Cloud Deployment)

Shengxu included in AI DevOps

2026-01-28 About 3200 words 15 minutes

Contents

In “Practical · Building a Memory-Enabled AI Writing Partner (Part 1): Multi-Agent Architecture Evolution”, I clarified how multiple agents collaborate and how memory is chained together. In “Practical · Building a Memory-Enabled AI Writing Partner (Part 2): Database Evolution (From JSON to Single Database to Relational Tables)”, I reviewed the evolution of the “fact layer” from JSON to SQLite and then to relational tables.

However, when the text length reaches hundreds of thousands of words, what truly determines the experience is often not “whether the data exists,” but “whether I can retrieve it”: exact lookup (did it appear or not), structured filtering (who belongs to whom), and semantic association (is it similar, is it the same atmosphere) must all work simultaneously. So I added a clear “index layer” to FantasyNovelAgent and expanded retrieval from “chapters” to the “full knowledge graph.”

1. First, Clarify the Boundaries: Fact Layer vs. Index Layer

From here on, I establish a fundamental principle:

Source of Truth = data/novel.db (structured data/metadata/KV/FTS) + data/blob_store/ (chapter text objects). Any index, cache, or derived structure must be rebuildable from the Source of Truth.

This principle directly determines how the vector database is designed: the vector database can only be an “index layer,” not a “second Source of Truth.”

The index layer can be rebuilt at any time, can be upgraded with the model, but cannot become the anchor point for facts. Therefore, I structure the retrieval system as a sidecar:

Fact Layer: data/novel.db + data/blob_store/
Index Layer: data/vector_db/ (vector database, rebuildable)

The following diagram shows the minimal architecture view of “Fact Layer vs. Index Layer”:

flowchart LR
  UI["Streamlit UI"] --> CM["ContextManager"]
  CM -->|Read / Write| DB[("data/novel.db
SQLite: Structured / KV / FTS / Metadata")]
  CM -->|Read / Write| BLOB["data/blob_store/
Chapter text objects by ULID"]
  CM -->|Vector index / retrieval| VEC[("data/vector_db/
ChromaDB index layer")]
  VEC --> EMB{"Embedding backend
HF / ONNX / OpenAI"}
  DB -.->|Rebuildable| VEC
  BLOB -.->|Rebuildable| VEC

flowchart LR
  UI["Streamlit UI"] --> CM["ContextManager"]
  CM -->|Read / Write| DB[("data/novel.db
SQLite: Structured / KV / FTS / Metadata")]
  CM -->|Read / Write| BLOB["data/blob_store/
Chapter text objects by ULID"]
  CM -->|Vector index / retrieval| VEC[("data/vector_db/
ChromaDB index layer")]
  VEC --> EMB{"Embedding backend
HF / ONNX / OpenAI"}
  DB -.->|Rebuildable| VEC
  BLOB -.->|Rebuildable| VEC

flowchart LR
  UI["Streamlit UI"] --> CM["ContextManager"]
  CM -->|Read / Write| DB[("data/novel.db
SQLite: Structured / KV / FTS / Metadata")]
  CM -->|Read / Write| BLOB["data/blob_store/
Chapter text objects by ULID"]
  CM -->|Vector index / retrieval| VEC[("data/vector_db/
ChromaDB index layer")]
  VEC --> EMB{"Embedding backend
HF / ONNX / OpenAI"}
  DB -.->|Rebuildable| VEC
  BLOB -.->|Rebuildable| VEC

2. Vector Retrieval (ChromaDB): Making “Semantic Association” a Usable Capability

Relational tables solve “deterministic facts” and “structured queries.” But a writing system also needs to solve another type of problem: semantic association.

“I want to write a passage about feeling disheartened after betrayal; retrieve the most similar scenes for me.”
“Where did the ‘Azure Cloud Sword’ mentioned in this chapter appear before? Has its status changed?”
“What is the mocking catchphrase of Villain A? Find me a few most similar dialogues.”

The commonality of these problems is: it’s hard to express them with a definite field. This is where vector retrieval comes in.

2.1 What Does the Vector Database Actually Do?

You can think of “vector retrieval” as three steps:

Convert text into vectors (Embedding)
The model maps a piece of text into a high-dimensional list of numbers (e.g., 384 or 768 dimensions). Texts with similar meanings will have closer vectors.
Put the vectors into an index (Index)
When the number of texts is large, you can’t do a full comparison every time. The vector database uses an approximate nearest neighbor index (commonly HNSW) to speed up retrieval.
When querying, convert the question into a vector too, then find the “nearest few segments”
This is “semantic retrieval”: you don’t need to input the same keywords to retrieve passages with similar meanings.

In a nutshell:

SQL excels at answering “what is it / how many / who belongs to whom,” while vector databases excel at answering “is it similar / is it the same atmosphere / is it the same type of conflict.”

2.2 Engineering Bottom Line: The Vector Database is a Rebuildable Index Layer

The data principle I adhere to is:

Source of Truth: data/novel.db handles structured data/metadata/KV/FTS; chapter text is in data/blob_store/
Index Replica: The vector database stores “chunked text copies + vector indices”; its value lies in retrieval speed and semantic capability
Rebuildable: If the vector database is corrupted or the model is upgraded, it can be fully rebuilt from the Source of Truth

Therefore, the current implementation adopts a “sidecar” form, rather than stuffing embeddings directly into novel.db:

Vector database directory: data/vector_db/
ChromaDB persistence: data/vector_db/chroma.sqlite3 (stores metadata/records)
HNSW index files: data/vector_db/<uuid>/*.bin (stores vector neighbor graph indices)

Visualizing the “vector database sidecar” makes it more intuitive:

flowchart TB
  subgraph FACT["Fact Layer: Source of Truth"]
    DB[("data/novel.db")]
    BLOB["data/blob_store/"]
    DB --> CH["chapters / drafts"]
    DB --> KV["kv_store"]
    DB --> REL["Relational tables
characters / organizations / ..."]
  end

  subgraph INDEX["Index Layer: Rebuildable"]
    VEC[("data/vector_db/")]
    VEC --> CHS["chunks
source_type=chapter"]
    VEC --> ECS["entity_card
characters / maps / worldbuilding"]
    VEC --> INF["inference"]
    VEC --> MYS["mystery"]
  end

  DB -.->|Full rebuild / incremental update| VEC
  BLOB -.->|Full rebuild / incremental update| VEC

flowchart TB
  subgraph FACT["Fact Layer: Source of Truth"]
    DB[("data/novel.db")]
    BLOB["data/blob_store/"]
    DB --> CH["chapters / drafts"]
    DB --> KV["kv_store"]
    DB --> REL["Relational tables
characters / organizations / ..."]
  end

  subgraph INDEX["Index Layer: Rebuildable"]
    VEC[("data/vector_db/")]
    VEC --> CHS["chunks
source_type=chapter"]
    VEC --> ECS["entity_card
characters / maps / worldbuilding"]
    VEC --> INF["inference"]
    VEC --> MYS["mystery"]
  end

  DB -.->|Full rebuild / incremental update| VEC
  BLOB -.->|Full rebuild / incremental update| VEC

flowchart TB
  subgraph FACT["Fact Layer: Source of Truth"]
    DB[("data/novel.db")]
    BLOB["data/blob_store/"]
    DB --> CH["chapters / drafts"]
    DB --> KV["kv_store"]
    DB --> REL["Relational tables
characters / organizations / ..."]
  end

  subgraph INDEX["Index Layer: Rebuildable"]
    VEC[("data/vector_db/")]
    VEC --> CHS["chunks
source_type=chapter"]
    VEC --> ECS["entity_card
characters / maps / worldbuilding"]
    VEC --> INF["inference"]
    VEC --> MYS["mystery"]
  end

  DB -.->|Full rebuild / incremental update| VEC
  BLOB -.->|Full rebuild / incremental update| VEC

2.3 Concrete Implementation

1) Selection: ChromaDB (Local Persistence + Out-of-the-Box)

My reason for choosing ChromaDB is simple: it can persist locally and encapsulates the “collection + HNSW” indexing capability simply enough to get the loop running first.

Key points:

Persistent client: chromadb.PersistentClient(path="data/vector_db")
collection: novel_chunks
Distance space: cosine

2) Embedding: Local HuggingFace + Online Fallback

Ideally, I use a local HF model for embedding (mean pooling + normalize) to minimize online dependencies.

However, in ARM environments like a Raspberry Pi, engineering often encounters a practical problem: certain torch/inference library binary wheels are incompatible with the CPU instruction set, causing a hard crash (Illegal instruction) at runtime (cannot be caught by try/except).

Therefore, the current implementation provides “multi-backend”:

Local HF/torch: Lowest invocation cost, suitable for x86/Linux or verified compatible environments
OpenAI Embedding (Remote): A stable fallback in ARM environments (at the cost of internet connectivity and embedding API fees)

3) Chunking: Semantic Chunking (Prioritizing Paragraph/Sentence Boundaries)

Why chunk? Because a chapter can be thousands to tens of thousands of words; you need “smaller, retrievable fragments,” otherwise vector retrieval will return a large blob of text, which is both inaccurate and won’t fit into the context.

Initially, I used a baseline approach of “fixed character sliding window + overlap,” but in a novel context, this easily cuts off dialogue/action chains, leading to retrieved fragments lacking context.

Now I’ve upgraded to “semantic chunking”:

Prioritize paragraph breaks: Use blank lines as natural boundaries, assembling paragraphs into chunks close to the target length
For long paragraphs, split by periods/question marks/exclamation marks: Keep sentences as intact as possible
Lightweight overlap: Use a 1-paragraph overlap at the paragraph level to preserve dialogue/action continuity as much as possible

Long-form novels also have a “vector retrieval specific” pitfall: pronoun context (he/she/it). If a chunk starts with “He drew his sword,” the model might not know who “he” is during retrieval. Future enhancements could include:

Attaching the chunk’s primary_character_id (or POV character) in metadata for “filtering or weighting by main character/POV” after retrieval
Or automatically prepending a very short “reference hint” to the chunk text (e.g., “POV for this segment: XXX”) to reduce context pollution

The chunking and update logic is placed in the synchronization flow “after a chapter is successfully saved,” ensuring the index doesn’t lag behind the text.

4) Index Design for “Attached Entities”: ID and Metadata

Vector retrieval must be able to trace back to “where it came from”; otherwise, results are uninterpretable and unmaintainable.

Currently, I clearly define the identity of each chunk:

id: ch_{chapter_ulid}_{chunk_index} (avoids index drift if titles are renamed)
metadata:
- chapter_id
- chapter_ulid
- chapter_title
- chunk_index
- source_type="chapter"

This allows me to filter with where={"chapter_title": ...} and clearly display retrieval results as “from which chapter, which segment.”

(Future expansion to entity cards, inferences, unresolved plot points, etc., only requires adding entity_type/entity_id to the metadata and extending the chunk source from “chapter” to “any entity.”)

5) Update Strategy: “Delete Before Write” on Chapter Update for Consistency

The vector database is an index layer; the biggest fear is “index not updated, leading to retrieval of old content.” Therefore, I adopt a simple and reliable strategy:

After successfully saving a chapter:
- First, delete(where={"chapter_ulid": ...}) (fallback to deleting by title if no ulid)
- Re-chunk
- Batch add

This makes updates idempotent, the logic is clear, and it’s easy to debug.

6) Two Rebuild Methods: Incremental Update + Full Initialization

For operability, I maintain two paths:

Incremental Update: Automatically updates the vector database when saving chapters during daily writing (same as above)
Full Rebuild: Reads all chapters from novel.db, resets the collection, and rebuilds the index

7) Retrieval Entry Point: From ContextManager to UI

The retrieval call chain is:

ContextManager.search_vectors() → VectorManager.search()
The UI provides a “Retrieval-Augmented Generation (RAG)” panel in the main window: supports Hybrid (keyword + semantic) / Keyword only (FTS) / Semantic only (vector), and displays the most recent hit segment

Retrieval-Augmented Generation (RAG) Panel: Hybrid / FTS / Vector

2.4 What the Vector Database Can and Cannot Solve

What the Vector Database Excels At

Fuzzy Retrieval: Find “similar emotions / similar conflicts / similar descriptions”
Memory Extension for Long Books: Quickly retrieve relevant segments from hundreds of thousands of words and assemble them into context
Style and Character Speech Habits: Use “past dialogue segments” to help the model mimic catchphrases and tone

What the Vector Database is Not Good At (Still Needs Relational Tables)

Deterministic State: Whether the protagonist’s current cultivation level is Golden Core or Nascent Soul requires exact match, not fuzzy
Transactional Updates: Item transfers, ownership changes require atomicity and consistency
Structured Filtering: For example, “all surviving disciples belonging to Azure Cloud Sect,” a single SQL statement provides the precise answer

The best combination is always:

Relational Tables (Left Brain): Facts, states, relationship networks, timelines
Vector Database (Right Brain): Association, atmosphere, semantic similarity, memory retrieval

3. Hybrid Retrieval and Full Knowledge Graph: Giving AI “Complete Memory”

The data layer is now a clearly layered system:

data/novel.db: Source of Truth (structured data/metadata/KV/FTS)
data/blob_store/: Source of Truth (chapter text objects, by ulid)
data/vector_db/: Semantic retrieval index (rebuildable)

This means the system is no longer just “able to store and query,” but is beginning to possess the complete retrieval capability of “being able to retrieve and assemble context.”

3.1 Hybrid Retrieval: FTS5 (Exact Lookup) + Vector (Semantic)

Vector retrieval solves “is it similar,” FTS5 solves “did it appear.” They are naturally complementary.

Currently, I present them side-by-side as “dual index layer engines” in the main window, with three mode switches: Hybrid / Keyword only / Semantic only.

More importantly, this is not a “simple concatenation of two results.” In engineering, a common pitfall is “cascading filtering”: first, use FTS to get a candidate set, then only perform vector retrieval within that candidate set. This saves computation but has risks:

For example, if I search for “a feeling of despair,” FTS might not match a single word, resulting in an empty candidate set; but vector retrieval could have retrieved the passage about “feeling disheartened.”

Therefore, my overall approach is “parallel retrieval + fusion ranking”:

Vector Retrieval (Full Database): Run semantic retrieval first to ensure “associative ability is not blocked by keywords”
FTS (Keywords): Run exact lookup simultaneously to ensure deterministic hits for names, places, artifacts, etc.
Fusion: Apply a lightweight fusion ranking (e.g., RRF, Reciprocal Rank Fusion) to the retrieved results, naturally ranking items that “hit both keywords and are semantically similar” higher.

I also retain the optimization path of “FTS candidate → vector retrieval within candidates”: when FTS can hit a clear candidate chapter, I can perform more granular vector retrieval only within that candidate chapter, then fuse it with the full-database vector retrieval, balancing speed and quality.

3.2 FTS5 Synchronization Method: From Triggers to Application-Layer Updates

To adapt to the architecture where text is split into the blob store, I adjusted the synchronization method for chapters_fts to a “manual update” performed by save_chapter(), rather than relying on triggers for automatic synchronization.

The core benefit of this is: the retrieval layer is no longer tightly bound by internal database triggers; even if the text storage format changes, the index can still be maintained at the application layer in a clear and controllable manner.

3.3 Attaching Vectors to “Entity IDs,” Expanding from Chapters to the Full Knowledge Graph

Previously, the vector database only stored chapter chunks. Now, I’ve expanded the index to the entire entity semantic network:

Chapter chunks: source_type="chapter" (with chapter_id/chapter_ulid/chapter_title/chunk_index)
Entity card chunks: source_type="entity_card" (currently covers characters/maps/worldbuilding, with entity_type/entity_key)
Inference/Unresolved Plot Point entries: source_type="inference" / source_type="mystery" (using the entry text as the retrievable unit)

This allows vector retrieval to “retrieve chapter passages + related entity cards/inferences/unresolved plot points in one query,” which is ideal for RAG context assembly.

This change might seem like “just indexing more text,” but it’s significant for the writing system because it upgrades retrieval from “only finding original text” to “being able to bring back the entire worldbuilding”:

When I ask about a noun/clue (e.g., an artifact, a faction, a character), the system can not only retrieve which passages of text it appears in
But also simultaneously retrieve the corresponding character card/location card/worldbuilding fragment, as well as related inferences/unresolved plot points

The ultimate effect is: RAG is no longer a “chapter-level retrieval add-on,” but begins to possess a “retrievable view of the entire book’s knowledge graph.”

4. Future Outlook: Cloud Migration Reservations

If the previous evolution solved “runs reliably on a single machine, gets more stable as you write,” the next step is to address: multi-device sync, long-term operation, and anytime access.

4.1 What Are the Core Needs of a Cloud Service?

Putting a writing system in the cloud isn’t primarily about “high concurrency” or “massive users.” It’s about:

Concurrent writes and sync for the fact layer: No more gambling on syncing an entire db file.
Rebuildable but always-available index layer: Embedding upgrades, index corruption, or model swaps must not affect fact consistency.
API-ification and access control: Any device calls via HTTP; authentication, quotas, and logging must be manageable.
Low operational overhead: No desire to maintain a server, manage containers, or write upgrade and backup scripts.

4.2 What Can Major Cloud Providers Offer?

Mapping these needs to cloud products boils down to three capabilities:

Compute (API/Orchestration): Serverless Functions / Edge Functions / Cloud Run
Relational Data (Fact Layer): Managed Postgres/MySQL or cloud-native SQL
Vector Search (Index Layer): Managed vector databases or embeddings stored in a database (pgvector, etc.)

Corresponding common solutions:

AWS: Lambda + RDS (or Aurora) + vector/search service ecosystem. Powerful but complex to configure, and relational databases often carry the mental burden of “paying even when idle.”
Google Cloud: Cloud Run + Cloud SQL / Firestore + Vertex AI. Good developer experience, but the ecosystem feels “heavy” for personal projects.
Supabase: Managed Postgres + pgvector feels very natural and has a mature ecosystem. However, the free tier has a pause mechanism, and cold starts can affect the experience in some scenarios.

4.3 Cloud Migration Path: Prioritizing Cloudflare (D1 + Vectorize + Workers)

My plan is to upgrade this project from a “single-machine tool” to a service that is “accessible online, syncable across devices, and capable of long-term operation.” Based on the current project structure (data/novel.db + data/blob_store/ + vector index), I will prioritize migrating to a set of Cloudflare managed services, splitting the “fact layer” and “index layer” to the cloud:

Relational Tables: Migrate from local SQLite to Cloudflare D1 (serverless SQL, billed by rows read/written; the free tier has daily limits and storage quotas). Reference: D1 Pricing
Chapter Object Storage: Chapter text is “large text” that has already been moved out of the database and stored as objects (locally in data/blob_store/). For the cloud, migrate to Cloudflare R2 (S3-compatible object storage). D1 should only retain metadata like chapters.ulid/content_key and searchable summary fields to reduce database size and write pressure.
Vector Database: Migrate from local Chroma to Cloudflare Vectorize (the free tier has limits on indexes, namespaces, vectors per index, etc., making it suitable for semantic search in personal/small-scale works). Reference: Vectorize Limits
Search Orchestration: Run the “search fusion logic” (FTS/structured filtering/vector reranking) on Cloudflare Workers. The free tier has limits on request volume and CPU time, which need to be evaluated based on actual access patterns. Reference: Workers Pricing/Free Tier Info

The key principle of this path remains: D1/R2/object storage holds the fact data, while Vectorize holds the rebuildable vector index layer, preventing the index from becoming a “second source of truth.”

If the decision is made to move to the Postgres ecosystem in the future (e.g., for complex SQL, ecosystem tooling, or stronger transactional capabilities), migrating the relational tables to Postgres and using pgvector for embeddings is a natural next step: store embeddings in a vector(n) column, build HNSW/IVFFlat indexes, and easily join with business tables.

5. Summary

This article is about one thing: turning “having memory” into “being able to retrieve.”

Relational tables handle deterministic facts; vector indexes handle semantic association.
FTS5 handles exact lookups; hybrid search turns both into a stable experience.
The index expands from chapters to the entire knowledge graph, so RAG context is no longer just “re-reading the original text.”

If you want to start reading from the fact layer, I recommend beginning with Building a Memory-Equipped AI Writing Partner (Part 2): Database Evolution (From JSON to a Single Database to Relational Tables).

Want updates? Subscribe via RSS