When Graphs Are Overkill: Why We Moved Our Knowledge Graph from Neo4j to Postgres
We moved from a graph store to a relational schema because our retrieval patterns are shallow, repeatable, and index-friendly, so the graph database premium never paid off. Consolidation removed a separate JVM stack, saved 2–8 GB RAM and 2+ vCPU, and aligned embeddings, analytics, and ops in one place. Rule of thumb: if your core traversal fits in 2–4 joins, Postgres with CTEs is usually enough; save graph databases for deep, high-degree traversals where path exploration is the product.
When shallow traversals reveal simpler paths
We built a knowledge graph to power Unoplat Code Confluence’s AgentMD feature. Over the past year, we kept optimizing our schema until the maximum traversal depth dropped to 4, and most queries became simple lookups with 2–4 hop joins.
The operational weight of Neo4j no longer matched our actual traversal needs. Maintaining Cypher queries added friction, and the embedding ecosystem had limitations. That pushed us to re-evaluate the stack and move our graph to PostgreSQL.
What changed in v0.22.0
Unoplat Code Confluence v0.22.0 migrated its knowledge graph storage from Neo4j to PostgreSQL. The goal wasn’t to abandon graph modeling—it was to keep graph semantics while reducing operational complexity.
If your retrieval or RAG use case doesn’t require deep multi-hop traversal, this post shows how to keep performance strong while simplifying infrastructure.
The reasoning, the proof, and the outcomes
Foundational concerns
Will Postgres keep up at scale?
It’s reasonable to worry that replacing a graph database with Postgres could create scaling bottlenecks as we grow. The reality? Postgres scales far beyond a single node when paired with the Citus extension, which distributes data and query execution while preserving SQL and ACID semantics.
Here’s proof from production systems:
| Company | Domain | Results |
|---|---|---|
| Intrinio | Fintech | 99% of queries at ~120 ms on large, multi-join financial datasets with billions of records |
| PushOwl | E-commerce analytics | 8-second queries → ~200 ms, 95th percentile at 30–40 ms, serving 8,500+ tenants |
| ConvertFlow | Marketing analytics | 15–30 second queries → sub-500 ms across hundreds of TB of event data |
| Algolia | Dev tools | 95% of queries under 800 ms while ingesting 5–10 billion rows per day |
If your workload can shard by tenant, account, repository, or dataset, Citus turns Postgres into a horizontally scalable system without abandoning SQL. That’s the same strategy we apply when mapping a knowledge graph to relational keys.
Can Postgres match modern vector and RAG needs?
Postgres now has a robust vector stack:
- VectorChord — Handles large dimensions and multiple index strategies
- pgvectorscale — Brings StreamingDiskANN-style disk indexes to scale beyond RAM
- pgai — Automates embedding generation and synchronization
This removes the manual Cypher and ETL loop we hit with Neo4j. For comparison, Neo4j’s vector index remains HNSW-only with a 4,096-dimension cap—less extensible for newer models and RAG workflows.
Why we moved
- Mismatch between needs and tooling — Graph schema optimization reduced our max traversal from 8 hops to 4, putting us squarely in SQL join territory.
- Operational complexity — We carried a JVM infrastructure stack, separate connection pooling, and a second schema lifecycle, which slowed our turnaround time on every feature, bug fix, and maintenance task.
- Ecosystem alignment — Postgres offers everything we need: a mature vector ecosystem, native support as a Temporal persistence backend for our workflow orchestration, and a proven foundation for our upcoming security infrastructure and the operational metadata our app needs.
- Optimization clarity — Unlike Cypher, SQL optimization draws on decades of publicly available blogs, courses, and case studies, making query intent, indexing strategy, and performance tuning easier to learn, reason about, and iterate on.
How our data actually gets traversed
We audited our most used queries to validate traversal depth assumptions. The real shapes are use-case driven and shallow by design:
| Use case | Join path | Hop count |
|---|---|---|
| Inbound interface discovery (framework → codebases) | framework → codebase_framework → codebase | 2 |
| Inbound interface discovery (feature → codebases) | framework_feature → file_feature → file → codebase | 3 |
| Outbound interface usage (codebase → feature usage) | codebase → file → file_feature → framework_feature → framework | 4 |
| Application data model spans | codebase → file → file_feature → framework_feature | 3 |
| Database data model spans | codebase → file → file_feature → framework_feature | 3 |
Each path is a bounded set of FK joins with no recursive traversal. In Postgres, this is a small, index-friendly join depth, so the relational model stays fast.
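To make that shape concrete, here is a minimal sketch of the 4-hop outbound path as plain FK joins. The table and column names are illustrative, not our production schema, and we use an in-memory SQLite database in place of Postgres so the example is self-contained:

```python
import sqlite3

# Illustrative schema for the 4-hop path:
# codebase -> file -> file_feature -> framework_feature -> framework
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE codebase (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE file (id INTEGER PRIMARY KEY,
                   codebase_id INTEGER REFERENCES codebase(id), path TEXT);
CREATE TABLE framework (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE framework_feature (id INTEGER PRIMARY KEY,
                                framework_id INTEGER REFERENCES framework(id),
                                name TEXT);
CREATE TABLE file_feature (file_id INTEGER REFERENCES file(id),
                           feature_id INTEGER REFERENCES framework_feature(id));

INSERT INTO codebase VALUES (1, 'api-service');
INSERT INTO file VALUES (1, 1, 'src/app.py');
INSERT INTO framework VALUES (1, 'fastapi');
INSERT INTO framework_feature VALUES (1, 1, 'http_endpoint');
INSERT INTO file_feature VALUES (1, 1);
""")

# Bounded 4-join traversal: which framework features does a codebase use?
rows = conn.execute("""
SELECT c.name, fw.name, ff.name
FROM codebase c
JOIN file f               ON f.codebase_id = c.id
JOIN file_feature lf      ON lf.file_id = f.id
JOIN framework_feature ff ON ff.id = lf.feature_id
JOIN framework fw         ON fw.id = ff.framework_id
WHERE c.id = 1
""").fetchall()
print(rows)  # [('api-service', 'fastapi', 'http_endpoint')]
```

Every hop is an equality join on an indexed key, so the planner can resolve the whole path with index lookups and no recursion.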
Visual model comparison
Graph model (Neo4j-style)
flowchart TB
    Repo[GitRepository] -->|CONTAINS_CODEBASE| Codebase[Codebase]
    Codebase -->|CONTAINS_FILE| File[File]
    Codebase -->|HAS_PACKAGE_MANAGER_METADATA| PkgMeta[PackageManagerMetadata]
    Codebase -->|USES_FRAMEWORK| Framework[Framework]
    Framework -->|HAS_FEATURE| Feature[FrameworkFeature]
    File -->|USES_FEATURE| Feature
Relational model (PostgreSQL-style)
erDiagram
CODE_CONFLUENCE_GIT_REPOSITORY ||--o{ CODE_CONFLUENCE_CODEBASE : has
CODE_CONFLUENCE_CODEBASE ||--o{ CODE_CONFLUENCE_FILE : contains
CODE_CONFLUENCE_CODEBASE ||--o{ CODE_CONFLUENCE_PACKAGE_MANAGER_METADATA : has
CODE_CONFLUENCE_CODEBASE ||--o{ CODE_CONFLUENCE_CODEBASE_FRAMEWORK : uses
FRAMEWORK ||--o{ CODE_CONFLUENCE_CODEBASE_FRAMEWORK : maps
CODE_CONFLUENCE_FILE ||--o{ CODE_CONFLUENCE_FILE_FRAMEWORK_FEATURE : has
FRAMEWORK_FEATURE ||--o{ CODE_CONFLUENCE_FILE_FRAMEWORK_FEATURE : maps
Operational impact
- Latency: Queries remain under 100 ms for codebases with 100–150 files
- Infrastructure savings: Neo4j removal frees 2–8 GB RAM and 2+ vCPU
- Simpler operations: One database, fewer connection pools, less schema management
- Embedding ecosystem: Native vector extensions (VectorChord, pgvectorscale) lift Neo4j's HNSW-only and 4,096-dimension limits
- Automated pipelines: pgai enables automated embedding generation and sync—eliminating manual Cypher + ETL loops
- Extensibility: Rich PostgreSQL extension ecosystem (Citus, vector extensions, pgai) provides a future-proof foundation
Deeper performance benchmarks are still in progress.
Industry perspectives: why graphs often fail in practice
Our experience isn’t unique. Here’s what practitioners with years of production experience have observed.
- Hiring difficulty: Finding PostgreSQL experts is far easier than finding graph database specialists
- Schema debates: Graph schema definition creates endless debates without clear best practices
- Overengineering: Most use cases need only 2–4-hop traversals, not complex graph operations
- Maintenance burden: Companies often migrate back to SQL within 4–5 years—the hard part is building and maintaining the graph, not storing it
Jason Liu’s take
When asked “Is knowledge graph RAG production ready? Should we use it?”, Jason Liu (@jxnlco) shared a sobering perspective:
“After 10 years in ML, I stay away from graph databases. Every company I’ve seen go into the graph world moves back to SQL within 4–5 years.”
His most striking observation:
“Even Facebook’s ‘graph’ was actually a large MySQL database. The only company that truly needs graph databases is LinkedIn—for 3–5 degree friendship calculations.”
On the alternative approach for document graphs (like Microsoft’s GraphRAG):
“I’d rather use fine-tuned embeddings. A graph is just an adjacency matrix, and fine-tuning can get you close to that similarity definition without the operational complexity.”
His decision framework:
“Start with your data: let specific use cases justify graph complexity rather than choosing technology first. Graph might be 2% better, but if traditional approaches work well, that 2% rarely justifies the maintenance cost.”
Jo Bergum’s perspective
In an interview with Hamel Husain, Jo Bergum (@jobergum) made a similar observation:
“Most knowledge graphs are triplets and can live in a relational store. The hard part is building and maintaining the graph, not storing it.”
The insight here is crucial: the modeling challenge of knowledge graphs is separate from the storage choice. You can have graph semantics without graph infrastructure.
When graph databases are still worth it
If you need deep, multi-hop traversal at scale with strict latency goals—think friends-of-friends-of-friends at massive graph density—specialized graph databases still make sense.
The canonical example is LinkedIn: calculating 3–5 degree connections across hundreds of millions of users requires the kind of traversal optimization that graph databases are built for.
The key: let evaluation metrics drive the decision, not novelty or hype.
Evaluation principle: build a golden dataset
Before adopting specialized infrastructure:
- Build a small benchmark of queries and expected results
- Track retrieval metrics (precision, recall, latency)
- Compare graph vs. relational approaches on your data
- Quantify the improvement—is it 2% better or 50% better?
If the graph database doesn’t measurably improve outcomes, it’s likely technical debt waiting to happen.
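A golden-dataset comparison can be a small script. The sketch below is hypothetical: the dataset, the expected IDs, and the `retrieve` backend are placeholders you would swap for your own queries against each store:

```python
import time

# Hypothetical golden dataset: query -> set of expected result IDs.
golden = {
    "inbound interfaces for repo X": {"f1", "f2"},
    "outbound feature usage for repo Y": {"f3"},
}

def evaluate(retrieve, golden):
    """Average precision, recall, and latency for one retrieval backend."""
    precisions, recalls, latencies = [], [], []
    for query, expected in golden.items():
        start = time.perf_counter()
        got = set(retrieve(query))
        latencies.append(time.perf_counter() - start)
        hits = len(got & expected)
        precisions.append(hits / len(got) if got else 0.0)
        recalls.append(hits / len(expected))
    n = len(golden)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "avg_latency_s": sum(latencies) / n,
    }

# Stand-in backend; in practice this would run SQL (or Cypher) per query.
def sql_backend(query):
    return ["f1", "f2"] if "inbound" in query else ["f3"]

metrics = evaluate(sql_backend, golden)
print(metrics["precision"], metrics["recall"])  # 1.0 1.0
```

Run the same harness against both backends on identical data; if the graph store's precision and recall are within a couple of points of the relational one, the latency and ops numbers usually decide it.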
Replicating a Neo4j model in PostgreSQL
You can keep graph semantics in Postgres using node tables, edge tables, and JSONB attributes:
| Graph Concept | PostgreSQL Structure |
|---|---|
| Node type | A table per node type with primary key and properties |
| Relationships | Relationship table with start_node_id, end_node_id, relation_type, and properties |
| Attribute flexibility | JSONB for flexible properties |
| Traversal | Recursive CTEs for limited-depth traversal |
Example schema
CREATE TABLE person (
person_id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
age INT
);
CREATE TABLE project (
project_id SERIAL PRIMARY KEY,
title TEXT,
created_at DATE
);
CREATE TABLE relationships (
relationship_id SERIAL PRIMARY KEY,
start_node_id INT NOT NULL,
end_node_id INT NOT NULL,
relation_type TEXT NOT NULL,
attributes JSONB,
CONSTRAINT fk_start FOREIGN KEY (start_node_id) REFERENCES person(person_id),
CONSTRAINT fk_end FOREIGN KEY (end_node_id) REFERENCES project(project_id)
);
Simple traversal query
SELECT p.name, pr.title, r.attributes
FROM person p
JOIN relationships r
ON r.start_node_id = p.person_id
AND r.relation_type = 'WORKS_ON'
JOIN project pr
ON pr.project_id = r.end_node_id;
Recursive traversal
WITH RECURSIVE path(person_id, project_id, depth) AS (
    SELECT r.start_node_id, r.end_node_id, 1
    FROM relationships r
    WHERE r.start_node_id = 1
    UNION ALL
    SELECT p.person_id, r.end_node_id, p.depth + 1
    FROM path p
    JOIN relationships r ON r.start_node_id = p.project_id
    WHERE p.depth < 4  -- bound the recursion to the traversal depth you need
)
SELECT * FROM path;
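The same pattern runs anywhere recursive CTEs are supported. Here is a minimal, runnable sketch using SQLite in place of Postgres (the data is illustrative; with only person→project edges, the recursion terminates after the base case):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE relationships (
    start_node_id INT, end_node_id INT, relation_type TEXT
);
-- person 1 works on projects 10 and 11
INSERT INTO relationships VALUES (1, 10, 'WORKS_ON'), (1, 11, 'WORKS_ON');
""")

# Depth-bounded recursive traversal starting from person 1.
rows = conn.execute("""
WITH RECURSIVE path(person_id, project_id, depth) AS (
    SELECT r.start_node_id, r.end_node_id, 1
    FROM relationships r
    WHERE r.start_node_id = 1
    UNION ALL
    SELECT p.person_id, r.end_node_id, p.depth + 1
    FROM path p
    JOIN relationships r ON r.start_node_id = p.project_id
    WHERE p.depth < 4  -- keep traversal shallow, as in our audit
)
SELECT * FROM path ORDER BY project_id
""").fetchall()
print(rows)  # [(1, 10, 1), (1, 11, 1)]
```

The explicit depth bound is the point: it encodes the "shallow by design" property directly in the query, so a pathological edge set cannot turn a lookup into an unbounded walk.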
What you can do next
- Compare the schema shift: Review unoplat-code-confluence-commons v0.40.0 vs v0.41.0
- Read the release notes: v0.22.0 changelog covers infrastructure savings and migration context
- Replicate this pattern: If your use case is dominated by shallow traversal, try the relational approach and measure against a golden dataset
References
- Unoplat Code Confluence v0.22.0 release notes
- Citus extension for distributed PostgreSQL
- VectorChord — High-dimensional vector indexes
- pgvectorscale — Disk-based vector indexes
- pgai — Automated embedding pipelines
- PostgreSQL docs: WITH queries (Common Table Expressions)
- HNSW paper — Efficient and robust approximate nearest neighbor search
- Neo4j docs: vector index limits and configuration
- Video: Graph Databases — When to Use Them (And When to Run Away) — Hamel Husain and Joe Christian
- Citus customer case studies
- Jason Liu (@jxnlco) — Posts on graph database tradeoffs in production ML systems
