Unoplat Blog
When Graphs Are Overkill: Why We Moved Our Knowledge Graph from Neo4j to Postgres

When shallow traversals reveal simpler paths

We built a knowledge graph to power Unoplat Code Confluence’s AgentMD feature. Over the past year, we kept optimizing our schema until the maximum traversal depth dropped to 4, and most queries became simple lookups with 2–4 hop joins.

The operational weight of Neo4j no longer matched our actual traversal needs. Maintaining Cypher queries added friction, and the embedding ecosystem had limitations. That pushed us to re-evaluate the stack and move our graph to PostgreSQL.

What changed in v0.22.0

Unoplat Code Confluence v0.22.0 migrated its knowledge graph storage from Neo4j to PostgreSQL. The goal wasn’t to abandon graph modeling—it was to keep graph semantics while reducing operational complexity.

The reasoning, the proof, and the outcomes

Foundational concerns

Will Postgres keep up at scale?

It’s reasonable to worry that replacing a graph database with Postgres could create scaling bottlenecks as we grow. The reality? Postgres scales far beyond a single node when paired with the Citus extension, which distributes data and query execution while preserving SQL and ACID semantics.

Here’s proof from production systems:

| Company | Domain | Results |
| --- | --- | --- |
| Intrinio | Fintech | 99% of queries at ~120 ms on large, multi-join financial datasets with billions of records |
| PushOwl | E-commerce analytics | 8-second queries → ~200 ms, 95th percentile at 30–40 ms, serving 8,500+ tenants |
| ConvertFlow | Marketing analytics | 15–30 second queries → sub-500 ms across hundreds of TB of event data |
| Algolia | Dev tools | 95% of queries under 800 ms while ingesting 5–10 billion rows per day |

If your workload can shard by tenant, account, repository, or dataset, Citus turns Postgres into a horizontally scalable system without abandoning SQL. That’s the same strategy we apply when mapping a knowledge graph to relational keys.
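To make the sharding idea concrete, here is a minimal sketch of distributing a table with Citus. The `files` table and `repository_id` column are hypothetical stand-ins for whatever your natural tenant key is; the only real API used is Citus's `create_distributed_table`.

```sql
-- Hypothetical example: distribute a files table by repository so that
-- rows for the same repository co-locate on one worker node.
CREATE EXTENSION IF NOT EXISTS citus;

CREATE TABLE files (
    repository_id BIGINT NOT NULL,
    file_id       BIGINT NOT NULL,
    path          TEXT   NOT NULL,
    PRIMARY KEY (repository_id, file_id)
);

-- Shard by repository_id; joins and filters on the same key stay node-local.
SELECT create_distributed_table('files', 'repository_id');
```

After this, ordinary SQL keeps working; queries that filter on the distribution key are routed to a single shard.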

Can Postgres match modern vector and RAG needs?

Postgres now has a robust vector stack:

  • VectorChord — Handles large dimensions and multiple index strategies
  • pgvectorscale — Brings StreamingDiskANN-style disk indexes to scale beyond RAM
  • pgai — Automates embedding generation and synchronization

This removes the manual Cypher and ETL loop we hit with Neo4j. For comparison, Neo4j’s vector index remains HNSW-only with a 4,096-dimension cap—less extensible for newer models and RAG workflows.
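As a rough sketch of what the Postgres vector stack looks like in practice: the snippet below uses the standard pgvector column type and distance operator, which VectorChord and pgvectorscale build their indexes on top of. The table name and 3-dimension embedding are illustrative only (real models use hundreds to thousands of dimensions).

```sql
-- Minimal pgvector sketch; table and column names are assumptions.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE file_embeddings (
    file_id   BIGINT PRIMARY KEY,
    embedding vector(3)  -- tiny dimension purely for illustration
);

-- Nearest-neighbour lookup by cosine distance for a query embedding.
SELECT file_id
FROM file_embeddings
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
LIMIT 5;
```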

Why we moved

  • Mismatch between needs and tooling — Graph schema optimization reduced our max traversal from 8 hops to 4, putting us squarely in SQL join territory.
  • Operational complexity — We carried a JVM infrastructure stack, separate connection pooling, and a second schema lifecycle, which slowed our turnaround on every feature, bug fix, and maintenance task.
  • Ecosystem alignment — Postgres offers everything we need: a mature vector ecosystem, native support as a Temporal persistence backend for our workflow orchestration, and a proven foundation for our upcoming security infrastructure and the operational metadata our app needs.
  • Optimization clarity — Unlike Cypher, SQL optimization draws on decades of publicly available blogs, courses, and case studies. This makes query intent, indexing strategy, and performance tuning decisions easier to learn, reason about, and iterate on.

How our data actually gets traversed

We audited our most used queries to validate traversal depth assumptions. The real shapes are use-case driven and shallow by design:

| Use case | Join path | Hop count |
| --- | --- | --- |
| Inbound interface discovery (framework → codebases) | framework → codebase_framework → codebase | 2 |
| Inbound interface discovery (feature → codebases) | framework_feature → file_feature → file → codebase | 3 |
| Outbound interface usage (codebase → feature usage) | codebase → file → file_feature → framework_feature → framework | 4 |
| Application data model spans | codebase → file → file_feature → framework_feature | 3 |
| Database data model spans | codebase → file → file_feature → framework_feature | 3 |

Each path is a bounded set of FK joins with no recursive traversal. In Postgres, this is a small, index-friendly join depth, so the relational model stays fast.
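The deepest path above (outbound interface usage) can be sketched as four plain FK joins. The table and column names here are illustrative assumptions, not our exact schema:

```sql
-- Illustrative 4-hop path:
-- codebase → file → file_feature → framework_feature → framework
SELECT DISTINCT fw.name AS framework, ff.name AS feature
FROM codebase cb
JOIN file f               ON f.codebase_id = cb.id
JOIN file_feature flf     ON flf.file_id = f.id
JOIN framework_feature ff ON ff.id = flf.framework_feature_id
JOIN framework fw         ON fw.id = ff.framework_id
WHERE cb.name = 'my-service';
```

With indexes on each foreign key, the planner resolves this as a short chain of index lookups, with no recursion involved.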

Visual model comparison

Graph model (Neo4j-style)

flowchart TB
  Repo[GitRepository] -->|CONTAINS_CODEBASE| Codebase[Codebase]
  Codebase -->|CONTAINS_FILE| File[File]
  Codebase -->|HAS_PACKAGE_MANAGER_METADATA| PkgMeta[PackageManagerMetadata]
  Codebase -->|USES_FRAMEWORK| Framework[Framework]
  Framework -->|HAS_FEATURE| Feature[FrameworkFeature]
  File -->|USES_FEATURE| Feature

Relational model (PostgreSQL-style)

erDiagram
  CODE_CONFLUENCE_GIT_REPOSITORY ||--o{ CODE_CONFLUENCE_CODEBASE : has
  CODE_CONFLUENCE_CODEBASE ||--o{ CODE_CONFLUENCE_FILE : contains
  CODE_CONFLUENCE_CODEBASE ||--o{ CODE_CONFLUENCE_PACKAGE_MANAGER_METADATA : has
  CODE_CONFLUENCE_CODEBASE ||--o{ CODE_CONFLUENCE_CODEBASE_FRAMEWORK : uses
  FRAMEWORK ||--o{ CODE_CONFLUENCE_CODEBASE_FRAMEWORK : maps
  CODE_CONFLUENCE_FILE ||--o{ CODE_CONFLUENCE_FILE_FRAMEWORK_FEATURE : has
  FRAMEWORK_FEATURE ||--o{ CODE_CONFLUENCE_FILE_FRAMEWORK_FEATURE : maps

Operational impact

Deeper performance benchmarks are still in progress.

Industry perspectives: why graphs often fail in practice

Our experience isn’t unique. Here’s what practitioners with years of production experience have observed.

Jason Liu’s take

When asked “Is knowledge graph RAG production ready? Should we use it?”, Jason Liu (@jxnlco) shared a sobering perspective:

“After 10 years in ML, I stay away from graph databases. Every company I’ve seen go into the graph world moves back to SQL within 4–5 years.”

His most striking observation:

“Even Facebook’s ‘graph’ was actually a large MySQL database. The only company that truly needs graph databases is LinkedIn—for 3–5 degree friendship calculations.”

On the alternative approach for document graphs (like Microsoft’s GraphRAG):

“I’d rather use fine-tuned embeddings. A graph is just an adjacency matrix, and fine-tuning can get you close to that similarity definition without the operational complexity.”

His decision framework:

“Start with your data: let specific use cases justify graph complexity rather than choosing technology first. Graph might be 2% better, but if traditional approaches work well, that 2% rarely justifies the maintenance cost.”

Jo Bergum’s perspective

In an interview with Hamel Husain, Jo Bergum (@jobergum) made a similar observation:

“Most knowledge graphs are triplets and can live in a relational store. The hard part is building and maintaining the graph, not storing it.”

The insight here is crucial: the modeling challenge of knowledge graphs is separate from the storage choice. You can have graph semantics without graph infrastructure.

When graph databases are still worth it

If you need deep, multi-hop traversal at scale with strict latency goals—think friends-of-friends-of-friends at massive graph density—specialized graph databases still make sense.

The canonical example is LinkedIn: calculating 3–5 degree connections across hundreds of millions of users requires the kind of traversal optimization that graph databases are built for.

The key: let evaluation metrics drive the decision, not novelty or hype.

Evaluation principle: build a golden dataset

Before adopting specialized infrastructure:

  1. Build a small benchmark of queries and expected results
  2. Track retrieval metrics (precision, recall, latency)
  3. Compare graph vs. relational approaches on your data
  4. Quantify the improvement—is it 2% better or 50% better?

If the graph database doesn’t measurably improve outcomes, it’s likely technical debt waiting to happen.
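The comparison in step 3 can live in plain SQL. Below is a hedged sketch of per-query recall, assuming hypothetical tables `golden_set` (labelled expected results) and `retrieved` (what one approach actually returned, deduplicated):

```sql
-- golden_set(query_id, expected_id): labelled answers per benchmark query.
-- retrieved(query_id, result_id): results returned by the approach under test.
SELECT g.query_id,
       COUNT(r.result_id)::float
         / NULLIF(COUNT(g.expected_id), 0) AS recall
FROM golden_set g
LEFT JOIN retrieved r
  ON r.query_id = g.query_id
 AND r.result_id = g.expected_id
GROUP BY g.query_id;
```

`COUNT(r.result_id)` counts only matched rows (NULLs from the left join are skipped), so the ratio is retrieved-and-expected over expected.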

Replicating a Neo4j model in PostgreSQL

You can keep graph semantics in Postgres using node tables, edge tables, and JSONB attributes:

| Graph concept | PostgreSQL structure |
| --- | --- |
| Node type | A table per node type with primary key and properties |
| Relationships | Relationship table with start_node_id, end_node_id, relation_type, and properties |
| Attribute flexibility | JSONB for flexible properties |
| Traversal | Recursive CTEs for limited-depth traversal |

Example schema

CREATE TABLE person (
    person_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    age INT
);

CREATE TABLE project (
    project_id SERIAL PRIMARY KEY,
    title TEXT,
    created_at DATE
);

CREATE TABLE relationships (
    relationship_id SERIAL PRIMARY KEY,
    start_node_id INT NOT NULL,
    end_node_id INT NOT NULL,
    relation_type TEXT NOT NULL,
    attributes JSONB,
    CONSTRAINT fk_start FOREIGN KEY (start_node_id) REFERENCES person(person_id),
    CONSTRAINT fk_end   FOREIGN KEY (end_node_id)   REFERENCES project(project_id)
);

Simple traversal query

SELECT p.name, pr.title, r.attributes
FROM person p
JOIN relationships r
  ON r.start_node_id = p.person_id
  AND r.relation_type = 'WORKS_ON'
JOIN project pr
  ON pr.project_id = r.end_node_id;

Recursive traversal

WITH RECURSIVE path(person_id, project_id, depth) AS (
    SELECT r.start_node_id, r.end_node_id, 1
    FROM relationships r
    WHERE r.start_node_id = 1

    UNION ALL

    SELECT p.person_id, r.end_node_id, p.depth + 1
    FROM path p
    JOIN relationships r ON r.start_node_id = p.project_id
    WHERE p.depth < 4  -- bound the traversal depth
)
SELECT * FROM path;

What you can do next

  1. Compare the schema shift: Review unoplat-code-confluence-commons v0.40.0 vs v0.41.0
  2. Read the release notes: v0.22.0 changelog covers infrastructure savings and migration context
  3. Replicate this pattern: If your use case is dominated by shallow traversal, try the relational approach and measure against a golden dataset

Written by

Builder and maintainer of Unoplat.
