Source: dev.to

Show Dev: Self-reinforcing K-pop data pipeline using Spring Boot and pgvector (Built on OCI Free Tier)

A Seoul-based backend developer built k-cosmos, an interactive 3D music space that maps K-pop tracks using 768-dimensional vector embeddings, running entirely on the OCI Free Tier. The system uses a self-reinforcing pipeline where an LLM analyzes mood and aesthetic to generate search keywords that fuel future ingestion, enabling autonomous growth. To handle performance constraints with 4,000 tracks, the developer implemented a three-phase transaction model with Java 21 Virtual Threads and a PostgreSQL query using pgvector for diversified nearest-neighbor search.

Published Jun 29, 2026

Hi everyone,

I'm a backend developer based in Seoul. I built k-cosmos, an interactive web-based 3D music space that maps K-pop tracks based on 768-dimensional vector embeddings.

The main reason I had to build this from scratch is that there's no clean, structured K-pop metadata or emotional tag dataset available anywhere.

How the pipeline grows itself

It runs on an autonomous background sync cycle. First, the system ingests tracks and uses an LLM to analyze each track's mood and aesthetic. Then the LLM reverse-engineers search keywords from that analysis. These keywords are written back to the database and feed the next day's ingestion scheduler, so the system expands its data footprint without human intervention.
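
The feedback loop described above can be sketched in memory as follows. This is only an illustration under stated assumptions, not the project's actual code: the class name, the two function parameters (a stand-in for the music search call and a stand-in for the LLM analysis step), and the sample data are all hypothetical, and a real scheduler would persist keywords in the database instead of a queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

/** In-memory sketch of a keyword feedback loop; every name here is hypothetical. */
class KeywordFeedbackLoop {
    private final Queue<String> pendingKeywords = new ArrayDeque<>(); // fuel for the next cycle
    private final Set<String> seenKeywords = new HashSet<>();         // avoids re-searching a term
    private final Set<String> knownTracks = new HashSet<>();          // avoids re-ingesting a track
    private final Function<String, List<String>> searchTracks;        // stand-in: keyword -> tracks
    private final Function<String, List<String>> suggestKeywords;     // stand-in: track -> LLM keywords

    KeywordFeedbackLoop(Function<String, List<String>> searchTracks,
                        Function<String, List<String>> suggestKeywords) {
        this.searchTracks = searchTracks;
        this.suggestKeywords = suggestKeywords;
    }

    /** Queues a keyword unless it has been queued before. */
    void seed(String keyword) {
        if (seenKeywords.add(keyword)) {
            pendingKeywords.add(keyword);
        }
    }

    /** Runs one sync cycle and returns how many new tracks were ingested. */
    int runCycle() {
        // Only keywords queued before this cycle are used; new ones wait for the next run.
        List<String> batch = new ArrayList<>(pendingKeywords);
        pendingKeywords.clear();
        int ingested = 0;
        for (String keyword : batch) {
            for (String track : searchTracks.apply(keyword)) {
                if (!knownTracks.add(track)) {
                    continue; // already ingested in an earlier cycle
                }
                ingested++;
                suggestKeywords.apply(track).forEach(this::seed);
            }
        }
        return ingested;
    }

    int trackCount() {
        return knownTracks.size();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the search API and the LLM output.
        Map<String, List<String>> catalog = Map.of(
                "k-pop ballad", List.of("Track A", "Track B"),
                "dreamy synth", List.of("Track B", "Track C"));
        Map<String, List<String>> analysis = Map.of(
                "Track A", List.of("dreamy synth"),
                "Track B", List.of("dreamy synth", "k-pop ballad"));
        KeywordFeedbackLoop loop = new KeywordFeedbackLoop(
                k -> catalog.getOrDefault(k, List.of()),
                t -> analysis.getOrDefault(t, List.of()));
        loop.seed("k-pop ballad");
        for (int day = 1; day <= 3; day++) {
            System.out.println("day " + day + ": ingested " + loop.runCycle());
        }
        System.out.println("total tracks: " + loop.trackCount());
    }
}
```

The two "seen" sets are the part worth noting: without them, keywords and tracks that point at each other would be re-processed on every cycle.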

Architectural decisions under hard constraints

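Since everything runs on the OCI Free Tier with around 4,000 tracks, I had to resolve several performance bottlenecks at the database and thread layers. Each ingestion job is split into a three-phase transaction model so that slow external calls never hold a database connection: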

Phase 1 (Short TX): Claim the target track via FOR UPDATE SKIP LOCKED and immediately flip the status to PROCESSING to isolate rows for worker concurrency. Commit and release the connection.

Phase 2 (Zero TX): Perform the heavy external network I/O and embedding generation while holding zero active DB connections.

Phase 3 (Short TX): Open a short transaction to persist the final structured entity data.

The entire flow runs over Java 21 Virtual Threads to minimize scheduling overhead during I/O wait states.
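
As a rough sketch of the three phases, the example below replaces PostgreSQL with in-memory maps so it can run on its own. The class and method names are hypothetical, the atomic compare-and-swap in claimNext() is only a stand-in for FOR UPDATE SKIP LOCKED plus the status flip, and the sleep stands in for the external embedding call. It requires Java 21 for virtual threads.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** In-memory sketch of claim / process / persist; every name here is hypothetical. */
class ThreePhaseWorker {
    enum Status { PENDING, PROCESSING, COMPLETED }

    // Stand-ins for table columns (assumption: the real system stores these in PostgreSQL).
    private final Map<Long, Status> statusById = new ConcurrentHashMap<>();
    private final Map<Long, float[]> embeddingById = new ConcurrentHashMap<>();

    void addPending(long id) {
        statusById.put(id, Status.PENDING);
    }

    /** Phase 1: claim one row. A worker that loses the race simply moves on to the next row. */
    Optional<Long> claimNext() {
        for (Long id : statusById.keySet()) {
            if (statusById.replace(id, Status.PENDING, Status.PROCESSING)) {
                return Optional.of(id);
            }
        }
        return Optional.empty();
    }

    /** Phase 2: slow external work; nothing shared is held while this runs. */
    private float[] embed(long id) {
        try {
            Thread.sleep(5); // simulated network latency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new float[] { (float) id, 1.0f }; // placeholder vector, not a real embedding
    }

    /** Phase 3: short write of the final result. */
    private void persist(long id, float[] embedding) {
        embeddingById.put(id, embedding);
        statusById.put(id, Status.COMPLETED);
    }

    /** Starts the given number of virtual-thread workers and returns how many rows they processed. */
    int drain(int workers) {
        AtomicInteger processed = new AtomicInteger();
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < workers; i++) {
                pool.submit(() -> {
                    Optional<Long> claimed;
                    while ((claimed = claimNext()).isPresent()) {
                        long id = claimed.get();
                        persist(id, embed(id));
                        processed.incrementAndGet();
                    }
                });
            }
        } // close() waits for all submitted tasks to finish
        return processed.get();
    }

    long completedCount() {
        return statusById.values().stream().filter(s -> s == Status.COMPLETED).count();
    }

    public static void main(String[] args) {
        ThreePhaseWorker worker = new ThreePhaseWorker();
        for (long id = 1; id <= 20; id++) {
            worker.addPending(id);
        }
        System.out.println("processed: " + worker.drain(4));
        System.out.println("completed: " + worker.completedCount());
    }
}
```

In a database-backed version, Phase 1 and Phase 3 would each be their own short transaction, which is what keeps the connection pool free during the slow middle step.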

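For recommendations, a single PostgreSQL query with pgvector performs a diversified nearest-neighbor search:

-- Step 1. Pull the nearest poolSize tracks (ORDER BY ... LIMIT is the form a vector index can serve).
WITH candidates AS (
    SELECT *, embedding <=> CAST(:embedding AS vector) AS distance  -- <=> is cosine distance
    FROM cosmos_tracks
    WHERE status = 'COMPLETED'
      AND cluster_id = :clusterId
      AND id NOT IN (:excludeIds)
    ORDER BY embedding <=> CAST(:embedding AS vector)
    LIMIT :poolSize
),
-- Step 2. Rank each artist's tracks by distance within the pool.
diversified AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY artist ORDER BY distance) AS artist_rank
    FROM candidates
)
-- Step 3. Keep at most maxPerArtist tracks per artist, nearest first.
SELECT *
FROM diversified
WHERE artist_rank <= :maxPerArtist
ORDER BY distance
LIMIT :limit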

This keeps the nearest-neighbor scan in a form the vector index can serve, while capping how many tracks any one artist can contribute, all in a single round trip.
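
To make the query's semantics concrete, here is an in-memory Java equivalent of the pool-then-cap logic. It is a sketch only: the class, record, and method names are hypothetical, the status and cluster filters are omitted, and a plain sort replaces the index scan. It also shows the trade-off in choosing the pool size, since a pool dominated by one artist can yield fewer results than the limit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** In-memory sketch of a per-artist-capped nearest-neighbor search; names are hypothetical. */
class DiversifiedSearch {
    record Track(long id, String artist, double[] embedding) {}

    /** Cosine distance, the metric behind pgvector's <=> operator. */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static List<Track> search(List<Track> tracks, double[] query, Set<Long> excludeIds,
                              int poolSize, int maxPerArtist, int limit) {
        // Mirrors the "candidates" CTE: filter, order by distance, keep the nearest poolSize.
        List<Track> candidates = tracks.stream()
                .filter(t -> !excludeIds.contains(t.id()))
                .sorted(Comparator.comparingDouble((Track t) -> cosineDistance(t.embedding(), query)))
                .limit(poolSize)
                .toList();

        // Mirrors ROW_NUMBER() OVER (PARTITION BY artist ORDER BY distance):
        // candidates are already sorted, so a running count per artist is that artist's rank.
        Map<String, Integer> rankByArtist = new HashMap<>();
        List<Track> result = new ArrayList<>();
        for (Track t : candidates) {
            int rank = rankByArtist.merge(t.artist(), 1, Integer::sum);
            if (rank <= maxPerArtist && result.size() < limit) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up 2-dimensional vectors; the real embeddings have 768 dimensions.
        List<Track> tracks = List.of(
                new Track(1, "A", new double[] {1.0, 0.0}),
                new Track(2, "A", new double[] {0.99, 0.1}),
                new Track(3, "A", new double[] {0.98, 0.2}),
                new Track(4, "B", new double[] {0.9, 0.4}),
                new Track(5, "C", new double[] {0.0, 1.0}));
        double[] query = {1.0, 0.0};
        search(tracks, query, Set.of(), 5, 1, 3)
                .forEach(t -> System.out.println(t.id() + " by " + t.artist()));
    }
}
```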

I deliberately chose a Thymeleaf SSR hybrid architecture instead of splitting the frontend into a separate SPA. This keeps everything in a single deployment unit and maintains high operational visibility (P6Spy, Actuator).

Live Project: https://cosmos.codeghost.cloud/

I'm very happy to discuss any architectural or design decisions. Let me know your thoughts or hit me with any questions!
