Show Dev: Self-reinforcing K-pop data pipeline using Spring Boot and pgvector (Built on OCI Free Tier)

A Seoul-based backend developer built k-cosmos, an interactive 3D music space that maps K-pop tracks using 768-dimensional vector embeddings, running entirely on the OCI Free Tier. The system uses a self-reinforcing pipeline in which an LLM analyzes mood and aesthetic to generate search keywords that fuel future ingestion, enabling autonomous growth. To handle performance constraints with around 4,000 tracks, the developer implemented a three-phase transaction model on Java 21 Virtual Threads and a PostgreSQL query using pgvector for diversified nearest-neighbor search.

Hi everyone, I'm a backend developer based in Seoul. I built k-cosmos, an interactive web-based 3D music space that maps K-pop tracks based on 768-dimensional vector embeddings. The main reason I had to build this from scratch is that there is no clean, structured K-pop metadata or emotional-tag dataset available anywhere.

How the pipeline grows itself

It runs on an autonomous background sync cycle. First, the system ingests tracks and uses an LLM to analyze each track's mood and aesthetic. Then, the AI reverse-engineers low-latency search keywords based on that analysis. These keywords are absorbed back into the database to fuel the next day's ingestion scheduler, allowing the system to expand its data footprint without human intervention.

Architectural decisions under hard constraints

Since I am running everything on the OCI Free Tier with around 4,000 tracks, I had to resolve several performance bottlenecks at the database and thread layers.

Phase 1 (Short TX): Claim the target track via FOR UPDATE SKIP LOCKED and immediately flip its status to PROCESSING, so that concurrent workers never pick up the same row. Commit and release the connection.

Phase 2 (Zero TX): Perform the heavy external network I/O and embedding generation while holding zero active DB connections.

Phase 3 (Short TX): Open a short transaction to persist the final structured entity data.

The entire flow runs on Java 21 Virtual Threads to minimize scheduling overhead during I/O wait states.
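The three phases can be sketched roughly as follows. This is a minimal illustration, not the k-cosmos code: it swaps PostgreSQL for an in-memory map, so the atomic `replace()` only imitates what FOR UPDATE SKIP LOCKED does, and `TrackStore`, `claimNext`, `embed`, and `complete` are hypothetical names chosen for the example. It assumes Java 21.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreePhaseSketch {

    enum Status { PENDING, PROCESSING, COMPLETED }

    /** In-memory stand-in for the track table (hypothetical, for illustration only). */
    static final class TrackStore {
        private final Map<Integer, Status> rows = new ConcurrentHashMap<>();
        private final Map<Integer, float[]> embeddings = new ConcurrentHashMap<>();

        TrackStore(int trackCount) {
            for (int id = 0; id < trackCount; id++) {
                rows.put(id, Status.PENDING);
            }
        }

        // Phase 1 (short TX): claim one PENDING row and flip it to PROCESSING.
        // The atomic replace() imitates SKIP LOCKED: a row that another worker
        // already won is simply skipped instead of waited on.
        Optional<Integer> claimNext() {
            for (Integer id : rows.keySet()) {
                if (rows.replace(id, Status.PENDING, Status.PROCESSING)) {
                    return Optional.of(id);
                }
            }
            return Optional.empty();
        }

        // Phase 3 (short TX): persist the result and mark the row COMPLETED.
        void complete(int id, float[] embedding) {
            embeddings.put(id, embedding);
            rows.put(id, Status.COMPLETED);
        }

        long count(Status status) {
            return rows.values().stream().filter(status::equals).count();
        }
    }

    // Phase 2 (zero TX): placeholder for the slow external call.
    // No store access happens here, mirroring "zero active DB connections".
    static float[] embed(int id) throws InterruptedException {
        Thread.sleep(10); // simulated network latency
        return new float[] { id, id * 0.5f };
    }

    /** Runs the given number of workers, each on its own virtual thread. */
    static int runWorkers(TrackStore store, int workers) {
        AtomicInteger processed = new AtomicInteger();
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int w = 0; w < workers; w++) {
                pool.submit(() -> {
                    Optional<Integer> claimed;
                    while ((claimed = store.claimNext()).isPresent()) {
                        int id = claimed.get();
                        float[] vector = embed(id);  // phase 2
                        store.complete(id, vector);  // phase 3
                        processed.incrementAndGet();
                    }
                    return null; // Callable, so embed() may throw a checked exception
                });
            }
        } // close() waits for all submitted tasks to finish
        return processed.get();
    }

    public static void main(String[] args) {
        TrackStore store = new TrackStore(20);
        int processed = runWorkers(store, 4);
        System.out.println("processed=" + processed
                + ", completed=" + store.count(Status.COMPLETED));
    }
}
```

Because a row can only move from PENDING to PROCESSING once, every track is handled by exactly one worker regardless of how many run concurrently. In a real Spring Boot service, phases 1 and 3 would presumably each be their own short transaction, with the phase 2 call made outside any transactional scope.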
For the diversified nearest-neighbor search, I use a single PostgreSQL query with pgvector:

```sql
WITH candidates AS (
    SELECT *, embedding <=> CAST(:embedding AS vector) AS distance
    FROM cosmos_tracks
    WHERE status = 'COMPLETED'
      AND cluster_id = :clusterId
      AND id NOT IN (:excludeIds)
    ORDER BY embedding <=> CAST(:embedding AS vector)
    LIMIT :poolSize
),
diversified AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY artist ORDER BY distance) AS artist_rank
    FROM candidates
)
SELECT *
FROM diversified
WHERE artist_rank <= :maxPerArtist
ORDER BY distance
LIMIT :limit
```

This preserves index efficiency while strictly capping how many tracks each artist can contribute, all in a single roundtrip.

I deliberately chose a Thymeleaf SSR hybrid architecture to keep a single deployment unit and maintain high operational visibility (P6Spy, Actuator) instead of splitting the frontend into a separate SPA.

Live Project: https://cosmos.codeghost.cloud/

I'm very happy to discuss any architectural or design decisions. Let me know your thoughts or hit me with any questions.