Open Source tech jobs portal and database

Caio, an open-source tech jobs platform, launched a public job index and search portal at caio-jobs.com to reduce the manual burden of job hunting for software professionals. The project, currently a searchable database with a Phoenix/Elixir web app and Ruby on Rails crawler, plans to evolve into a supervised job-search agent that automatically finds relevant roles, tailors application materials, and tracks outcomes. The early-stage repository prioritizes a working crawler and simple deployment over polished architecture, using SQLite for speed while candidates remain responsible for filtering and applying.

Caio is an open-source attempt to make job hunting less manual for software professionals. The current product is a public tech-job index. That is useful on its own, but it is not the end goal. The job board is the data layer and acquisition surface for a larger product: a supervised job-search agent that can continuously find relevant roles, adapt application material, and help a candidate apply without spending hours repeating the same search/forms/CV-tweaking loop. This repo is early, practical, and intentionally boring in places. It favors a working crawler, simple deploys, server-rendered pages, and observable user flows over a polished distributed architecture. Live site: caio-jobs.com https://caio-jobs.com Job boards mostly move the filtering burden onto candidates. The painful part is not only finding jobs; it is repeatedly deciding whether a role is relevant, editing the same CV for each posting, filling forms, tracking what happened, and doing it again tomorrow. Caio starts with the searchable job corpus because the agent needs one. From there, the useful product becomes: - Keep finding fresh jobs that match a candidate profile. - Explain why a job is or is not a good fit. - Tailor CV/application material to the role. - Track applications and outcomes. - Let the user supervise the workflow instead of manually doing every step. Caio is a monorepo with two main apps: php public job sources - crawler workers - SQLite job posts - Phoenix portal - leads + job interests tracking crawler/ : Ruby on Rails plus Sidekiq workers for collecting, normalizing, deduplicating, and storing public job postings. portal/ : Phoenix/Elixir web app for the public search experience, profile unlock flow, GitHub login, analytics, and apply-click tracking. deploy/ : production deployment scripts and systemd units for a single Google Cloud VM. marketing/ : launch copy and social-post drafts. The current production-friendly setup intentionally uses SQLite. That is not a claim that SQLite is the final architecture; it is just the fastest path to a small, understandable system while the product is still being shaped. The natural next step is Postgres plus separate crawler/web machines. - Shared SQLite database between Rails ingestion and Phoenix serving. - SQLite FTS5 index maintained by Phoenix migrations and triggers. - Rails/Sidekiq crawler split into source fanout, fetch, detail, and write queues. - Normalization for salary, location, source keys, canonical URLs, and job quality. - Server-rendered Phoenix UI with minimal JavaScript. - GitHub OAuth and email unlock flow feeding a simple leads table. - PostHog events for search, unlock, login, job detail views, and apply clicks. - Single-VM production deployment with systemd, Caddy, Redis, and SQLite backups. - Public landing page with SEO and social sharing metadata. - Full-text search across title, company, location, tags, category, and description. - Guest preview with a free unlock flow. - GitHub OAuth login. - Lead/profile capture with email, optional LinkedIn URL, target role, target location, and job-help consent. - Apply-click tracking before redirecting users to the original job source. - Company stats based on the number of visible open jobs in Caio. - PostHog analytics hooks for page views, unlocks, GitHub login, and apply clicks. - Some crawler paths still reprocess old pages instead of storing complete cursor state per paged source. - Import metrics currently blur inserts and updates in some paths. - SQLite is acceptable for this stage, but it will need a more deliberate data architecture as write volume grows. - The agent layer is not here yet; today this is the search/indexing foundation. - Source adapters need ongoing maintenance because public job endpoints change, rate-limit, or disappear. . ├── bin/ Local orchestration helpers ├── crawler/ Rails + Sidekiq ingestion system ├── deploy/google-cloud/ VM bootstrap, Caddy, systemd, backup docs ├── marketing/ Launch assets and copy └── portal/ Phoenix web interface - Ruby with Bundler - Redis for Sidekiq - Elixir/Erlang, preferably via .tool-versions and mise - SQLite with FTS5 support - Docker, if you use the local stack helper From the repository root: cp .env.example .env bin/run local stack --restart This starts: - Docker Redis as caio-redis - Sidekiq writer/fetch/source workers - Rails Sidekiq UI at http://localhost:3001/sidekiq - Phoenix portal at http://localhost:4000 You can also start pieces independently: bin/run local stack portal bin/run local stack sidekiq-web bin/run local stack workers If Redis is loading a large persisted queue, increase the startup wait: REDIS READY TIMEOUT=900 bin/run local stack --restart Run crawler setup: cd crawler bundle install bin/rails db:migrate bundle exec sidekiq -C config/sidekiq sources.yml Run the portal: cd portal mix setup mix ecto.migrate mix phx.server Open: http://127.0.0.1:4000 In development, the portal reads the crawler database at: crawler/db/development.sqlite3 Use .env.example as the local template. Do not commit real secrets. Common local variables: GITHUB CLIENT ID= GITHUB CLIENT SECRET= GITHUB REDIRECT URI=http://localhost:4000/auth/github/callback POSTHOG ENABLED=false POSTHOG PUBLIC KEY= POSTHOG HOST=https://us.i.posthog.com POSTHOG SESSION REPLAY=true Important production variables: PHX HOST=caio-jobs.com SECRET KEY BASE=... DATABASE PATH=/var/lib/caio/caio.sqlite3 JOB CRAWLER DATABASE=/var/lib/caio/caio.sqlite3 GITHUB REDIRECT URI=https://caio-jobs.com/auth/github/callback Portal: cd portal mix compile mix test mix format mix assets.deploy MIX ENV=prod mix release --overwrite Crawler: cd crawler bundle exec rails db:migrate bundle exec sidekiq -C config/sidekiq fetch.yml bundle exec sidekiq -C config/sidekiq writer.yml bundle exec sidekiq -C config/sidekiq sources.yml Queue inspection: redis-cli LLEN queue:source fetchers redis-cli LLEN queue:linkedin pages redis-cli LLEN queue:job writes redis-cli ZCARD retry redis-cli ZCARD dead The current deployment path is a single Google Cloud VM running: - Phoenix release - Rails/Sidekiq crawler workers - Redis - Caddy - SQLite database on persistent disk See deploy/google-cloud/README.md /danicuki/caio/blob/main/deploy/google-cloud/README.md for the full VM bootstrap, systemd, Caddy, release, and backup workflow. The short deploy loop after pulling changes is: cd /srv/caio/crawler bundle install RAILS ENV=production JOB CRAWLER DATABASE=/var/lib/caio/caio.sqlite3 bundle exec rails db:migrate cd /srv/caio/portal mix deps.get --only prod MIX ENV=prod mix assets.deploy MIX ENV=prod DATABASE PATH=/var/lib/caio/caio.sqlite3 mix ecto.migrate MIX ENV=prod mix release --overwrite sudo systemctl restart caio-portal caio-sidekiq-writer caio-sidekiq-fetch caio-sidekiq-sources Generated data stays out of git: - SQLite databases and WAL/SHM files - Redis dumps - logs - Phoenix build , deps , and compiled assets - generated crawler indexes and large crawl artifacts Commit source code, migrations, small config data, docs, and launch assets. - Never commit OAuth secrets, PostHog keys, production database files, or backups. - Keep user contact collection explicit and transparent. - The analytics wrapper strips sensitive property names such as email, token, and secret before sending server-side events. - Apply clicks are tracked in job interests before redirecting to the original job source. - Add stateful crawler cursors for every paged source so production resumes from known progress instead of reprocessing old pages. - Split crawler import metrics into inserted vs updated counts. - Move from SQLite to Postgres when write volume or operational needs require it. - Add company profile enrichment, including async external reputation data where allowed. - Build the job-agent layer: saved profiles, tailored application material, job matching, and supervised automated application workflows. No license has been added yet. Until a license is present, all rights are reserved by the repository owner.