[ARTICLE · art-19176] src=github.com pub= topic=generative-ai verified=true sentiment=· neutral

Show HN: Stable Audio 3 – one-shot sample generator (110gb download)

sas-sample-generator, part of the Signals & Sorcery family, uses Stability AI's Stable Audio 3 model to generate large batches of drum and pitched-instrument one-shot samples on rented RunPod GPUs. The v3 sample libraries are hosted publicly on Google Cloud Platform without authentication (the post title cites a 110 GB download). A full ~50-category campaign costs under $50 and takes up to about a day to complete.

Read time: 28 min · Published May 31, 2026

The latest v3 libraries are hosted publicly on GCP (no auth required). The Signals & Sorcery app installs these automatically, but you can grab them directly here:

Pack                    Contents                                                   Download
Drums (v3 large)        24 roles · 10,359 one-shots (+ prompt sidecars) · ~1.4 GB
Instruments (v3 large)                                                             sas-instrument-pack-v3-large.zip

Each zip contains a _pack-version.json marker plus the payload tree (drums: <role>/*.wav; instruments: <category>/<id>/manifest.json + zones/). The instrument pack is zones-only: the 24-bit generation sources/ are omitted because they aren't used at playback.

Generate large batches of audio samples with Stable Audio 3 on a rented RunPod GPU. Two pipelines ship side by side:

  • Drums / one-shots (run_all.sh): 24 unpitched categories (kicks, snares, hats, claps, 808s, risers, impacts, textures…). Generate → quality-gate → trim/normalize → flat processed/<role>/ folders.
  • Pitched instruments (run_pitched.sh): 28 tuned categories (pianos, basses, pads, leads, strings…). Generate → pitch/quality-gate → multi-source pitch-correct + pre-render playable zones → instruments/<cat>/<id>/manifest.json.

Both run the same retry-to-target loop (re-roll failed prompts until ~150 samples per category survive the gate) and batched generation on one model load.

Read top to bottom and copy-paste each command block. A single-category test slice takes ~5 minutes and costs a few cents; the full ~50-category v3 campaign takes up to about a day of big-GPU pod time (still well under $50; see Cost).

Assumes an Apple Silicon Mac as the control machine.

For the rationale (why these settings, prompt-design tips, deep cost math), see stable_audio_open_batch_oneshot_guide.md.

Part of the Signals & Sorcery family. See Related repos at the bottom of this README.

Plain-text prompt files, one description per line, live under prompts/ (drums) and prompts/pitched/ (instruments). The repo ships 52 categories, 200 prompts each (~10,400 prompts), pre-generated by scripts/gen_prompts.py in a combinatorial house style (~58% EDM / ~25% hip-hop & urban / ~17% acoustic-orchestral-world). Run them as-is, edit them, or subset which categories generate.

Each non-comment line becomes one generation job. Example from prompts/kick.txt:

tight 909-style kick drum one shot, hard click transient, short punchy body, dry
punchy 909 kick drum one shot, sharp transient, controlled low end, clean studio sample

deep 808 kick one shot, long sub bass decay, smooth sine low end, dry
warm 808 kick one shot, saturated low end, medium decay, dry, no melody, no loop

Blank lines and lines starting with # are ignored (handy for grouping). Aim for ~10 words per line. For drums, always include phrases like one shot, no loop so the model doesn't render a rhythmic loop.
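The parsing rule above (one job per non-blank, non-# line) is easy to reproduce when you want to check a prompt file before a GPU run. This is a minimal sketch, not list_to_jsonl.py itself: the record fields (category, prompt, variants) are a hypothetical schema chosen for illustration.

```python
import json

def prompt_lines_to_jsonl(text: str, category: str, variants: int) -> list[str]:
    """Turn a prompts/<cat>.txt body into JSONL lines (hypothetical record schema)."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and '#' comment lines are skipped, as described above.
        if not line or line.startswith("#"):
            continue
        records.append(json.dumps({"category": category, "prompt": line, "variants": variants}))
    return records

example = """# 909 group
tight 909-style kick drum one shot, hard click transient, short punchy body, dry

# 808 group
deep 808 kick one shot, long sub bass decay, smooth sine low end, dry
"""

for row in prompt_lines_to_jsonl(example, "kick", variants=6):
    print(row)
```

The two comment lines and the blank line produce no jobs, so this example yields exactly two records.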

Editing prompts changes content hashes. Output filenames are content-addressed ({category}-{hash}.wav), so rewording a line orphans the WAV it used to produce. Finalize wording before a GPU run. Re-running with --skip-existing then generates only the new or changed lines.
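The content-addressing idea can be sketched as follows. The hash recipe here (sha256 over "prompt|seed", truncated to 8 hex characters to match names like kick-c1da23da.wav) is an assumption, not batch_generate.py's actual code; what it illustrates is that the filename depends only on the inputs, so a reworded prompt maps to a new file and the old WAV is left orphaned.

```python
import hashlib
from pathlib import Path

def content_addressed_name(category: str, prompt: str, seed: int) -> str:
    # Assumed recipe: sha256 over "prompt|seed", truncated to 8 hex characters.
    digest = hashlib.sha256(f"{prompt}|{seed}".encode("utf-8")).hexdigest()[:8]
    return f"{category}-{digest}.wav"

def jobs_to_run(jobs: list[tuple[str, str, int]], out_dir: Path) -> list[tuple[str, str, int]]:
    """Mimic --skip-existing: keep only (category, prompt, seed) jobs whose output is missing."""
    return [j for j in jobs if not (out_dir / content_addressed_name(*j)).exists()]

a = content_addressed_name("kick", "deep 808 kick one shot, dry", 42)
b = content_addressed_name("kick", "deep 808 kick one shot, dry", 42)
c = content_addressed_name("kick", "deep 808 kick one-shot, dry", 42)
print(a == b, a == c)  # same prompt + seed gives the same name; a reworded prompt gives a new one
```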

Flat processed/<role>/ output; the folder name is the drum role.

Core kit                     Dur         EDM / electronic one-shots      Dur
kick                         1.5s        clap                            0.75s
snare-standard               1.0s        808 (tuned sub one-shot)        2.0s
snare-rim                    0.75s       riser                           4.0s
hat-closed                   0.5s        downlifter                      3.0s
hat-open                     1.5s        impact                          2.0s
cymbal-ride                  2.5s        sub-drop                        2.0s
cymbal-crash                 3.0s        sweep                           2.5s
cymbal-splash                1.5s        texture (vinyl/foley/glitch)    3.0s
tamborine                    1.0s        zap                             0.75s
shaker                       0.75s       foley-perc                      0.75s
tom-hi / tom-mid / tom-low   1.0–1.5s    hit (generic stab)              1.5s

Pitched output goes to instruments/<cat>/<id>/ with a manifest.json. Source pitches, durations, and variant counts are in the Pitched-instrument pipeline table below.

synths  lead-supersaw  lead-fm  lead-acid  pluck-synth  plucks  keys  pianos
organs  basses  808-bass  reese-bass  pads  strings  brass  winds  accordion
bells  mallets  percussion  timpani  guitars  banjos  mandolin  harp  sitar
vocals  choir

Output filenames are content-addressed: {category}-{hash}.wav. Same prompt + seed → same filename → safely re-runnable with --skip-existing.

To subset what generates, edit scripts/categories.txt (drums) or scripts/pitched_categories.txt (pitched); comment out any line to skip that category.

…, accept the license on that model's page too; stable-audio-3-small-sfx is the lighter 0.6B SFX-tuned alternative.
  • Create a read-only access token at huggingface.co/settings/tokens. Save it in your password manager; you'll paste it once per pod.

pbcopy < ~/.ssh/id_ed25519.pub        # copies key to clipboard

(If ~/.ssh/id_ed25519.pub doesn't exist, run ssh-keygen -t ed25519 first and accept the defaults.)

runpod.io/console/pods → Deploy → GPU Pod:

Setting Value
GPU RTX A6000 (48 GB VRAM, ~$0.49/hr)
Template most recent RunPod PyTorch with CUDA 12.x
Container Disk 50 GB (default)
Volume Disk 100 GB at /workspace
Expose SSH (port 22) — default

Click Deploy On-Demand. Wait ~30 sec until status is RUNNING.

On the pod's card click Connect → SSH over exposed TCP and copy the SSH command. It looks like:

ssh root@<POD_IP> -p <POD_PORT> -i ~/.ssh/id_ed25519

From your Mac terminal, paste the SSH command from step 1. Type yes to accept the host key on first connect.

If you get Permission denied (publickey):

ssh-add ~/.ssh/id_ed25519

…then retry.

On the pod:

cd /workspace && \
git clone https://github.com/shiehn/sas-sample-generator.git && \
cd /workspace/sas-sample-generator && \
./scripts/setup.sh 2>&1 | tee /workspace/setup.log

Why these paths matter (and why this used to be slow): /workspace is a network filesystem (MooseFS). It is fine for big sequential reads/writes (model weights, generated audio) but painfully slow for many tiny files (a Python venv). The script therefore installs the venv at /root/.venv, on the pod's container-local SSD, and keeps only the HuggingFace cache and outputs on /workspace. Roughly:

/root/.venv                   ← Python venv          (fast SSD; ~5 min install)
/workspace/sas-sample-generator   ← cloned repo
/workspace/.cache/huggingface ← model weights        (downloaded once)
/workspace/outputs            ← generated WAVs

You're done when you see:

[setup] cuda available: True
[setup] device:         NVIDIA RTX A6000
[setup] done.
[setup] next: source /root/.venv/bin/activate

Then activate the venv and log in to Hugging Face:

source /root/.venv/bin/activate
hf auth login

Paste your HF token (One-Time Setup A). Answer n to "Add token as git credential".

The 24 drum prompt files are already in prompts/<category>.txt. To run them as-is, skip to step 6.

To customize:

  • Edit content: nano prompts/kick.txt (or scp over your own version, or edit on your Mac → git push → git pull on the pod).
  • Subset which categories run: edit scripts/categories.txt and comment out the lines you want to skip. Useful for prompt iteration on a single category.

Wrap the run in tmux first so an SSH drop doesn't kill the job:

tmux new -s sas
./scripts/run_all.sh 2>&1 | tee /workspace/run.log

(Detach with Ctrl-b d; reattach later with tmux attach -t sas.)

The wrapper runs, in order:

  • Build JSONLs: list_to_jsonl.py turns each prompts/<cat>.txt into prompts/<cat>.jsonl, stamping a per-category variants count from scripts/drum_gate_config.py (5–6 candidates per prompt; more for high-value kick/snare/clap/808).
  • Generate + gate, with retry-to-target: run_retry.py drives batch_generate.py (one model load, BATCH_SIZE generations batched per call), then gate_drums.py, which best-of-N selects one winner per prompt and rejects clipped / silent / off-band / multi-hit / wrong-decay samples (per-category profiles in drum_gate_config.py). Any prompt whose candidates all fail is re-rolled with fresh seeds (up to MAX_RETRIES=2); if a category still has fewer than TARGET=150 survivors, more prompt lines are topped up. Winners land in gated_drums/<cat>/.
  • Post-process: postprocess_oneshots.py trims, LUFS-normalizes to -16 LUFS (-1 dBFS peak ceiling), and mono-downmixes the gated winners into processed/<cat>/. Each WAV ships with a sibling <id>.txt holding its exact prompt; the prompt is also embedded in the WAV's RIFF INFO comment chunk, so Logic / Ableton / Audacity / ffprobe / macOS Get Info show it. Merging runs is a single rsync, with no manifest to reconcile.
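The best-of-N and retry-to-target control flow can be sketched in plain Python. The score callable is a hypothetical stand-in for batch_generate.py plus gate_drums.py (it returns a score, or None when the gate rejects a candidate), and generation here is per prompt rather than batched; only the loop structure follows the description above, not the scripts' actual code.

```python
import random
from typing import Callable, Optional

def best_of_n(scores: list[Optional[float]]) -> Optional[int]:
    """Index of the highest-scoring passing candidate, or None if all were rejected."""
    passing = [(s, i) for i, s in enumerate(scores) if s is not None]
    return max(passing)[1] if passing else None

def retry_to_target(
    prompts: list[str],
    spare_prompts: list[str],
    score: Callable[[str, int], Optional[float]],  # hypothetical gate: (prompt, seed) -> score or None
    variants: int = 6,
    max_retries: int = 2,
    target: int = 150,
    seed: int = 0,
) -> dict[str, int]:
    rng = random.Random(seed)
    winners: dict[str, int] = {}  # prompt -> winning seed

    def run(batch: list[str]) -> list[str]:
        failed = []
        for p in batch:
            seeds = [rng.randrange(2**31) for _ in range(variants)]  # fresh seeds every round
            best = best_of_n([score(p, s) for s in seeds])
            if best is None:
                failed.append(p)
            else:
                winners[p] = seeds[best]
        return failed

    failed = run(prompts)
    for _ in range(max_retries):  # re-roll only prompts whose candidates ALL failed
        if not failed:
            break
        failed = run(failed)
    spare = list(spare_prompts)
    while len(winners) < target and spare:  # top up with more prompt lines
        run([spare.pop(0)])
    return winners

# Toy gate: rejects any prompt containing "bad", otherwise scores by seed.
toy_gate = lambda p, s: None if "bad" in p else (s % 100) / 100
result = retry_to_target(["ok1", "ok2", "bad1"], ["ok3", "ok4", "ok5"], toy_gate, target=4)
print(sorted(result))
```

With the toy gate, bad1 fails every round, so two spare lines are pulled in to reach the target of 4 and the third spare is never used.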

First call downloads Stable Audio 3 (~5–8 GB for medium, ~3 min, one-time). SA3 converges in ~8 diffusion steps (vs 120 for SAO 1.0), so generation is fast — with the gate + retry, the bottleneck is now the CPU gate, not generation.

LUFS normalization keeps kicks, hats, and splashes at the same perceived volume in your sampler. Revert to peak-only with --normalize peak, or override the target with --target-lufs -14 (streaming-hot) or -10 (commercial-hot).
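The arithmetic behind a LUFS target with a peak ceiling is simple dB bookkeeping. This sketch assumes the integrated loudness and sample peak were already measured (for example with pyloudnorm) and that the gain is simply reduced when it would push the peak past the -1 dBFS ceiling. postprocess_oneshots.py may handle that conflict differently, for instance with a limiter.

```python
def normalization_gain_db(measured_lufs: float, peak_dbfs: float,
                          target_lufs: float = -16.0, ceiling_dbfs: float = -1.0) -> float:
    """Gain in dB that moves loudness toward the target without the peak exceeding the ceiling."""
    wanted = target_lufs - measured_lufs   # e.g. -22 LUFS -> -16 LUFS needs +6 dB
    headroom = ceiling_dbfs - peak_dbfs    # how far the peak may still rise
    return min(wanted, headroom)           # the ceiling wins when the two conflict

def db_to_linear(db: float) -> float:
    return 10.0 ** (db / 20.0)

# Quiet hat: -22 LUFS integrated, peaks at -9 dBFS, so the full +6 dB fits under the ceiling.
print(normalization_gain_db(-22.0, -9.0))
# Spiky kick: -20 LUFS but already peaking at -3 dBFS, so the gain is capped at +2 dB.
print(normalization_gain_db(-20.0, -3.0))
# A "streaming-hot" target only changes target_lufs.
print(normalization_gain_db(-20.0, -12.0, target_lufs=-14.0))
print(round(db_to_linear(6.0), 3))  # multiply samples by this factor to apply +6 dB
```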

Useful env knobs (all optional):

ONLY="kick clap" LIMIT=10 ./scripts/run_all.sh   # tiny test slice (TARGET auto-0)
GATE=0          ./scripts/run_all.sh             # legacy: keep all raw, no gate
BATCH_SIZE=32   ./scripts/run_all.sh             # bigger batches on an 80GB GPU
TARGET=200 MAX_RETRIES=3 ./scripts/run_all.sh    # chase a higher survivor count
STEPS=4         ./scripts/run_all.sh             # cheaper/faster iteration

Single-category iteration (tuning prompts):

ONLY=kick ./scripts/run_all.sh
ls /workspace/outputs/processed/
find /workspace/outputs/processed -name "*.wav" | wc -l

Every <id>.wav ships with a sibling <id>.txt containing its generation prompt. Spot-check the pairing:

test "$(find /workspace/outputs/processed -name '*.wav' | wc -l)" \
   = "$(find /workspace/outputs/processed -name '*.txt' | wc -l)" \
   && echo "wav/txt pairing OK"
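Matching counts can still hide a problem: one WAV could be missing its sidecar while an unrelated .txt is present. A stricter check, sketched with only the standard library, pairs files by path-minus-extension and lists the orphans in the processed/<cat>/<id>.{wav,txt} tree.

```python
from pathlib import Path

def unpaired(root: Path) -> tuple[list[Path], list[Path]]:
    """Return (wavs without a .txt sidecar, txts without a .wav) under root, recursively."""
    if not root.is_dir():
        return [], []
    # Key each file by its path minus the extension, so <id>.wav pairs with <id>.txt.
    wavs = {p.with_suffix(""): p for p in root.rglob("*.wav")}
    txts = {p.with_suffix(""): p for p in root.rglob("*.txt")}
    missing_txt = sorted(wavs[k] for k in wavs.keys() - txts.keys())
    missing_wav = sorted(txts[k] for k in txts.keys() - wavs.keys())
    return missing_txt, missing_wav

if __name__ == "__main__":
    no_txt, no_wav = unpaired(Path("/workspace/outputs/processed"))
    print(f"{len(no_txt)} wav(s) missing .txt, {len(no_wav)} txt(s) missing .wav")
    for p in no_txt + no_wav:
        print("  ", p)
```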

On the pod:

cd /workspace
tar czf run.tar.gz outputs/processed
ls -lh run.tar.gz

(We use tar rather than zip because the stock RunPod PyTorch image doesn't ship zip. tar is preinstalled everywhere, and it recurses into the per-category subdirs automatically.)

In a second Mac terminal (don't close the SSH session yet — you still need it for step 10):

cd ~/Downloads
scp -P <POD_PORT> root@<POD_IP>:/workspace/run.tar.gz .
tar xzf run.tar.gz
open outputs/processed                 # Finder + QuickLook to audition

<POD_PORT> and <POD_IP> are the same ones from your step-1 SSH command. The unpacked structure is one folder per category:

outputs/processed/
  kick/        kick-c1da23da.wav   kick-e5d95885.wav   ...
  snare-standard/   snare-standard-...wav
  hat-closed/  ...
  ...etc

This is the step you will forget. The pod bills $0.49/hr for as long as it exists, whether you're using it or not.

Idle overnight ≈ $12
Forgotten for a week ≈ $80
Forgotten for a month ≈ $350
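Those figures are the $0.49/hr rate multiplied by idle hours. A quick check, assuming 24-hour days and a 30-day month (so the ≈ $12 "overnight" figure corresponds to roughly a full day of idling):

```python
HOURLY_USD = 0.49  # RTX A6000 on-demand rate quoted in the pod settings above

def idle_cost(hours: float, rate: float = HOURLY_USD) -> float:
    return round(hours * rate, 2)

for label, hours in [("one day", 24), ("one week", 24 * 7), ("one month (30 d)", 24 * 30)]:
    print(f"{label:>18}: ${idle_cost(hours):,.2f}")
```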

In the RunPod console, click your pod's card → Terminate. Confirm.

Termination wipes /workspace. That's fine: you have the tarball (run.tar.gz) on your Mac. Next month, you start fresh from step 1.

sas-sample-generator/
├── README.md                                   ← you are here
├── stable_audio_open_batch_oneshot_guide.md    ← long-form background
├── requirements.txt
├── docs/SAMPLE_GEN_V3_PLAN.md                  ← v3 design rationale & locked decisions
├── prompts/
│   ├── <drum-cat>.txt        (24 files, 200 prompts each)  kick.txt, clap.txt, 808.txt, …
│   └── pitched/<cat>.txt     (28 files, 200 prompts each)  pianos.txt, basses.txt, pads.txt, …
├── scripts/
│   ├── setup.sh                       ← bootstrap a pod (venv + deps + SA3 tools)
│   ├── gen_prompts.py                 ← (re)generate the 200-prompt corpora
│   │
│   ├── run_all.sh                     ← DRUM pipeline driver
│   ├── categories.txt                 ← which drum roles to run
│   ├── category_config.py             ← per-role negatives + durations
│   ├── drum_gate_config.py            ← per-role gate profiles + variant counts
│   ├── gate_drums.py                  ← drum quality gate + best-of-N
│   ├── list_to_jsonl.py               ← .txt → .jsonl (drums)
│   ├── postprocess_oneshots.py        ← trim / LUFS / mono / tag
│   │
│   ├── run_pitched.sh                 ← PITCHED pipeline driver
│   ├── pitched_categories.txt         ← which instruments to run
│   ├── pitched_category_config.py     ← target pitches, durations, zones, variants
│   ├── gate_pitched.py                ← pitch + quality gate + best-of-N
│   ├── list_to_jsonl_pitched.py       ← .txt → .jsonl (multi-source fan-out)
│   ├── enrich_pitched.py              ← pitch-correct + multi-zone render + manifest
│   ├── pitch_report.py                ← measured-vs-target pitch-accuracy report
│   ├── repair_instrument_pitch.py     ← post-hoc single-source pitch fix
│   │
│   ├── run_retry.py                   ← generate+gate+retry-to-target (both pipelines)
│   ├── batch_generate.py              ← batched SA3 inference
│   ├── build_pack.py                  ← deterministic versioned pack zips
│   ├── README-PACKS.md                ← pack build + publish runbook
│   └── benchmark.py / sync.sh         ← optional helpers
├── tests/                             ← CPU-only suite (see "Tests" below)
└── outputs/                           ← gitignored; all generated audio lands here
    ├── raw/<cat>/<id>_vNN.wav         ← SA3 candidates (N variants/prompt)
    ├── raw/<cat>/_metadata/<id>_vNN.json  ← seed, model, gen params (stays on pod)
    ├── gated_drums/<cat>/<id>.wav     ← drum gate winners
    ├── processed/<cat>/<id>.{wav,txt} ← final drums (trimmed/LUFS/mono; ICMT = prompt)
    ├── gated/<cat>/<id>.{wav,gate.json}   ← pitched gate winners + scores
    ├── _reports/pitch_summary.{json,md}   ← pitch-accuracy report (report stage)
    └── instruments/<cat>/<id>/        ← final instruments (sources/, zones/, manifest.json)

Sibling pipeline to the drum one. Same Stable Audio 3 generator, different downstream: a 5-stage quality+pitch gate (prefilter, onset, sustain-plateau, CREPE/spectral pitch with a sub-bass octave cross-check, BasicPitch polyphony when the TF/numpy ABI is happy), multi-source pitch correction, and zone pre-rendering (RubberBand R3, formant-preserving). Emits a per-instrument manifest.json consumed by sas-instrument-plugin.

Multi-source real-pitch sampling (the v3 headline). Wide-range instruments are generated at 2–4 real source pitches spanning their natural register (e.g. basses at E1/E2/E3). Enrich assigns each playable zone to its nearest real source, so no zone is pitch-shifted more than ~half the inter-source gap: small, artifact-free shifts instead of stretching one root ±12 semitones. Toggle with SAS_MULTI_SOURCE=1 (default; 0 reproduces v1 single-source for an A/B).
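A sketch of nearest-source zone assignment, assuming MIDI numbers with C4 = 60 (so E1/E2/E3 = 28/40/52) and a hypothetical zone grid every 2 semitones across the sources' span; ties go to the lower source. It shows why no zone inside the span moves more than half the inter-source gap, compared with a single root.

```python
def assign_zones(sources: list[int], zone_roots: list[int]) -> dict[int, tuple[int, int]]:
    """Map each zone root (MIDI) to (nearest source MIDI, shift in semitones)."""
    srcs = sorted(sources)
    out = {}
    for z in zone_roots:
        # Minimize (distance, source): equal distances resolve to the lower source pitch.
        nearest = min(srcs, key=lambda s: (abs(z - s), s))
        out[z] = (nearest, z - nearest)
    return out

basses = [28, 40, 52]              # E1, E2, E3 with C4 = MIDI 60
zones = list(range(28, 53, 2))     # hypothetical 2-semitone zone grid, 28..52
plan = assign_zones(basses, zones)
worst = max(abs(shift) for _, shift in plan.values())
single = assign_zones([40], zones)  # v1-style single root at E2
print(worst, max(abs(shift) for _, shift in single.values()))  # half the 12-st gap vs a full octave
print(plan[34], plan[46])
```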

28 categories ship, 200 prompts each (5,600 prompts):

Category        Source pitch(es)   Dur     Var   Notes
synths          C3                 5.0s    6     analog mono, FM, wavetable, acid
lead-supersaw   C4·C5              5.0s    6     multi-source
lead-fm         C4                 5.0s    6
lead-acid       C3                 4.0s    6     303-style
pluck-synth     C4                 3.0s    6
plucks          C4                 3.0s    6
keys            C3                 5.0s    6     Rhodes, Wurli, clav, DX7
pianos          C2·C3·C4·C5        5.0s    6     multi-source, step 2
organs          C3                 8.0s    8     open-ended
basses          E1·E2·E3           6.0s    8     multi-source, 30 Hz floor
808-bass        C2·C3              6.0s    8     multi-source, 25 Hz floor
reese-bass      E2·E3              6.0s    8     multi-source
pads            C3                 12.0s   8     open-ended, step 2
strings         A2·A3·A4           8.0s    8     multi-source, open-ended, step 2
brass           A2·A3·A4           6.0s    8     multi-source, open-ended
winds           D3·D4·D5           5.0s    8     multi-source, open-ended, step 2
accordion       F#3                6.0s    8     open-ended
bells           C5                 4.0s    6     glockenspiel, FM, music box
mallets         C4                 3.0s    6     marimba, vibes, kalimba
percussion      C4                 2.0s    6     tonal/tuned only
timpani         F2·F3              4.0s    8     multi-source, tuned drum
guitars         E2·E3·E4           4.0s    6     multi-source
banjos          C4·G4              3.0s    6     multi-source
mandolin        A4                 3.0s    6
harp            C3·C4·C5           4.0s    6     multi-source
sitar           C4                 4.0s    6
vocals          A3                 5.0s    20    choir, chops, vocoded (SA3 vocals are weak)
choir           A3                 10.0s   18    open-ended

"Var" = SA3 candidates per (prompt × source pitch); the gate keeps the best. Pitches use the C4 = MIDI 60 convention; multi-source counts apply with SAS_MULTI_SOURCE=1 (default). Full config in scripts/pitched_category_config.py.

To subset which categories run, edit scripts/pitched_categories.txt and comment out any line with # to skip that category.

scripts/setup.sh installs everything the pitched pipeline needs:

  • All Python deps from requirements.txt (librosa, torchcrepe, basic-pitch, pyloudnorm, soxr, …)
  • All system packages: rsync (transfer to Mac), tmux (long sessions survive disconnect), rubberband-cli (enrich shells out to it directly for formant-preserving pitch shift), ffmpeg (audio inspection), zip/unzip (archives)
  • The compiled stable-audio-tools from git main (the PyPI release doesn't support SA3-medium)
  • CUDA-12.8 torch wheels (Blackwell-compatible; also work on older Hopper/Ampere/Ada GPUs)

You should never need to apt-get or pip install anything on a pod after running setup.sh. If you do, treat it as a bug in setup.sh and add it there.

For the local enrich stage, set up your Mac once:

brew install rubberband                          # pitch-shift backend
cd ~/path/to/sas-sample-generator                # your local repo
pip install -r requirements.txt                  # one-time, ~5 min

Designed to be safely repeatable from a cold start. Rough timings for a small run: ~15 min setup + ~30 min generate + ~10 min gate + ~30 min enrich + transfer (a full 28-category campaign takes much longer; see Throughput below).

runpod.io/console/pods → Deploy → GPU Pod:

Setting          Value                                          Why
GPU              RTX A6000 / 4090 / 5090 / L40S / A100          SA3-medium fits in 16 GB; 24 GB gives headroom
                 (24+ GB VRAM)
Template         most recent RunPod PyTorch with CUDA 12.x      matches our cu128 wheels
Container Disk   100 GB                                         persistent across pod restart; holds venv + HF model cache + outputs
Network Volume   None                                           RunPod's "migrate to new host" flow has been known to attach a tiny 10 GB network volume; don't let it. We use container disk only.
Expose           SSH (port 22, default)

Critical: the field is named "Container Disk" — the persistent SSD. Do NOT confuse with "Network Volume" or "Volume Disk".

Click Deploy On-Demand. Wait ~30 sec for status RUNNING.

Copy the SSH command from Connect → SSH over exposed TCP. It looks like:

ssh root@<POD_IP> -p <POD_PORT> -i ~/.ssh/id_ed25519

Type yes on first connect. If you get Permission denied, run ssh-add ~/.ssh/id_ed25519 and retry.

cd /workspace && \
git clone https://github.com/shiehn/sas-sample-generator.git && \
cd /workspace/sas-sample-generator && \
./scripts/setup.sh 2>&1 | tee /root/setup.log

Look for these "OK" markers near the end of setup.log:

[setup]   rsync:          rsync version 3.x.x ...
[setup]   tmux:           tmux 3.x
[setup]   rubberband:     /usr/bin/rubberband
[setup]   ffmpeg:         ffmpeg version ...
[setup] cuda available: True
[setup] device:         NVIDIA RTX 4090
[setup] done.

If cuda available: False, you deployed onto a CPU template; terminate and redeploy with a PyTorch GPU template.

source /root/.venv/bin/activate
hf auth login

Paste your HF read token. Answer n to "Add token as git credential".

First time on a new HF account: in your browser, visit stabilityai/stable-audio-3-medium and accept BOTH the SA3 community license AND the underlying Gemma terms. Without both, the model download fails with GatedRepoError. The token acts as your account, so accept while logged into the same HF account your token belongs to.

Verify access (under 5 seconds):

hf download stabilityai/stable-audio-3-medium model_config.json --local-dir /tmp/sa3-test
ls /tmp/sa3-test/

If model_config.json is listed, access is cleared.

The default ships with all 28 categories enabled. For a quick test or a focused run:

nano scripts/pitched_categories.txt

After the test, restore with git checkout scripts/pitched_categories.txt.

Run generate + gate + report on the GPU pod; enrich runs later on your Mac (it's CPU-bound).

tmux new -s pitched

cd /workspace/sas-sample-generator
source /root/.venv/bin/activate
source /workspace/.bash_env

STAGES=generate,gate,report ./scripts/run_pitched.sh 2>&1 | tee outputs/run.log

run_pitched.sh builds the JSONLs (multi-source fan-out: one job per prompt × source pitch, with the per-category variants count), then run_retry.py drives batched generation + gate_pitched.py with the same retry-to-target loop as drums (re-roll all-fail prompts up to MAX_RETRIES=2, top up until TARGET=150 instruments survive). The report stage then writes outputs/_reports/pitch_summary.{json,md} so you can read measured-vs-target pitch accuracy before transferring.
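The multi-source fan-out is a cross product: one job per (prompt × source pitch), each asking for the category's variant count. A sketch with hypothetical record fields (prompt, target_midi, variants) and placeholder prompt text, using the basses row from the table (E1·E2·E3 = MIDI 28/40/52, 8 variants) to show how quickly candidates add up:

```python
import json
from itertools import product

def fan_out(prompts: list[str], source_midis: list[int], variants: int) -> list[dict]:
    """One generation job per (prompt, source pitch); field names are hypothetical."""
    return [
        {"prompt": p, "target_midi": m, "variants": variants}
        for p, m in product(prompts, source_midis)
    ]

prompts = [f"basses prompt {i}" for i in range(200)]   # placeholder text; 200 prompts per category
jobs = fan_out(prompts, [28, 40, 52], variants=8)      # E1, E2, E3
candidates = sum(j["variants"] for j in jobs)
print(len(jobs), candidates)                            # jobs vs SA3 candidates for one category
print(json.dumps(jobs[0]))
```

One multi-source category alone produces 4,800 candidates, which is why the CPU gate, not generation, dominates wall-clock.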

Detach with Ctrl-b d. The run keeps going even if SSH drops.

Reattach later (from any new SSH session — possibly a new IP/port if migrated):

tmux attach -t pitched

Monitor from outside tmux:

tail -f outputs/run.log
nvidia-smi

Throughput: batched generation is ~1–3 h; the CPU gate is the bottleneck (torchcrepe + basic-pitch + librosa per variant). For all 28 categories at full variant counts, budget up to about a day of pod time. If wall-clock matters, dial variant counts down in pitched_category_config.py or run a subset.

Useful knobs: ONLY=pianos LIMIT=5 (tiny slice), BATCH_SIZE=32 (80 GB GPU), MAX_RETRIES=0 (one pass, no retry), INIT_ANCHOR=1 (experimental init_audio pitch anchoring; default off), SAS_MULTI_SOURCE=0 (single-source A/B).

When STAGES=generate,gate finishes, check pass rates before transferring:

for d in outputs/gated/*/; do
  cat=$(basename "$d")
  [[ "$cat" == "_failures" ]] && continue
  passed=$(ls "$d"*.wav 2>/dev/null | wc -l)
  failed=$(ls "$d/_failures"/*.json 2>/dev/null | wc -l)
  total=$((passed + failed))
  if [[ $total -gt 0 ]]; then
    rate=$((passed * 100 / total))
    printf "  %-18s passed=%3d  failed=%3d  pass-rate=%d%%\n" "$cat" "$passed" "$failed" "$rate"
  fi
done

echo "Total gated: $(find outputs/gated -name '*.wav' -not -path '*_failures*' | wc -l)"
du -sh outputs/gated

Expected (with the current thresholds, 2026-05-22): 80–100% pass rate per category. If a category is below 50%, look in outputs/gated/<cat>/_failures/<id>.json to see why prompts are failing.

The pod has rsync installed by setup.sh. On your Mac:

mkdir -p ~/sas-pitched-out
rsync -avzP -e "ssh -p <POD_PORT> -i ~/.ssh/id_ed25519" \
  root@<POD_IP>:/workspace/sas-sample-generator/outputs/gated/ \
  ~/sas-pitched-out/gated/

For ~4 GB at typical RunPod / home upload speeds, expect 10–20 min. rsync resumes on interruption; just re-run the same command if SSH drops.

Verify locally:

find ~/sas-pitched-out/gated -name '*.wav' -not -path '*_failures*' | wc -l   # should match step 7
du -sh ~/sas-pitched-out/gated

Then run enrich locally on your Mac:

cd ~/path/to/sas-sample-generator
git pull                                # pick up any threshold updates
pip install -r requirements.txt         # idempotent

export SAS_OUTPUTS_DIR=~/sas-pitched-out
STAGES=enrich ./scripts/run_pitched.sh

Enrich groups the surviving source pitches of each prompt into one multi-source instrument under ~/sas-pitched-out/instruments/<cat>/<id>/:

  • sources/<midi>.wav: 24-bit real source samples (one per source pitch: 2–4 for multi-source categories), pitch-corrected and normalized to -20 LUFS
  • zones/<midi>.wav: 16-bit WAV pre-rendered playable zones (every 2–3 semitones), each rendered from its nearest real source (24-bit FLAC before v3; WAV is memory-mapped by the Tracktion sampler with no decode stall)
  • manifest.json: schema_version: 1, disjoint ordered zones
  • prompt.txt: original positive prompt

It parallelizes across instruments (ProcessPoolExecutor) and shells out to the rubberband CLI for pitch shifts (brew install rubberband on the Mac).

runpod.io/console/pods → pod card → Terminate (NOT Stop). Compute billing stops immediately. Volume billing (if any auto-created Network Volume snuck in) stops only on Terminate.

Then go to runpod.io/console/user/storage → Network Volumes → check for any outside_* orphan from a migration → Delete.

RunPod sometimes moves your pod to a different physical host mid-run. Symptoms:

  • SSH connection drops mid-session
  • Connection refused when reconnecting on the same IP/port
  • Pod shows "Stopped" briefly, then "Running" again at a new address

The pod, the venv, the HF cache, and all outputs/ data persist on the container disk as long as the pod isn't terminated. You just need fresh connection info.

  • Open the RunPod console → click your pod card → check the Connect → SSH over exposed TCP panel for the new IP and port (both can change).
  • Clear the old SSH host key on your Mac:
ssh-keygen -R '[<NEW_IP>]:<NEW_PORT>'
  • SSH back in with the new details and run tmux attach -t pitched; your run is still going.
  • If you were mid-rsync, just re-run the rsync command with the new -p <NEW_PORT> and root@<NEW_IP>; it picks up where it stopped.

This bit us twice this session (May 2026). Symptoms are unambiguous; recovery takes 30 seconds.

SA3 doesn't reliably hit a target pitch from a text prompt — that's a known limitation of text-to-audio diffusion models. Enrich now compensates intelligently:

If measured pitch is…                                   Enrich does…                                Result
within max_correction_semitones of target (default 3)   shifts all the way to the original target   Sample is at exactly the prompted MIDI note; preserves prompt semantics
further away than that                                  snaps to the nearest integer semitone       Sample is at the closest "logical" MIDI note (always ≤50 cent shift, no audible artifacts)

Either way: every output sample lands on an exact MIDI semitone with the smallest possible pitch shift. The zone rendering loop centers on that effective root, so the sampler always has a clean zone at the sample's actual pitch.

max_correction_semitones is per-category in scripts/pitched_category_config.py. Set it to 0 to always snap to the nearest semitone (never shift to target). Set it to a large value (24+) to always shift to target.
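The decision rule in the table fits in a few lines. This sketch assumes pitch is measured as fractional MIDI (for example 58.6) and returns the effective root plus the shift in semitones. It mirrors the table's logic, not enrich_pitched.py's actual code.

```python
def correction(measured_midi: float, target_midi: int, max_correction_semitones: float = 3.0):
    """Return (effective_root_midi, shift_semitones) per the smart-correction rule."""
    if abs(measured_midi - target_midi) <= max_correction_semitones:
        root = target_midi          # close enough: honor the prompted note
    else:
        root = round(measured_midi)  # far off: snap to the nearest semitone (|shift| <= 0.5;
                                     # exact .5 ties follow Python's round-half-to-even)
    return root, round(root - measured_midi, 3)

print(correction(58.6, 60))                               # within 3 st: shifted up to 60
print(correction(52.2, 60))                               # 7.8 st away: snapped to 52
print(correction(61.4, 60, max_correction_semitones=0))   # 0: always snap, here to 61
print(correction(52.2, 60, max_correction_semitones=24))  # 24+: always shift to target
```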

Stage       What it checks                                What rejection means
prefilter   Clipping, dead channels, all-silent buffers   Sample is broken at the file level
onset       Time from buffer start to first transient     slow_onset → SA3 added a fade-in / silence preamble (>300 ms)
sustain     Longest plateau within 12 dB of peak RMS      short_stab → audio decays too fast or has no held region
pitch       CREPE periodicity + measured-vs-target        no_voiced_frames / unconfident → unpitched output; wrong_pitch is OFF by default (tolerance 9999) so enrich's snap-to-nearest-semitone can do its job
polyphony   BasicPitch note count after vibrato bypass    Disabled when the TF/numpy ABI mismatches (common on RunPod); the gate prints one warning at start, then runs without it
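Two of those checks, onset and sustain, reduce to simple scans over an RMS envelope. The sketch below assumes a precomputed envelope in dB at a 10 ms hop and approximates "first transient" as the first frame within 30 dB of the peak; both numbers are assumptions. Only the 300 ms onset limit and the 12 dB plateau window come from the table, and gate_pitched.py's real detectors are likely more involved.

```python
def onset_ms(env_db: list[float], hop_ms: float = 10.0, rel_threshold_db: float = -30.0) -> float:
    """Time until the envelope first rises to within |rel_threshold_db| of its peak."""
    peak = max(env_db)
    for i, v in enumerate(env_db):
        if v >= peak + rel_threshold_db:
            return i * hop_ms
    return len(env_db) * hop_ms  # only reachable if rel_threshold_db > 0

def longest_plateau_ms(env_db: list[float], hop_ms: float = 10.0, window_db: float = 12.0) -> float:
    """Longest contiguous run of frames within window_db of the peak."""
    peak = max(env_db)
    best = run = 0
    for v in env_db:
        run = run + 1 if v >= peak - window_db else 0
        best = max(best, run)
    return best * hop_ms

# Hypothetical pad: 400 ms of near-silence, then a held note decaying slowly from -6 dB.
pad = [-80.0] * 40 + [-6.0 - 0.05 * i for i in range(200)]
print(onset_ms(pad), onset_ms(pad) > 300)   # exceeds the 300 ms limit -> slow_onset
print(longest_plateau_ms(pad))              # frames within 12 dB of the -6 dB peak
```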

The gate scores winners by confidence² × exp(-|cents|/50) × sus_quality. With wrong_pitch disabled, the pitch term collapses to ~0 for far-off samples, so all variants of a prompt can tie at score=0.000; in that case the picker just grabs v00. Acceptable for now.
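A sketch of that scoring term, assuming confidence and sus_quality lie in [0, 1] and cents is the measured deviation from target. The first-index tie-break in pick mirrors the described "grab v00" behavior; the real picker's implementation isn't shown in this README.

```python
import math

def gate_score(confidence: float, cents: float, sus_quality: float) -> float:
    """confidence^2 * exp(-|cents|/50) * sus_quality, as described above."""
    return confidence ** 2 * math.exp(-abs(cents) / 50.0) * sus_quality

def pick(scores: list[float]) -> int:
    """Best-of-N; max() keeps the first maximum, so an all-zero tie returns v00."""
    return max(range(len(scores)), key=lambda i: scores[i])

near = gate_score(0.9, 10, 0.8)  # 0.81 * e^-0.2 * 0.8, roughly 0.53
far = [round(gate_score(0.95, c, 0.9), 3) for c in (500, 700, 900)]  # 5-9 semitones off
print(round(near, 3), far, pick(far))
```

At 500 cents the exponential term is e^-10, about 4.5e-5, so every far-off variant rounds to 0.000 and the tie falls back to index 0.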

outputs/
├── raw/<category>/                                  ← SA3 candidates (N variants per prompt × source pitch)
│   ├── <id>_v00.wav, <id>_v01.wav, ...
│   └── _metadata/<id>_v0N.json                      ← seed, model, generation_seconds, batch_size
├── gated/<category>/                                ← gate winners only
│   ├── <id>.wav                                     ← chosen variant
│   ├── <id>.gate.json                               ← per-gate scores + measured pitch
│   └── _failures/<id>.json                          ← prompts where ALL variants rejected
├── _reports/pitch_summary.{json,md}                 ← measured-vs-target accuracy (report stage)
└── instruments/<category>/<instrument-id>/          ← final library, sampler-consumable
    ├── sources/<midi>.wav                           ← 24-bit real source pitches (1–4)
    ├── zones/<midi>.wav                             ← 16-bit WAV pre-rendered zones
    ├── manifest.json                                ← schema_version 1, disjoint ordered zones
    └── prompt.txt                                   ← original positive prompt

The v3 full run is much bigger than a quick slice: 28 categories × 200 prompts × 1–4 source pitches × 6–20 variants = tens of thousands of candidates. Generation is cheap and batched; the CPU gate dominates wall-clock (per-variant CREPE + basic-pitch + librosa). On an A100 80 GB (~$0.89–1.89/hr) the whole pitched campaign runs in roughly half a day to a day → ~$15–35, with drums adding a few dollars. The original $200 budget is never the binding constraint.

To cut cost/time: run fewer categories (ONLY=…), lower variant counts in pitched_category_config.py, or set MAX_RETRIES=0. A single-category slice (ONLY=pianos LIMIT=5) takes a few minutes and costs a few cents; use it to dial in prompts before committing to the full campaign. Enrich runs locally on your Mac → $0.

prompts/pitched/<category>.txt                # one prompt per line, # comments
scripts/pitched_categories.txt                # which categories to run (comment to skip)
scripts/pitched_category_config.py            # per-category target pitch, duration, sustain thresholds, etc.

Fast iteration on a single category:

ONLY=<cat> LIMIT=10 STAGES=generate,gate,report ./scripts/run_pitched.sh

Re-author / extend the corpora with scripts/gen_prompts.py (deterministic, preserves existing lines so hashes stay stable, holds the EDM ratio).

--skip-existing in batch_generate.py means re-running won't regenerate samples you already have; only new prompt lines hit the GPU.

  • STAGES=generate,gate,enrich,report: pitched only; comma-separated subset (default: all four)
  • STEPS=8: diffusion steps (SA3 converges fast)
  • BATCH_SIZE=16: generations per model call (32–64 on an 80 GB GPU)
  • TARGET=150: per-category minimum surviving samples (0 = no top-up)
  • MAX_RETRIES=2: re-roll rounds for all-fail prompts before topping up (0 = one pass)
  • ONLY="a b" / LIMIT=N: run a subset of categories / cap prompts per category (test slice; auto-sets TARGET=0)
  • GATE=0: drums only; skip the quality gate (legacy keep-all)
  • INIT_ANCHOR=1: pitched only; experimental init_audio pitch anchoring (default off)
  • SAS_MULTI_SOURCE=0: pitched only; disable multi-source (single root, span 12; for an A/B)
  • SAS_OUTPUTS_DIR=/path: override the outputs location (default /workspace/outputs on the pod, ./outputs locally)

Per-prompt variant counts are now per-category (in drum_gate_config.py / pitched_category_config.py), not a global env var.

The sas-instrument-plugin walks outputs/instruments/<cat>/<id>/, parses each manifest.json, and uses the zones[] array to call host.setTrackInstrumentSampler on the chosen track. Disjoint zones + per-zone root_midi mean the engine pitch-shifts the nearest pre-rendered zone for any played MIDI note, with the smart-corrected sample as the unshifted root. Since enrich locks every sample to an integer MIDI semitone, the sampler never has to deal with off-pitch sources.
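To make the zone lookup concrete, here is a sketch of how a consumer could validate a manifest and pick a zone for a played note. The README only specifies schema_version 1, a zones[] array, per-zone root_midi, and disjoint ordered zones; the lo_midi, hi_midi, and file fields below are hypothetical, and this is not the plugin's code.

```python
import json

manifest = json.loads("""
{
  "schema_version": 1,
  "zones": [
    {"lo_midi": 27, "hi_midi": 29, "root_midi": 28, "file": "zones/28.wav"},
    {"lo_midi": 30, "hi_midi": 31, "root_midi": 30, "file": "zones/30.wav"},
    {"lo_midi": 32, "hi_midi": 33, "root_midi": 32, "file": "zones/32.wav"}
  ]
}
""")

def validate(zones: list[dict]) -> None:
    """Raise if zones are not ordered and disjoint (each must start after the previous ends)."""
    for prev, cur in zip(zones, zones[1:]):
        if cur["lo_midi"] <= prev["hi_midi"]:
            raise ValueError(f"overlapping or unordered zone at root {cur['root_midi']}")

def zone_for(note: int, zones: list[dict]) -> tuple[str, int]:
    """Return (zone file, semitone shift from that zone's root) for a played MIDI note."""
    for z in zones:
        if z["lo_midi"] <= note <= z["hi_midi"]:
            return z["file"], note - z["root_midi"]
    raise KeyError(f"note {note} is outside all zones")

validate(manifest["zones"])
print(zone_for(31, manifest["zones"]))  # plays zones/30.wav shifted up one semitone
```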

Generated audio ships to the Signals & Sorcery app as versioned pack zips built by scripts/build_pack.py — deterministic (fixed mtimes + sorted entries → byte-identical zip + sha256 from an identical source tree):

Pack          Source dir             Zip                             Approx size (v3)
Drums         outputs/processed/     sas-drum-pack-v{N}.zip          ~2–3 GB (24 roles)
Instruments   outputs/instruments/   sas-instrument-pack-v{N}.zip    ~20–24 GB (28 cats × ~150)
Loops         outputs/loops/         sas-loop-library-v{N}.zip       external loop library (not generated here)
python scripts/build_pack.py --smoke-test                   # cheap determinism check
python scripts/build_pack.py --pack drums --version 1       # real build → ./dist/
python scripts/build_pack.py --pack instruments --version 1
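The determinism recipe (fixed mtimes + sorted entries) can be sketched with the standard library's zipfile. The fixed timestamp, compression level, and file mode below are assumptions, not necessarily what build_pack.py uses. The idea is that nothing time- or host-dependent leaks into the archive, so the sha256 is reproducible; note that deflate output can still differ across zlib versions.

```python
import hashlib
import io
import tempfile
import zipfile
from pathlib import Path

FIXED_DATE = (1980, 1, 1, 0, 0, 0)  # earliest timestamp the ZIP format can store

def deterministic_zip(src: Path) -> bytes:
    """Zip every file under src in sorted order with fixed per-entry metadata."""
    files = sorted((p for p in src.rglob("*") if p.is_file()),
                   key=lambda p: p.relative_to(src).as_posix())
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for path in files:
            info = zipfile.ZipInfo(path.relative_to(src).as_posix(), date_time=FIXED_DATE)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.create_system = 3             # pin the "made by" OS field (Unix)
            info.external_attr = 0o644 << 16   # pin file permissions
            zf.writestr(info, path.read_bytes(), compresslevel=6)
    return buf.getvalue()

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "kick").mkdir()
        (root / "kick" / "kick-0001.wav").write_bytes(b"not really audio")
        (root / "_pack-version.json").write_text('{"version": "3"}')  # hypothetical marker contents
        first, second = deterministic_zip(root), deterministic_zip(root)
        print(sha256_hex(first) == sha256_hex(second))
```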

Ready-to-consume directories (no zip / download step). To produce the two libraries as folders you can drop straight into the app's install location (<userData>/sample-packs/{drums,instruments}/), use --format dir or the build_libraries.sh wrapper. Each emits dist/<subdir>/ with the _pack-version.json marker at its root: exactly the tree the app expects on disk, ready out of the box.

DRUM_VERSION=3 INSTRUMENT_VERSION=3 ./scripts/build_libraries.sh   # → dist/drums/, dist/instruments/
rsync -a dist/drums/        ~/Library/Application\ Support/signals-and-sorcery/sample-packs/drums/
rsync -a dist/instruments/  ~/Library/Application\ Support/signals-and-sorcery/sample-packs/instruments/

The marker version must equal that pack's expectedVersion in sas-app/src/shared/constants/sample-packs.ts (plain string match), or the app treats the folder as a different version. For GCP distribution, use --format zip (the default) and follow the publish runbook below.
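Because the match is a plain string comparison, a marker reading "3.0" will not satisfy an expectedVersion of "3". The minimal Python sketch below pre-checks a built folder. It assumes the marker is a JSON object with a hypothetical "version" key; the real schema of _pack-version.json is not shown here.

```python
import json
from pathlib import Path


def marker_matches(pack_dir: Path, expected_version: str) -> bool:
    """True only if pack_dir/_pack-version.json carries exactly expected_version."""
    marker = pack_dir / "_pack-version.json"
    if not marker.is_file():
        return False  # no marker: the folder is not a recognizable pack
    data = json.loads(marker.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        return False
    version = data.get("version")  # hypothetical key name
    # Plain string equality, so "3.0" != "3" and a numeric 3 does not count.
    return isinstance(version, str) and version == expected_version
```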

The build prints sizeBytes + sha256; paste those into sas-app/src/shared/constants/sample-packs.ts (and bump expectedVersion and the download URL) so the app detects the new version and prompts a re-download. Never overwrite a published version; always bump. Full publish runbook (GCP upload, version rules, what's included/excluded, the v3 WAV-zone note): scripts/README-PACKS.md.
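Instrument packs run to tens of gigabytes, so it helps to hash them in chunks instead of reading the whole file into memory. The minimal standard-library sketch below recomputes the two values the build prints, for example to double-check a zip before pasting them into sample-packs.ts. The function name and the 1 MiB chunk size are illustrative.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def pack_fingerprint(zip_path: Path, chunk_size: int = 1 << 20) -> Tuple[int, str]:
    """Return (sizeBytes, sha256 hex digest) for a pack zip, streaming chunk by chunk."""
    digest = hashlib.sha256()
    size = 0
    with zip_path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()
```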

CPU-only suite: no GPU, no Stable Audio. It covers everything around model generation: pitch detection, both gates, multi-source enrich, retry-to-target helpers, config/prompt/enable wiring, loudness targets, list_to_jsonl, the pitch report, and the deterministic pack builder.

./tests/run_tests.sh

Keep it green: run it before and after any pipeline-code change. Deps live in the project .venv (numpy, soundfile, pyloudnorm, librosa); the rubberband CLI is optional (the enrich test degrades gracefully without it).
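One common way for a test to degrade gracefully when an optional CLI is missing is to skip just the CLI-dependent case. The minimal sketch below uses the standard unittest module. The test class and both test bodies are hypothetical placeholders, and this is not necessarily how the repo's suite handles rubberband.

```python
import os
import shutil
import unittest

RUBBERBAND = shutil.which("rubberband")  # None when the optional CLI is not on PATH


class EnrichTest(unittest.TestCase):
    def test_core_math_without_cli(self):
        # Hypothetical CLI-free check that always runs: one octave is a 2x ratio.
        self.assertAlmostEqual(2 ** (12 / 12), 2.0)

    @unittest.skipUnless(RUBBERBAND, "rubberband CLI not installed")
    def test_cli_path(self):
        # Hypothetical placeholder, only reached when the binary exists.
        self.assertTrue(os.access(RUBBERBAND, os.X_OK))
```

A skipped test still counts as a pass for the suite, so the run stays green on machines without the CLI.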

| Symptom | Most likely cause | Fix |
|---|---|---|
| Permission denied (publickey) on ssh | Private key not loaded into agent | ssh-add ~/.ssh/id_ed25519 |
| setup.sh hangs at "Installing collected packages:" for >5 min | Something redirected the venv onto /workspace (MooseFS); the script defaults to /root/.venv for a reason | Check echo $VENV_DIR; it should be /root/.venv. If overridden, unset it and re-run |
| cuda available: False after setup.sh | Picked a CPU template | Terminate; re-deploy with the PyTorch GPU template |
| huggingface_hub.utils._errors.GatedRepoError | Didn't accept the SA3 license (or the Gemma terms it inherits) | Visit the … |
| batch_generate.py errors CUDA out of memory | … | … --default-duration or --num-waveforms-per-prompt 1 |
| … | … | Add "one shot, no loop, no hi hats, no snare" to every prompt |
| … | … | Add "dry, no reverb, no ambience" to prompts |
| … | negative_prompt may be excluding the target; bug in scripts/category_config.py | In scripts/category_config.py, audit the negative for that category; nothing in it should match the target sound |
| run_all.sh skips a category | prompts/<cat>.txt is missing or has only comments | Check that prompts/ has the .txt file and that it contains non-comment lines |
| … rejected/ with reason=lufs_unmeasurable | … | … |
| … | … | … pyloudnorm and check they cluster near --target-lufs. The pipeline log at /workspace/run.log (from step 6) records the as-run postprocess flags for each category. |
| … | … | tmux new -s sas BEFORE running; reattach with tmux attach -t sas |

Stable Audio 3 needs only ~8 diffusion steps (vs 120 for SAO 1.0), so generation is fast and batched. Since v3 added the quality gate and retry-to-target, the CPU gate, not generation, sets the wall-clock time. The pipeline keeps re-rolling until each category reaches the per-category minimum-survivor target (TARGET=150).

| Run shape | Where the time goes | Rough cost |
|---|---|---|
| Single-category slice (ONLY=… LIMIT=…) | a few minutes, gate-bound | a few cents |
| Full drum run (24 roles, gate + retry) | ~1 h on a big GPU | ~$1–3 |
| Full pitched run (28 cats, multi-source, gate + retry) | ½–1 day, gate-bound | ~$15–35 |

Keep a pod alive between iterations. --skip-existing skips anything already generated, so re-running after a prompt tweak only pays for the new lines. Terminate as soon as generate + gate finish (enrich runs locally and costs nothing).
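Skip-existing behavior amounts to filtering the job list against what is already on disk. The minimal sketch below assumes a hypothetical naming scheme: each output file name is derived from a hash of the prompt text plus a variant index. The repo's real file naming and both helper names are not taken from the source.

```python
import hashlib
from pathlib import Path
from typing import List, Tuple


def output_name(prompt: str, variant: int) -> str:
    """Hypothetical naming scheme: short hash of the prompt text plus the variant index."""
    digest = hashlib.sha1(prompt.encode("utf-8")).hexdigest()[:12]
    return f"{digest}_{variant:02d}.wav"


def pending_jobs(prompts: List[str], variants: int, out_dir: Path) -> List[Tuple[str, int]]:
    """Return only the (prompt, variant) pairs whose output file does not exist yet."""
    return [
        (prompt, v)
        for prompt in prompts
        for v in range(variants)
        if not (out_dir / output_name(prompt, v)).exists()
    ]
```

Keying names on the prompt text means an edited prompt line hashes to a new name. Only that line is regenerated, while untouched lines find their files and are skipped.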

This is a public repo. Never commit:

  • Hugging Face tokens
  • RunPod API keys
  • B2 / R2 / S3 keys
  • SSH private keys
  • The generated WAVs (gitignored already)

.gitignore covers .env, *.token, *.secret, and outputs/*. If you ever git add a file containing a secret by mistake, rotate the secret first, then git rm, commit, and push. Treat anything that hit main as compromised.
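As an extra guard before committing, a small script can flag paths that match the ignore patterns listed above. The minimal sketch below uses fnmatch and only approximates gitignore semantics. Patterns without a slash are matched against the file name at any depth, and patterns with a slash against the repo-relative path. The function name is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# The patterns the README says .gitignore already covers.
SECRET_PATTERNS = [".env", "*.token", "*.secret", "outputs/*"]


def flag_risky(paths: Iterable[str], patterns: List[str] = SECRET_PATTERNS) -> List[str]:
    """Return repo-relative paths matching any pattern (approximate gitignore rules)."""
    risky = []
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        for pat in patterns:
            # Slash-less patterns match the file name anywhere; others match the full path.
            target = path if "/" in pat else name
            if fnmatchcase(target, pat):
                risky.append(path)
                break
    return risky
```

For example, you could feed it the output of git diff --cached --name-only and refuse to commit if the result is non-empty.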

stable_audio_open_batch_oneshot_guide.md covers:

  • Why Stable Audio 3 vs alternatives (and what changed from SAO 1.0)

  • Prompt-design rules and category-specific templates

  • Optional persistent Network Volume layout (for users running multiple times per week)

  • Optional rclone push to Backblaze B2 / Cloudflare R2 instead of scp

  • Optional custom Docker image

  • Cost-control deep dive

This is one piece of a larger ecosystem around the Signals & Sorcery audio app.

Plugin SDK & templates

  • sas-plugin-sdk: types, components, and hooks for building generator plugins
  • sas-plugin-template: starter template for new plugins
  • sas-chat-plugin: in-app conversational agent

Built-in plugins

  • sas-stems-plugin: default AI audio-from-text + stem-splitting plugin
  • sas-loops-plugin: default audio loop / sample plugin
  • sas-synth-plugin: default synth plugin
  • sas-texture-plugin: texture/ambient plugin
  • sas-recorder-plugin: line-in recording plugin

Audio tooling

  • sas-audio-processor: audio processing utilities
  • Signals2Surge: synth patch transfer to Surge XT

Infrastructure

  • signals-and-sorcery-server: DAWNet API + WebSocket server
  • signals-and-sorcery-docs: public docs

Other

  • signalsandsorcery-game-ui: LLM-powered RPG frontend
  • SignalsAndSorcery: earlier VueJS + Web Audio sample arrangement tool
  • Errantry: E2E testing for agent-facing CLIs (drives this project's CLI too)
