A ~10KB page that sings text — can you decode it back? (pitch-only baseline already hits 43%)

A developer proposes a tiny deterministic singing-code sandbox for synthetic Singing Voice Transcription, breaking the problem into audio-to-notes, audio-to-text, and word-motif alignment layers. The project's generator can produce perfect labels for audio, text, notes, and timing, enabling a reproducible benchmark before scaling to neural ASR models.

This looked interesting, so I checked whether there are reusable pieces around it.

Yes, I think this is very plausibly decodable and developable, but I would not start by framing it as only “ASR” or only “a language question”. A useful nearby frame is: a tiny deterministic singing-code / synthetic Singing Voice Transcription sandbox.

In other words, this can be separated into several layers:

- audio → notes / motifs
- audio or motifs → text
- word ↔ motif alignment

That separation seems useful because nearby fields already have reusable tools, metrics, datasets, and failure-mode maps.

The interesting part is that this project has something real singing datasets usually do not have: a generator that can produce perfect labels for audio, text, notes/motifs, timing, vowels, and word↔motif alignment.

So before jumping to a large neural ASR model, I would probably grow it in this order:

1. toy demo
2. reproducible benchmark
3. structured metadata
4. note/motif scoring
5. text scoring
6. word↔motif alignment scoring
7. stronger non-ML baselines
8. toy / injective / noisy / style variants
9. CTC / Wav2Vec2 / SVT-style joint models
10. human-learnability / code-language boundary questions

A particularly useful neighboring area is Singing Voice Transcription, especially newer work that tries to unify lyric transcription, note transcription, and lyric-note alignment.
In ordinary speech ASR, the target is mostly:

- audio → text

But in singing transcription, the target is often closer to:

- audio → lyrics
- audio → notes / melody
- lyrics ↔ notes alignment

That maps surprisingly well onto this project:

| SVT / singing transcription layer | This project’s analogue |
|---|---|
| lyrics transcription | audio → original text |
| note transcription | audio → note/motif sequence |
| lyric-note alignment | word ↔ motif alignment |
| phoneme/vowel alignment | vowel/formant cues inside motifs |
| style / singer variation | later style/noisy/timbre variants |
| out-of-distribution singing | later robustness tests |

So I would treat this less as “general ASR” at first, and more as a controlled synthetic version of an SVT problem.

Some nearby references:

| Reference | Why it seems relevant |
|---|---|

I would split the benchmark into at least three measurable tasks.

| Layer | Task | Possible metric family |
|---|---|---|
| Note/motif recovery | recover pitch events, notes, or motif sequence from audio | MIR-style note transcription metrics, motif accuracy |
| Text recovery | recover original word sequence | word accuracy, WER, CER |
| Alignment recovery | recover which word corresponds to which motif and when | alignment accuracy, timing error |
| Robustness | recover the same labels under perturbations | score degradation under noise/compression/reverb/etc. |

Useful existing tooling:

| Tool | Use |
|---|---|

Instead of only storing `audio`, `text`, it may be useful to store the generated structure.
Something like:

```json
{
  "audio": "sample_0001.wav",
  "text": "the red bird",
  "words": [
    {
      "word": "red",
      "motif_id": "m_014",
      "notes": ["A4", "C5"],
      "vowels": ["e"],
      "start": 0.42,
      "end": 0.81
    }
  ],
  "codec_version": "loom-v0",
  "split": "test",
  "difficulty": "toy",
  "collision_group": null
}
```

This would make the dataset usable from several angles:

| User wants to test… | They can use… |
|---|---|
| text decoding | `audio`, `text` |
| pitch/motif recovery | `audio`, `notes`, `motif_id` |
| vowel-aware decoding | `audio`, `vowels`, `word` |
| alignment | `start`, `end`, `word`, `motif_id` |
| codec audit | `motif_id`, `collision_group`, `codec_version` |
| robustness | same labels under noisy variants |

The DALI dataset is a useful mental model here: it stores lyrics and vocal notes with time alignment and multiple lyric granularities. For this project, the analogous hierarchy could be:

note → motif → word → line/sample

I would probably avoid jumping straight from the current pitch-only baseline to a large ASR model. A more informative ladder might be:

| Baseline | What it tells you |
|---|---|
| current FFT pitch-only greedy decoder | floor baseline |
| pitch-only + dynamic programming | removes some greedy parsing weakness |
| pYIN/aubio pitch + DP decoder | better note tracking without full ASR |
| pitch + vowel-aware decoder | tests whether vowel/formant information helps |
| oracle note sequence → text | isolates codec/parser ambiguity |
| oracle note+vowel sequence → text | estimates upper bound from symbolic information |
| small CTC model | first neural baseline |
| Wav2Vec2/Whisper-like ASR fine-tuning | general ASR transfer baseline |
| SVT-style joint decoder | future structured model: audio → motifs + text |

This gives more diagnostic value than a single score, because each step answers a different question.

To keep the idea clean, I would separate variants rather than mixing all goals into one task.
| Variant | Goal |
|---|---|
| toy | current fun/demo version; ambiguity allowed |
| injective | collision-free / prefix-safe version for stricter codec testing |
| noisy | compression, additive noise, reverb, resampling, time stretch |
| style | timbre, vowel, vibrato, voice/singer variation |
| physical | speaker-to-microphone playback |
| open-vocab | held-out words or motifs |
| human | learnability and human decoding experiments |

The injective variant would be useful if the question is “can this be a lossless audio code?” The toy variant is still useful if the question is “how far can decoders get under ambiguity?” The noisy/style variants are useful if the question becomes closer to ASR/SVT robustness.

SVT is useful for the singing/transcription side, but the codec side also has nearby prior art.

| Reference | Relevance |
|---|---|

This suggests another layer of knobs:

- symbol duration
- note gap
- motif gap
- word gap
- start/end markers
- checksum
- error correction
- payload rate
- collision report

That does not mean this should become a modem. It just means the modem world has useful engineering vocabulary for controlled acoustic codes.

I would keep the language question, but separate it into layers:

| Layer | Question |
|---|---|
| recoverability | can a decoder recover the source text? |
| unique decodability | is the audio code unambiguous in principle? |
| robustness | does it survive noise, timing changes, compression, etc.? |
| learnability | can humans learn it without memorizing arbitrary pairs? |
| convention | could multiple users share it consistently? |
| productivity | can it support new utterances systematically? |
| language-like behavior | does it go beyond codebook lookup? |

Nearby conceptual references:

| Area | Why it helps |
|---|---|

So I would phrase the language angle carefully:

High decoding accuracy would be interesting, but it would mostly show recoverability under the chosen code and evaluation setup.
Human learnability, convention, and productivity are separate next questions.

That keeps the door open without overclaiming.

If I were trying to grow this with existing pieces, I would probably do:

1. Keep the current toy benchmark reproducible
2. Add structured metadata
3. Add two scoring tracks
   - `mir_eval`-style note/motif scoring
   - `jiwer`-style text scoring
4. Add an alignment track
5. Add a baseline ladder
6. Split variants
7. Only then try heavier models

The strongest part, to me, is that this creates a small deterministic audio world where several normally tangled problems can be tested separately:

- codec design
- acoustic decoding
- motif recovery
- text recovery
- vowel cues
- alignment
- noise robustness
- human learnability
- code/language boundary

Real SVT datasets have annotation cost, singer variability, accompaniment, noisy alignment, and style differences. This toy system can start with none of that, then add complexity deliberately.

So the most useful framing may be: a controllable synthetic SVT/audio-code benchmark, with perfect labels and gradually adjustable ambiguity.

That seems like a nice place to test simple decoders first, then stronger ASR/SVT models later.
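
To make the three scoring tracks concrete, here is a rough, standard-library-only sketch of how one sample could be scored for text, motif, and alignment recovery. It is not the project's actual scorer. The `WordLabel` fields mirror the metadata idea above, while the motif ids `m_003` and `m_021`, the timings, and the 50 ms tolerance are made-up assumptions for illustration. In practice `jiwer` and `mir_eval` would likely replace the hand-written edit distance.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WordLabel:
    """One word-level label; field names are hypothetical, mirroring the metadata idea."""
    word: str
    motif_id: str
    start: float  # seconds
    end: float    # seconds


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp) -> float:
    """Edit distance normalised by reference length (WER when items are words)."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def score_sample(ref: list[WordLabel], hyp: list[WordLabel],
                 tolerance: float = 0.05) -> dict:
    """Score text, motif, and alignment recovery separately for one sample.

    Alignment is compared position by position, which is a simplification:
    a real scorer would first align the two sequences.
    """
    pairs = list(zip(ref, hyp))
    aligned = sum(
        1 for r, h in pairs
        if r.word == h.word and r.motif_id == h.motif_id
        and abs(r.start - h.start) <= tolerance
        and abs(r.end - h.end) <= tolerance
    )
    onset_errors = [abs(r.start - h.start) for r, h in pairs]
    return {
        "wer": error_rate([w.word for w in ref], [w.word for w in hyp]),
        "motif_error_rate": error_rate([w.motif_id for w in ref],
                                       [w.motif_id for w in hyp]),
        "alignment_accuracy": aligned / len(ref) if ref else 0.0,
        "mean_onset_error": (sum(onset_errors) / len(onset_errors)
                             if onset_errors else 0.0),
    }


# Hypothetical example: the decoder finds every motif but maps one to the wrong word.
ref = [WordLabel("the", "m_003", 0.00, 0.38),
       WordLabel("red", "m_014", 0.42, 0.81),
       WordLabel("bird", "m_021", 0.85, 1.30)]
hyp = [WordLabel("the", "m_003", 0.01, 0.37),
       WordLabel("bed", "m_014", 0.44, 0.80),
       WordLabel("bird", "m_021", 0.86, 1.31)]
scores = score_sample(ref, hyp)
print(scores)
```

In this made-up case the motif error rate is 0 while the word error rate is 1/3, which is exactly the kind of split a single text score would hide: the acoustic front end did its job, and the remaining error sits in the motif → word mapping.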