# Extracting an Obsidian Plugin From a Note App
March 12, 2026 · 5 min read
Most people who build a note app eventually face the question: do I keep adding features, or do I let users take this into the editor they already use?
I picked the second option. Took the voice transcription pipeline out of Tuon, reimplemented the parts that mattered, and shipped it as an Obsidian plugin. No plugin API. No platform play. Just the piece that worked, running inside someone else's editor.
## Why I needed voice transcription in Obsidian
Tuon already had the full loop: mic to AssemblyAI for live transcription, OpenRouter for summarization and cleanup, persistent storage. Two years of iteration. The problem was that I kept using Obsidian for my actual notes and switching to Tuon only when I needed voice. Two apps open. Two sets of notes. Context split across both.
The obvious answer was "put the voice stuff inside Obsidian." The non-obvious part was how.
## Thin client vs self-contained plugin
Path A: Keep the app running and build a thin Obsidian plugin as a client. The plugin captures audio, sends it to Tuon's local gateway, gets transcripts back. This preserves the existing Python pipeline. But it also means the user needs two things running. Ports, auth tokens, firewall prompts on desktop, and on mobile it falls apart entirely. It's the kind of architecture that works great in a demo and breaks in someone's actual workflow.
Path B: Extract the slice. Port the voice pipeline into TypeScript, bundle it inside the plugin, talk to AssemblyAI directly. No local server. No second process. One install, done.
I went with B. It meant rewriting the audio capture and the streaming client from scratch. But it killed a whole class of "why isn't it connecting" problems.
## The Python gateway was a liability
Here's the thing nobody warns you about: a local WebSocket server is a great architecture for development and a terrible architecture for distribution.
My original pipeline was clean. Python process accepts mic audio over a WebSocket, streams it to AssemblyAI, returns transcript events. Worked perfectly on my machine. But "install the plugin and also run this Python server" is a non-starter for anyone who isn't me. Port collisions, firewall dialogs, "which Python version," the whole mess.
I wrote a design doc — called it "gateway adaptation" — where I mapped the protocol: message shapes, audio format (16 kHz PCM16, 50 ms chunks), session lifecycle. That doc became the contract. The gateway was no longer a server; it was an interface. Audio bytes in, transcript events out. Whether that happens over a WebSocket or inside a function call doesn't matter if you nail the contract.
So the "gateway" became a TypeScript module. Same event names. Same data shapes. No network boundary.
The entire contract fits in a single type:
```typescript
type TranscriptEvent =
  | { type: "session_begin"; session_id?: string }
  | { type: "transcript_update"; text: string; is_final: boolean }
  | { type: "session_terminated" }
  | { type: "error"; message: string };
```
That's it. Session starts, transcript text arrives (partial or final), session ends, or something breaks. The Python gateway emitted these over a WebSocket. The TypeScript module emits them through a callback. The consumer code doesn't care which.
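Under that contract, the consumer side can be sketched like this. `Gateway` and `collectFinals` are illustrative names, not the plugin's actual classes, and the event type is repeated so the sketch is self-contained:

```typescript
// The TranscriptEvent type from above, repeated for self-containment.
type TranscriptEvent =
  | { type: "session_begin"; session_id?: string }
  | { type: "transcript_update"; text: string; is_final: boolean }
  | { type: "session_terminated" }
  | { type: "error"; message: string };

// The in-process gateway: the same events the Python server once sent
// over a WebSocket, now delivered through a plain callback.
class Gateway {
  private listeners: Array<(e: TranscriptEvent) => void> = [];

  onEvent(fn: (e: TranscriptEvent) => void): void {
    this.listeners.push(fn);
  }

  // Called internally as AssemblyAI responses arrive; exposed here so
  // the sketch works without a network.
  emit(e: TranscriptEvent): void {
    for (const fn of this.listeners) fn(e);
  }
}

// The consumer accumulates final transcript text and never learns
// whether the events crossed a network boundary.
function collectFinals(gw: Gateway): string[] {
  const finals: string[] = [];
  gw.onEvent((e) => {
    if (e.type === "transcript_update" && e.is_final) finals.push(e.text);
  });
  return finals;
}
```

Swapping the transport means swapping what calls `emit`; everything downstream of the callback is untouched.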
The design doc is what made that port deliberate instead of a rewrite-and-pray.
## What worked about extracting a slice
**One thing to install.** API keys go in Obsidian's settings tab. No "start the server first." No "check if port 8000 is free." That alone cut the support surface to almost nothing.
**Vault-native storage.** Transcripts live in markdown. Summaries live in markdown. Obsidian's sync, backup, and version history just work. I didn't build a persistence layer; I inherited one.
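Inheriting a persistence layer looks roughly like this. The `formatTranscriptSection` helper is hypothetical, not the plugin's actual code; the point is that the plugin's whole "storage engine" reduces to producing markdown and handing it to the vault (e.g. `Vault.append` in Obsidian's API):

```typescript
// Hypothetical helper: turn a finished transcript into a plain markdown
// section. The plugin then hands the string to the vault, e.g.
// `await this.app.vault.append(file, section)` in Obsidian's API, and
// sync, backup, and version history all apply automatically.
function formatTranscriptSection(text: string, when: Date): string {
  const stamp = when.toISOString();
  return `\n## Voice note (${stamp})\n\n${text}\n`;
}
```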
**Prompts transferred directly.** The summarize and prettify prompts I'd tuned in Tuon over months — same system prompts, same temperature settings — dropped straight into the plugin via OpenRouter. Zero rework on the AI side.
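That transfer was cheap because OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so a "prompt" is just a JSON body. A sketch — the model name, temperature, and system prompt here are placeholders, not the ones tuned in Tuon:

```typescript
// Placeholder prompt and settings, not the actual tuned values.
const SUMMARIZE_SYSTEM_PROMPT = "Summarize the transcript as concise notes.";

// Build the chat-completions request body; this object is the entire
// portable artifact — it works the same from Python or TypeScript.
function buildSummarizeRequest(transcript: string, model: string) {
  return {
    model,
    temperature: 0.3, // placeholder
    messages: [
      { role: "system", content: SUMMARIZE_SYSTEM_PROMPT },
      { role: "user", content: transcript },
    ],
  };
}

// POST to OpenRouter's OpenAI-compatible endpoint.
async function summarize(apiKey: string, transcript: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildSummarizeRequest(transcript, "openai/gpt-4o-mini")),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```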
**Scope as a feature.** The plugin doesn't do everything Tuon does. It does voice-to-notes and text cleanup. That's it. Shipping one vertical instead of an entire app meant the first version was usable in days, not months.
## What surprised me: audio capture was easy, storage was hard
I expected the audio capture to be the hard part. It wasn't. Obsidian runs in Electron, so the browser audio APIs are all there. The core of the mic pipeline is an AudioWorklet that buffers Float32 samples and emits Int16 PCM chunks at 16 kHz:
```js
class Pcm16Processor extends AudioWorkletProcessor {
  constructor(options) {
    super();
    this.chunkSize = options.processorOptions.chunkSize || 800; // 50ms
    this._buf = new Float32Array(this.chunkSize);
    this._pos = 0;
  }

  process(inputs) {
    const input = inputs[0]?.[0];
    if (!input) return true;
    for (let i = 0; i < input.length; i++) {
      this._buf[this._pos++] = input[i];
      if (this._pos === this.chunkSize) {
        // Convert the buffered Float32 samples to Int16 PCM.
        const out = new Int16Array(this.chunkSize);
        for (let j = 0; j < this.chunkSize; j++) {
          const s = Math.max(-1, Math.min(1, this._buf[j]));
          out[j] = s < 0 ? s * 0x8000 : s * 0x7fff;
        }
        // Transfer the chunk to the main thread without copying.
        this.port.postMessage(out.buffer, [out.buffer]);
        this._pos = 0;
      }
    }
    return true;
  }
}

// Register under the name the main thread passes to AudioWorkletNode.
registerProcessor("pcm16-processor", Pcm16Processor);
That's the whole conversion: Float32 audio in, Int16 PCM out, chunked at 800 samples (50 ms at 16 kHz). Wire it to getUserMedia and an AudioContext at 16 kHz, and you've got a streaming mic pipeline in the browser. No native dependencies. No binary builds.
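The wiring around that worklet is short. A sketch, assuming the processor is registered as `pcm16-processor` in its own worklet file — the module path, processor name, and `startMic` helper are all illustrative:

```typescript
// Chunk math: at 16 kHz, 50 ms of audio is 800 samples.
const SAMPLE_RATE = 16000;
const CHUNK_MS = 50;
const CHUNK_SAMPLES = (SAMPLE_RATE * CHUNK_MS) / 1000; // 800

// Wire mic -> AudioWorklet -> callback. Browser-only; the worklet
// module path and processor name are placeholders.
async function startMic(onChunk: (pcm: ArrayBuffer) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: SAMPLE_RATE });
  await ctx.audioWorklet.addModule("pcm16-processor.js");
  const source = ctx.createMediaStreamSource(stream);
  const node = new AudioWorkletNode(ctx, "pcm16-processor", {
    processorOptions: { chunkSize: CHUNK_SAMPLES },
  });
  node.port.onmessage = (e) => onChunk(e.data); // Int16 PCM, ready to stream
  source.connect(node);
  // Return a cleanup function that releases the mic and the context.
  return () => {
    stream.getTracks().forEach((t) => t.stop());
    void ctx.close();
  };
}
```

From there, `onChunk` feeds the gateway module the same 50 ms PCM16 chunks the Python process used to receive over the wire.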
The actual hard part was storage. Where does a transcript live inside a markdown note? A code block? Front matter? A hidden div? I ended up with a pattern I'll write about separately: a fenced code block rendered through Obsidian's `registerMarkdownCodeBlockProcessor` for metadata and a hidden HTML block for content. Getting that right took more iteration than the entire streaming pipeline.
The other surprise was mobile. I assumed the plugin would be desktop-only because of mic access. Turns out Obsidian mobile runs in a WebView that supports getUserMedia. The plugin works on phones. I didn't plan for that. Happy accident.
## The playbook
This isn't a one-off. I've done it again with a research plugin (Deep Research) — same idea, different slice. The pattern:
- Pick the vertical that's useful outside your app.
- Document the interface contract. Not the implementation — the protocol. What goes in, what comes out.
- Reimplement inside the host (Obsidian, VS Code, whatever). Accept that you'll rewrite some things.
- Ship a single artifact. One install. No dependencies outside the host.
The temptation is to build a platform first — plugin API, SDK, docs, the whole thing. For a solo builder, that's months of work before anyone can use anything. Extracting a slice is faster and tells you whether the feature is worth building a platform around.
Start with the smallest useful piece. Ship that. Then decide if you need the rest.