From Audio to Action Items in Under a Minute

I stopped taking meeting notes. Not because I got lazy — because I automated the entire chain from raw audio to structured action items, logged decisions, and updated references.

The setup has two parts: a Go CLI tool called meetingcli that handles recording and transcription, and an AI agent skill that processes the transcript into my personal operating system. End to end, a one-hour meeting goes from audio to fully processed in about a minute after I hang up.

The Recording Problem

Meeting notes are lossy. You’re either paying attention or writing things down — rarely both. And even good notes miss nuance, exact phrasing, and the stuff you didn’t realize was important until later.

The alternative is recording, but most tools leave you with a giant audio file or a mediocre transcript locked inside some SaaS platform. I wanted something local, private, and plugged into my existing workflow.

meetingcli

meetingcli is a Go CLI that does three things: record, transcribe, and summarize. One command, no UI.

meeting start --name "product-sync"
# talk for an hour
# Ctrl+C to stop
# → transcription and summary happen automatically

Dual Audio Capture

The tricky part of meeting recording on macOS is capturing both sides of the conversation. meetingcli runs two audio streams in parallel:

  1. System audio via Apple’s ScreenCaptureKit — this captures whatever’s coming out of your speakers or headphones. It taps directly into the OS audio mixer, so it works with Bluetooth headphones, external DACs, whatever. The implementation is Objective-C bridged into Go via cgo.

  2. Microphone audio via ffmpeg from the default input device.

Both streams record to separate WAV files, then get merged into a single recording.wav using ffmpeg’s amix filter. The system audio capture does real-time downsampling from 48kHz stereo float32 to 16kHz mono int16 — keeps file sizes reasonable without losing speech clarity.
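The downsampling step can be sketched in Go. This is an illustration of the conversion described above, not meetingcli's actual capture callback: mix the stereo pair to mono, decimate 3:1 (48kHz → 16kHz) with a crude averaging filter, then scale to int16.

```go
package main

import "fmt"

// downsample converts interleaved 48 kHz stereo float32 samples to
// 16 kHz mono int16. A sketch of the conversion described in the post;
// the real capture callback may use a proper anti-aliasing filter.
func downsample(in []float32) []int16 {
	// 1. Mix stereo to mono by averaging each channel pair.
	mono := make([]float32, 0, len(in)/2)
	for i := 0; i+1 < len(in); i += 2 {
		mono = append(mono, (in[i]+in[i+1])/2)
	}
	// 2. Decimate 3:1 (48 kHz -> 16 kHz), averaging each group of
	//    three samples as a crude low-pass filter.
	out := make([]int16, 0, len(mono)/3)
	for i := 0; i+2 < len(mono); i += 3 {
		avg := (mono[i] + mono[i+1] + mono[i+2]) / 3
		// 3. Clamp to [-1, 1] and scale to the int16 range.
		if avg > 1 {
			avg = 1
		} else if avg < -1 {
			avg = -1
		}
		out = append(out, int16(avg*32767))
	}
	return out
}

func main() {
	// Six stereo frames at 48 kHz become two mono samples at 16 kHz.
	in := []float32{0.5, 0.5, 0.5, 0.5, 0.5, 0.5, -1, -1, -1, -1, -1, -1}
	fmt.Println(downsample(in)) // → [16383 -32767]
}
```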

Each meeting lands in its own timestamped folder:

~/meetings/2026-02-06_14-00-00_product-sync/
├── recording.wav      # merged audio
├── system.wav         # system audio (raw)
├── mic.wav            # mic audio (raw)
├── transcript.md      # diarized transcript
└── summary.md         # AI-generated summary
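The merge step that produces recording.wav can be sketched as an ffmpeg invocation from Go. The exact flags meetingcli uses may differ; this shows the amix filter doing what the post describes:

```go
package main

import (
	"fmt"
	"os/exec"
)

// mergeCmd builds an ffmpeg invocation that mixes the two per-side
// recordings into recording.wav. A sketch of the merge described in the
// post; the tool's actual flags may differ.
func mergeCmd(dir string) *exec.Cmd {
	return exec.Command("ffmpeg",
		"-i", dir+"/system.wav", // far side: speakers/headphones
		"-i", dir+"/mic.wav",    // near side: microphone
		// amix overlays both inputs into one stream;
		// duration=longest keeps audio until the longer input ends.
		"-filter_complex", "amix=inputs=2:duration=longest",
		dir+"/recording.wav",
	)
}

func main() {
	cmd := mergeCmd("~/meetings/2026-02-06_14-00-00_product-sync")
	fmt.Println(cmd.Args)
}
```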

Transcription

The merged audio gets sent to Mistral’s Voxtral API with speaker diarization enabled. The transcript comes back with speaker labels, so you get a readable conversation rather than a wall of text:

**Speaker 0:**
We need to decide on the auth model before shipping the API redesign.
 
**Speaker 1:**
I think JWT with refresh tokens is the way to go. OAuth adds complexity we don't need yet.

Voxtral handles multilingual audio well — useful when half your meetings are in German and the other half in English.
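For a sense of what that API call looks like from Go, here is a minimal multipart upload sketch. The endpoint path, model id, and field names are assumptions based on the common transcription-API shape — check Mistral's API reference before relying on any of them:

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
	"os"
)

// transcribeRequest builds a multipart upload for an audio
// transcription endpoint. The URL path, field names, and model id
// below are illustrative assumptions, not confirmed API details.
func transcribeRequest(apiKey string, audio []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("file", "recording.wav") // assumed field name
	if err != nil {
		return nil, err
	}
	part.Write(audio)
	w.WriteField("model", "voxtral-mini-latest") // assumed model id
	w.Close()

	req, err := http.NewRequest("POST",
		"https://api.mistral.ai/v1/audio/transcriptions", &body) // assumed path
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	req, err := transcribeRequest(os.Getenv("MISTRAL_API_KEY"), []byte("audio bytes"))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```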

Summary

After transcription, the full text goes to Claude Haiku 4.5 for summarization. This produces a quick-reference summary with key topics, decisions, and action items. Useful for a glance, but the real value comes from what happens next.

Processing Transcripts Into the System

Recording and transcribing are table stakes. The interesting part is what happens to the transcript afterward.

I have an AI agent skill called process-meetings that reads transcripts and updates my entire operating system accordingly. When I run it, it spawns a parallel sub-agent for each meeting recorded that day. Each sub-agent reads the transcript and does six things:

1. Creates a Meeting Note

A concise summary goes into the relevant project folder — participants, key topics, decisions, action items with owners. Not a rehash of the transcript, but the stuff that matters a week from now.

2. Updates the Compass

The compass is my AI’s memory system. The agent updates decisions.md when significant decisions were made, updates the context graph when project status changed, and captures any preferences I expressed.

If I said “let’s drop the OAuth requirement and go with API keys” in a meeting, that decision gets logged with reasoning. Next time AI helps me with the auth system, it knows what was decided and why.

3. Updates References

Project briefs, people files, and other reference documents get updated when durable understanding changes. New team member mentioned? Their file gets created. Project scope shifted? The brief gets updated.

4. Flags Tasks

This one is deliberately conservative. The agent only creates tasks that are directly my responsibility — not things assigned to other team members, not things I just need to be aware of. It checks existing tasks first to avoid duplicates.

5. Updates Collections

If I mentioned a restaurant I liked or a hike worth doing, it gets added to my collections. Small thing, but it means casual mentions don’t get lost.

6. Zettelkasten Connections

Occasionally a meeting surfaces a genuinely interesting concept or pattern. The agent checks if it connects to existing notes in my knowledge base. Most meetings don’t produce anything here — that’s by design.

The Full Pipeline

Here’s what a typical day looks like:

  1. Join a meeting. Run meeting start. Talk.
  2. Hang up. Ctrl+C. Transcription and summary happen automatically.
  3. At the end of the day (or whenever), run the process-meetings skill.
  4. Sub-agents process each meeting in parallel. My compass, references, tasks, and notes all get updated.

Total manual effort: typing meeting start and pressing Ctrl+C.

Why This Works

Three design choices make the difference:

Local-first recording. The audio files live on my machine. No uploading hour-long recordings to some startup’s servers. The only thing that leaves my machine is the audio going to Mistral for transcription and the text going to Anthropic for summarization.

Separation of capture and processing. meetingcli does one job well: turn audio into text. The AI agent does the interpretation. This means I can improve either side independently. Better transcription model? Swap the API call. Better processing logic? Update the skill prompt.

Transparent output. Everything is markdown files in known locations. No database, no proprietary format. The meeting folders are just folders. The compass updates are just file edits. I can inspect, override, or correct anything.

Building meetingcli

The tool is open source and installable via Homebrew:

brew install devbydaniel/tap/meeting

The Go codebase follows hexagonal architecture — the recording, transcription, and summarization are separate use cases with clean interfaces. The macOS audio capture is the most interesting piece technically: an Objective-C implementation of ScreenCaptureKit bridged into Go, handling real-time audio format conversion in the capture callback.
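The hexagonal layout can be sketched with a pair of port interfaces. These names are illustrative, not meetingcli's actual identifiers — the point is that the core use case only sees small interfaces, so adapters (ffmpeg, Voxtral, Claude) can be swapped without touching it:

```go
package main

import "fmt"

// Hypothetical ports: each pipeline stage sits behind a small interface.
type Transcriber interface {
	Transcribe(audioPath string) (string, error)
}

type Summarizer interface {
	Summarize(transcript string) (string, error)
}

// processMeeting is the core use case; it depends only on the ports,
// never on a concrete API client.
func processMeeting(audioPath string, t Transcriber, s Summarizer) (string, error) {
	transcript, err := t.Transcribe(audioPath)
	if err != nil {
		return "", err
	}
	return s.Summarize(transcript)
}

// Fake adapters show how easily the edges swap out, e.g. in tests.
type fakeTranscriber struct{}

func (fakeTranscriber) Transcribe(p string) (string, error) {
	return "transcript of " + p, nil
}

type fakeSummarizer struct{}

func (fakeSummarizer) Summarize(t string) (string, error) {
	return "summary: " + t, nil
}

func main() {
	out, _ := processMeeting("recording.wav", fakeTranscriber{}, fakeSummarizer{})
	fmt.Println(out) // → summary: transcript of recording.wav
}
```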

If you’re curious about the implementation or want to adapt it for your own workflow, the source is on GitHub.

What’s Missing

Speaker identification. Voxtral gives you “Speaker 0” and “Speaker 1,” not names. I could map these manually or build a voice fingerprinting step, but honestly the context usually makes it obvious who said what.

The summary step inside meetingcli is redundant with the agent processing. I might remove it or make it optional — the agent produces better, more contextualized output anyway.

And the processing skill currently requires me to trigger it manually. It could watch the meetings directory and process automatically when a new transcript appears. Haven’t needed that yet — running it once at end of day works fine.

Imprint

This website is created and run by Daniel Benner, Zur Deutschen Einheit 2, 81929 München, Germany. Contact: hello(a)danielbenner.de