Say Hello to Hello Transcribe
Introduction
In my previous blog post I described how I created a WhatsApp bot to transcribe voice notes using OpenAI Whisper.
The next step would have been to make it available for anyone to use. That would require getting a new phone number and setting up infrastructure with enough processing power for a good user experience.
Then I discovered that Georgi Gerganov had created a C++ implementation of Whisper. I tried it out and it was blazingly fast, MUCH faster than the PyTorch implementation.
Right from the start I had hoped that I could run Whisper on my iPhone without having to deal with servers and the WhatsApp Business hoops to jump through, but it seemed impossibly complex… now there was hope.
Whisper on iOS
This was a fun journey with a lot of open browser tabs. There were some challenges, especially as I’m not that familiar with the SwiftUI/Xcode ecosystem.
My goal was to create:
- an MVP app
- that could be distributed on the App Store
- which uses the iOS file sharing features.
So the app would not record or import audio directly; a user would simply share the audio from another app like WhatsApp or Voice Memos.
Some challenges included:
- Getting the C++ code to compile and execute in an Xcode macOS command line project, using a hard-coded model and an existing WAV file.
- Figuring out how iOS handles file types when sharing, and Apple’s Uniform Type Identifiers. I still have some questions here.
- Handling the Swift/C++ interaction. This is particularly problematic when using callbacks :/ You need to jump through some interesting hoops to work around the type safety of Swift and its closures. There’s a Medium post here on this topic if it interests you.
- Adding audio conversion to the PCM16 format required by Whisper.
- Actually building something useful in SwiftUI instead of toy projects.
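To give a flavour of the Uniform Type Identifier side: the UniformTypeIdentifiers framework at least makes it straightforward to check whether a shared file conforms to an audio type. A minimal sketch (the URL is assumed to arrive from the iOS share sheet; this is illustrative, not the app’s actual code):

```swift
import Foundation
import UniformTypeIdentifiers

// Derive a UTType from the file extension and check whether it
// conforms to the abstract "audio" type (covers m4a, wav, mp3, …).
func isAudio(_ url: URL) -> Bool {
    guard let type = UTType(filenameExtension: url.pathExtension) else {
        return false
    }
    return type.conforms(to: .audio)
}
```

In practice you also declare the accepted types in the share extension’s Info.plist, which is where some of my open questions remain.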
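The callback hoops come down to this: a C API takes a plain function pointer plus an opaque `user_data` pointer, and a `@convention(c)` Swift closure cannot capture any Swift state. A common workaround, sketched here with a made-up C-style function rather than the real whisper.cpp API, is to box the closure in a class and smuggle it through `user_data` with `Unmanaged`:

```swift
import Foundation

// Hypothetical C-style API: reports progress through a plain C
// function pointer plus an opaque user-data pointer.
typealias ProgressCallback = @convention(c) (Int32, UnsafeMutableRawPointer?) -> Void

func c_transcribe(_ callback: ProgressCallback, _ userData: UnsafeMutableRawPointer?) {
    for p: Int32 in [25, 50, 100] { callback(p, userData) }
}

// Box the Swift closure so it can cross the C boundary as a raw pointer.
final class CallbackBox {
    let onProgress: (Int) -> Void
    init(_ onProgress: @escaping (Int) -> Void) { self.onProgress = onProgress }
}

func transcribe(onProgress: @escaping (Int) -> Void) {
    let box = CallbackBox(onProgress)
    let userData = Unmanaged.passRetained(box).toOpaque()
    // The closure literal below captures nothing, so Swift converts it
    // to a C function pointer; all context travels through userData.
    c_transcribe({ progress, userData in
        let box = Unmanaged<CallbackBox>.fromOpaque(userData!).takeUnretainedValue()
        box.onProgress(Int(progress))
    }, userData)
    Unmanaged<CallbackBox>.fromOpaque(userData).release()
}

transcribe { percent in print("progress: \(percent)%") }
```

Getting the retain/release balance right here is exactly the kind of thing the type checker can’t help you with.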
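For the audio conversion step, `AVAudioConverter` can resample and re-encode in one pass. A rough sketch of converting a shared file to 16 kHz mono 16-bit PCM (simplified error handling; the function name and exact target format are my assumptions, not the app’s actual code):

```swift
import AVFoundation

// Read an audio file and convert it to 16 kHz mono 16-bit PCM samples.
func convertToPCM16(url: URL) throws -> [Int16] {
    let file = try AVAudioFile(forReading: url)
    guard let outFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                        sampleRate: 16_000,
                                        channels: 1,
                                        interleaved: false),
          let converter = AVAudioConverter(from: file.processingFormat,
                                           to: outFormat)
    else { throw NSError(domain: "convert", code: 1) }

    // Load the whole input file into one buffer (fine for voice notes).
    let inBuffer = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                                    frameCapacity: AVAudioFrameCount(file.length))!
    try file.read(into: inBuffer)

    let ratio = outFormat.sampleRate / file.processingFormat.sampleRate
    let outCapacity = AVAudioFrameCount(Double(inBuffer.frameLength) * ratio) + 1
    let outBuffer = AVAudioPCMBuffer(pcmFormat: outFormat,
                                     frameCapacity: outCapacity)!

    // Feed the input buffer once, then signal end of stream.
    var delivered = false
    converter.convert(to: outBuffer, error: nil) { _, outStatus in
        if delivered { outStatus.pointee = .endOfStream; return nil }
        delivered = true
        outStatus.pointee = .haveData
        return inBuffer
    }
    return Array(UnsafeBufferPointer(start: outBuffer.int16ChannelData![0],
                                     count: Int(outBuffer.frameLength)))
}
```

Voice notes are short, so converting the whole file in memory keeps things simple; a streaming version would feed the converter chunk by chunk.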
That being said, I really enjoy working in the Apple/iOS/Xcode ecosystem.
The Result
Hello Transcribe is available for free on the App Store.
Here’s a demo of transcribing the audio below on my iPhone 12 Pro. It takes about a second:
Try it out and let me know what you think.
Next
I’m not 100% sure what’s next, but someone has already asked for a built-in dictation function, instead of recording audio in the Voice Memos app and then transcribing it.
I also want to implement a Swift streaming transcription library so it can be reused for voice commands in iOS/macOS/etc. apps.
I’m not putting more effort into the server bot at this point.