Self Hosting Text-to-Speech AI for Research and Fun
We built the Three Word Stories app in under 48 hours during Pointless Palooza, packaging Coqui's TTS library into a Docker container for AI-powered narration. Our CPU-only setup enables custom storytelling without high GPU costs. Here's how we did it.
During our recent Pointless Palooza, an internal hackathon-style event, we built an app called Three Word Stories. We wanted to see if, in just 48 short hours, we could manage to build an application that included its own AI-powered narration for the stories our users create. Our goal was to let our users not just write stories, but hear them read aloud.
There are quite a few options for pre-baked APIs from the likes of OpenAI that could have handled the narration for us, but where is the fun in that? Not only is building it on our own more exciting, but it would help us ensure the game could be played by as many people as possible without expensive narration generation potentially eating up an already limited budget.
To make this possible and turn our AI of choice into a reproducible, easily deployable system, we used containerization, specifically Docker. One of the early challenges we ran into was that, while there are plenty of guides and docs on how to run a model directly or hack it together on your local system, there is very little information about packaging everything up. We also decided to stick within the constraint of CPU-only processing to keep costs low (GPU time in the cloud is expensive), which meant trading off some clarity and speed of narration for lower costs.
Text to Speech
We spent some time ahead of Pointless looking at what options existed in the space so that, during the event, we could focus on the hardest part: actually building our app. There are an awful lot of models available (of varying quality) for a variety of use cases, like summarization, text-to-image, or audio classification. A good place to start (as of this writing) is Hugging Face, which provides a very handy aggregation of different models and a helpful playground for quite a few of them.
During our search for a model, we came across Tortoise, which led us to the library we ultimately integrated with: TTS by the (now-defunct) Coqui, home of models like XTTS-v2.
The biggest advantage of the TTS library was that it gave us a single interface to multiple models, which drastically reduced the time it took to reach our end goal of offering multiple options for narration. As part of this process, we eventually needed to pick out the underlying models to use. To do this, we wired up a temporary PocketBase endpoint that ran all of the models supported by TTS via goroutines (a simplified sketch of that harness follows the model list below). This let us test out some of the final implementation code for the app while also getting sample audio to evaluate which models we liked best. We landed on four models (below) that performed decently well in a CPU-only container. They are ordered from fastest to slowest, which, coincidentally, correlated directly with the quality of the output: the longer a model took to generate, the better it sounded.
- ljspeech/glow-tts
  - Glow-TTS trained on the LJSpeech dataset
  - Glow-TTS GitHub
- ljspeech/tacotron2-DDC_ph
  - Tacotron 2 trained on the LJSpeech dataset
  - Tacotron 2 GitHub
- ljspeech/vits
  - VITS trained on the LJSpeech dataset
  - VITS GitHub
- jenny/jenny
  - VITS trained on the Jenny dataset
  - VITS GitHub
Important note: These links are the best that we can locate for the underlying sources and are provided only to help you in your research. They should not be treated as definitive.
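For the curious, here is a rough sketch of what that evaluation harness looked like in spirit: one goroutine per model, each shelling out to the tts CLI and writing a sample WAV we could compare. This is a simplified, stand-alone version, not the code we shipped; the model list, sample text, and samples output directory are illustrative, and our real version ran behind a PocketBase endpoint and iterated over every model TTS reports.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"sync"
)

// Illustrative subset of model names; the real harness looped over every
// model the TTS library supports.
var models = []string{
	"tts_models/en/ljspeech/glow-tts",
	"tts_models/en/ljspeech/tacotron2-DDC_ph",
	"tts_models/en/ljspeech/vits",
	"tts_models/en/jenny/jenny",
}

func main() {
	text := "The quick brown fox jumps over the lazy dog."
	outDir := "samples" // assumed output directory for the sample WAVs

	if err := os.MkdirAll(outDir, 0o755); err != nil {
		fmt.Println("could not create output directory:", err)
		return
	}

	var wg sync.WaitGroup
	for i, model := range models {
		wg.Add(1)
		go func(i int, model string) {
			defer wg.Done()
			out := filepath.Join(outDir, fmt.Sprintf("sample-%02d.wav", i))
			// Shell out to the tts CLI, one process per model.
			cmd := exec.Command("tts",
				"--text", text,
				"--model_name", model,
				"--out_path", out,
				"--progress_bar", "False",
			)
			if err := cmd.Run(); err != nil {
				fmt.Printf("%s failed: %v\n", model, err)
				return
			}
			fmt.Printf("%s -> %s\n", model, out)
		}(i, model)
	}
	wg.Wait()
}

Running several CPU-bound models at once will fight over cores, so if you try this yourself you may want to cap the concurrency; for a one-off evaluation run, letting them all loose was good enough.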
Building the Image
The first thing to do is make sure that Docker is installed and running locally, either through Docker Desktop or Docker Engine. From there, we can get started on our Dockerfile. We will be using the Python Debian slim image as our base image for this guide, as it gives us the easiest path to a running image with a working AI model embedded in it. Please be aware that most models require a decently large amount of disk space, so, as you build your images, you may need to regularly prune old ones to keep disk usage to a minimum.
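For reference, once PyTorch and friends are layered in, images like the one below can easily reach several gigabytes. The standard Docker commands for checking and reclaiming that space are:

docker system df       # show how much space images, containers, and volumes are using
docker image prune     # remove dangling (untagged) images
docker image prune -a  # remove every image not used by a container; use with care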
The Dockerfile
# Change to nvidia/cuda:11.8.0-base-ubuntu22.04 for CUDA support
FROM python:3.10.8-slim
# Install OS dependencies:
RUN apt-get update && apt-get upgrade -y \
&& apt-get install -y --no-install-recommends \
gcc g++ \
make \
python3 python3-dev python3-pip python3-venv python3-wheel \
espeak-ng libsndfile1-dev \
&& rm -rf /var/lib/apt/lists/*
# Install Major Python Dependencies:
RUN pip install llvmlite --ignore-installed \
torch \
torchaudio --extra-index-url https://download.pytorch.org/whl/cu118 \
&& rm -rf /root/.cache/pip
# Install the Coqui TTS library for text to speech processing
RUN pip install TTS
ENTRYPOINT ["tts"]
In the first part, we install our OS dependencies (some of which already exist in the image we are using) to ensure that the baseline is set up properly. We then add in the major Python dependencies for TTS specifically, and finally we can install TTS itself. For now our entrypoint is set to tts, so that when we run the image, it will be like having TTS installed locally and directly calling it via the command line.
Once we have the Dockerfile saved to a directory, we can open a terminal in the same directory and build the image with docker build --platform linux/amd64 . -t tts-example, which will provide a custom TTS image tagged as tts-example that we can use to play with the library (please note this can take some time). After the image has finished building, we can use it to generate some audio by running:
docker run --rm -v ~/tts-output:/root/tts-output tts-example \
--text "Hello World! This audio was generated with the T T S library." \
--out_path /root/tts-output/hello.wav
or to use one of the models we enjoyed:
docker run --rm -v ~/tts-output:/root/tts-output tts-example \
--model_name "tts_models/en/jenny/jenny" \
--text "Hello World! This audio was generated with the T T S library." \
--out_path /root/tts-output/hello-jenny.wav
Here is the output we got running the Jenny model: [audio sample: hello-jenny.wav]
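If you want to browse the rest of the catalog yourself, the tts CLI can print every model it knows about (this flag exists in the TTS version we used; check tts --help if it has changed):

docker run --rm tts-example --list_models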
Using With an Application
Our approach to integrating with our application was to call the library using the os/exec package in Go. An example Go program could look something like this (a contrived example, for brevity). Place these files into the same directory alongside the Dockerfile.
go.mod:
module viget.com/tts-example/v2
go 1.22.2
example.go:
package main

import (
	"bytes"
	"fmt"
	"log"
	"os/exec"
)

func generateNarration(text string, modelVariant int, outname string) {
	models := map[int]string{
		0: "tts_models/en/ljspeech/glow-tts",
		1: "tts_models/en/ljspeech/glow-tts",
		2: "tts_models/en/ljspeech/tacotron2-DDC_ph",
		3: "tts_models/en/ljspeech/vits",
		4: "tts_models/en/jenny/jenny",
	}
	model := models[modelVariant]

	divider := "----------------------------------------"
	fmt.Println(divider)
	fmt.Printf("Generating narration with model %s\n", model)

	outfile := fmt.Sprintf("/root/tts-output/%s.wav", outname)
	cmd := exec.Command("tts", "--text", text, "--model_name", model, "--out_path", outfile, "--progress_bar", "False")

	// Create a buffer to capture the output
	var out bytes.Buffer
	cmd.Stdout = &out
	cmd.Stderr = &out

	if err := cmd.Run(); err != nil {
		log.Println("Error running command:", err)
		return
	}

	fmt.Println(divider)
	fmt.Printf("Narration complete, model: %s, command: %s, command output:\n%s\n", model, cmd.String(), out.String())
}

func main() {
	text := "Hello go! This audio was generated with the T T S library."
	generateNarration(text, 2, "narration-2")
	generateNarration(text, 4, "narration-4")
}
Next we need to update our Dockerfile to build and run our Go program instead of just exposing the TTS library.
Dockerfile:
FROM golang:1.22-alpine AS builder
WORKDIR /tts
RUN apk add --no-cache \
unzip \
ca-certificates
COPY . .
RUN go mod download
RUN go build -o example
# Change to nvidia/cuda:11.8.0-base-ubuntu22.04 for CUDA support
FROM python:3.10.8-slim
# Install OS dependencies:
RUN apt-get update && apt-get upgrade -y \
&& apt-get install -y --no-install-recommends \
gcc g++ \
make \
python3 python3-dev python3-pip python3-venv python3-wheel \
espeak-ng libsndfile1-dev \
&& rm -rf /var/lib/apt/lists/*
# Install Major Python Dependencies:
RUN pip install llvmlite --ignore-installed \
torch \
torchaudio --extra-index-url https://download.pytorch.org/whl/cu118 \
&& rm -rf /root/.cache/pip
# Install the Coqui TTS library for text to speech processing
RUN pip install TTS
COPY --from=builder /tts/example /tts/example
# Run the script
CMD ["/tts/example"]
Once the image has been built with docker build --platform linux/amd64 . -t tts-example, we can run it with docker run --rm -v ~/tts-output:/root/tts-output tts-example and the generated audio will be saved to the tts-output folder in your home directory.
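In the real app, the text comes from user-created stories rather than a hard-coded string. One straightforward way to feed arbitrary text into this container is an environment variable. This is only a sketch (the STORY_TEXT name is ours, not anything TTS looks for): swap out main() above and add os to the import list.

func main() {
	// Read the story text from a STORY_TEXT environment variable (an
	// illustrative name), falling back to the demo string when it is unset.
	text := os.Getenv("STORY_TEXT")
	if text == "" {
		text = "Hello go! This audio was generated with the T T S library."
	}
	generateNarration(text, 4, "narration-env")
}

Rebuild the image, then pass the text at run time with docker run --rm -e STORY_TEXT="Once upon a time..." -v ~/tts-output:/root/tts-output tts-example.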
Takeaways
During Pointless Palooza, we learned quite a lot about deploying one of these models into the wild. The explosion of useful AI models, in part driven by Nvidia and CUDA, has created an interesting new space in tech that is definitely worth exploring.
Model Runtime
If you can afford it, GPU acceleration is definitely important for getting the best output from the majority of models. Local testing showed noticeably better performance simply by enabling GPU support, but (as we were able to prove) it is possible to build something useful with CPU-only processing, just a couple of cores, and 4GB of RAM. While it can be cost-prohibitive, I would strongly recommend performing some real-world tests on ephemeral infrastructure or local hardware if you just can't get the output you are looking for with CPU only.
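As a concrete example of what such a test might look like: assuming you have swapped to the CUDA base image mentioned in the Dockerfile comments, installed the NVIDIA Container Toolkit on the host, and are using a TTS version that still exposes the use_cuda flag, a GPU-enabled run looks something like this:

docker run --rm --gpus all -v ~/tts-output:/root/tts-output tts-example \
--model_name "tts_models/en/jenny/jenny" \
--text "Hello World! This audio was generated with the T T S library." \
--out_path /root/tts-output/hello-jenny-gpu.wav \
--use_cuda true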
Containerization
Containerizing the application was absolutely critical to validating that what we were building would behave predictably in production. One thing that can be particularly pernicious with AI models is the “works on my machine” phenomenon, and containerizing the application eliminates some of that drift. An important caveat is that the underlying physical hardware can still influence output; this was less evident with TTS, but I have encountered it when running LLMs. Containerization also lets you move the application from server to server without too much fuss.
Experimentation
Finally, there is no real replacement for direct experimentation. The current expansion in the AI field is relatively new and there isn’t a lot of consensus on best practices — new models are being released all the time. We ended up testing every model supported by TTS before landing on the ones we wanted to include in our application, and we probably missed quite a few that might have given us better results.
I hope this article helps you in your AI endeavors! I can’t wait to see what the future holds for this promising new technology.