Text To Speech / Speech To Text

Before the move I started a little Simple Voice Assistant project. It was going to be an Amazon Alexa-like device built on a Raspberry Pi. The big pieces are Speech To Text (STT) and Text To Speech (TTS). I put it aside for a few months and now it seems everything is broken. This is mostly due to software versions changing and becoming incompatible, a common (modern) software problem. I have also grown disenchanted with OpenAI and their Whisper tool. Looking for a replacement, I found a list of software I hadn’t run across before, and it seems to make good recommendations for STT and TTS. Looks like I’m going to restart with VOSK and Piper.

Top 15 Open Source Speech Recognition/TTS/STT/ Systems

Duplicate City Names

I was writing some Python for my little Simple Voice Assistant project. I started just wanting the current temperature for Austin. It’s available lots of places but I wanted it from an official source and didn’t want to scrape web pages. I found what I was looking for at the weather.gov pages: some XML, keyed off latitude and longitude, that gives all sorts of weather data.

I got it working for Austin pretty quickly but then I decided to do some other cities. I looked up their latitude and longitude and entered them by hand in a small table. Figuring there was a better way, I found some JSON with the locations of the 1000 biggest cities in the US.

I didn’t really need all 1,000 (1,002 actually) but why not? But when I loaded them into a dict there were only 925 cities. What happened to the other 77? It didn’t take long to figure out there were some duplicate names out there. I would have to index by city name + state name.
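
The missing cities are what you would expect from a dict keyed on city name alone: a later entry with the same name silently replaces the earlier one. A minimal sketch of the problem and the city + state fix, assuming each JSON record has "city" and "state" fields (the field names and the three sample records are made up for illustration, not taken from the actual file):

```python
from collections import defaultdict

# Hypothetical sample records; the real file has about 1,000 of these.
cities = [
    {"city": "Austin", "state": "Texas", "latitude": 30.27, "longitude": -97.74},
    {"city": "Portland", "state": "Oregon", "latitude": 45.52, "longitude": -122.68},
    {"city": "Portland", "state": "Maine", "latitude": 43.66, "longitude": -70.26},
]

# Keyed by name only: the second Portland overwrites the first.
by_name = {c["city"]: c for c in cities}

# Keyed by (city, state): every record survives.
by_city_state = {(c["city"], c["state"]): c for c in cities}

# Group states by city name to pull out the duplicates.
states_by_city = defaultdict(list)
for c in cities:
    states_by_city[c["city"]].append(c["state"])
duplicates = {name: states for name, states in states_by_city.items() if len(states) > 1}

print(len(cities), len(by_name), len(by_city_state))
for name, states in sorted(duplicates.items()):
    print(f"{name}: {', '.join(states)}")
```

A tuple key keeps the lookup a plain dict access, e.g. by_city_state[("Portland", "Maine")].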

Out of curiosity I pulled out the duplicates. It’s a sort of interesting list.

Albany: New York, Georgia, Oregon
Alexandria: Virginia, Louisiana
Apple Valley: California, Minnesota
Auburn: Washington, Alabama
Aurora: Colorado, Illinois
Bartlett: Tennessee, Illinois
Beaumont: Texas, California
Bellevue: Washington, Nebraska
Bloomington: Minnesota, Indiana, Illinois
Brentwood: California, Tennessee
Burlington: North Carolina, Vermont
Charleston: South Carolina, West Virginia
Cleveland: Ohio, Tennessee
Clovis: California, New Mexico
Columbia: South Carolina, Missouri
Columbus: Ohio, Georgia, Indiana
Concord: California, North Carolina, New Hampshire
Danville: California, Virginia
Decatur: Illinois, Alabama
Dublin: California, Ohio
Everett: Washington, Massachusetts
Fairfield: California, Ohio
Fayetteville: North Carolina, Arkansas
Florence: Alabama, South Carolina
Glendale: Arizona, California
Greenville: North Carolina, South Carolina
Huntsville: Alabama, Texas
Jackson: Mississippi, Tennessee
Jacksonville: Florida, North Carolina
Kansas City: Missouri, Kansas
Lafayette: Louisiana, Indiana
Lakewood: Colorado, California, Washington, Ohio
Lancaster: California, Pennsylvania, Ohio, Texas
Lawrence: Kansas, Massachusetts, Indiana
Lincoln: Nebraska, California
Madison: Wisconsin, Alabama
Mansfield: Texas, Ohio
Medford: Oregon, Massachusetts
Meridian: Idaho, Mississippi
Middletown: Ohio, Connecticut
Midland: Texas, Michigan
Newark: New Jersey, Ohio, California
Norwalk: California, Connecticut
Pasadena: Texas, California
Peoria: Arizona, Illinois
Plainfield: New Jersey, Illinois
Portland: Oregon, Maine
Quincy: Massachusetts, Illinois
Richmond: Virginia, California
Rochester: New York, Minnesota
Roseville: California, Michigan
Roswell: Georgia, New Mexico
Salem: Oregon, Massachusetts
San Marcos: California, Texas
Smyrna: Georgia, Tennessee
Springfield: Missouri, Massachusetts, Illinois, Oregon, Ohio
St. Cloud: Minnesota, Florida
Troy: Michigan, New York
Union City: California, New Jersey
Warren: Michigan, Ohio
Westminster: Colorado, California
Wilmington: North Carolina, Delaware

Ghost in My Machine

Progress on the Simple Voice Assistant project is going well. Since it’s a software project and lots of things are changing, I haven’t put out much status. But at this point I have all the components working. I can get data from a microphone, convert it to text and match it to a command list. I can also play streaming radio stations and search a digital music library and play music. Yesterday I noticed some odd text in the logs. Maybe AI “hallucinations”. Maybe I need to turn the gain down on my microphone. Or maybe someone or something is trying to contact me. A sample:

2024-12-11 22:17:57,889 INFO TRANSCRIBED TEXT: … And the hesitant enough so far more capabilities are more messaging to join your straitling project. In this case it’s including getting moreissions about the

Voice Assistant Project

Have been looking at what it would take to make a small voice assistant similar to Amazon’s Alexa. We have one (several, actually) but all we use them for is streaming music and the occasional kitchen timer. I was wondering if there were some pieces out there that could be put together on, say, a Raspberry Pi to do something similar. This is very early stages but I wanted to put these bits out now, mostly for future reference. I realize there are a few similar projects out there but I didn’t see any that looked like what I wanted. But they are a good place to start getting ideas. Some obvious bits:

Voice Recognition: I don’t have a microphone on my desktop Raspberry Pi 4 but I found an ancient USB camera that has a built-in mic. It seems to do the job. I looked at a few packages for voice recognition, but they tend to be expensive and complex. The one that looks the most interesting at this time is OpenAI Whisper. It has some quirks, like having to pad everything to 30-second sound clips, but it will be fun to play with. It also seems like a stable project likely to be around a while.
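
The 30-second quirk is easy to picture: shorter clips get silence appended and longer ones get cut. A rough sketch of the idea in plain Python, assuming 16 kHz mono samples held in a list. The function name is made up here, and the Whisper package ships its own helper for this, so this only illustrates what the padding amounts to:

```python
SAMPLE_RATE = 16_000      # samples per second (assumed 16 kHz mono input)
CLIP_SECONDS = 30         # fixed window length
CLIP_SAMPLES = SAMPLE_RATE * CLIP_SECONDS


def pad_or_trim_clip(samples, length=CLIP_SAMPLES):
    """Return exactly `length` samples: trim long clips, zero-pad short ones."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


short_clip = [0.1] * (SAMPLE_RATE * 2)   # two seconds of audio
padded = pad_or_trim_clip(short_clip)
print(len(padded))
```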

Text to Speech: We will also need a way for the system to communicate back without a permanent terminal / keyboard interface. I haven’t done much in this area either, but this used to require specialized hardware, i.e. “sound cards”. I know, that was a long time ago. Looking around I found eSpeak ready for install on Ubuntu. It has a bunch of different voices, was easy to get up and running, and seems solid. My wife said it sounded too robotic though (not her words).
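
Calling eSpeak from Python can be as simple as shelling out to it. A minimal sketch, assuming the espeak binary is on the PATH; the helper names are made up for illustration, and -v (voice) and -s (speed in words per minute) are the only options used:

```python
import shutil
import subprocess


def espeak_command(text, voice="en-us", words_per_minute=150):
    """Build the argv list for an espeak call (-v voice, -s speed)."""
    return ["espeak", "-v", voice, "-s", str(words_per_minute), text]


def say(text):
    """Speak text aloud if espeak is installed; return False if it is not."""
    if shutil.which("espeak") is None:
        return False
    subprocess.run(espeak_command(text), check=False)
    return True


print(espeak_command("Hello from the Raspberry Pi"))
```

Passing the text as its own list element avoids any shell quoting problems with whatever the assistant wants to say.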

Sound Output: The Raspberry Pi 5 doesn’t have a dedicated sound output like the old versions, but I have a Bluetooth speaker I use for playing music. It works just fine with eSpeak.

Text Search: Comparing a text string to other text strings, especially for non-exact matches, is not something I want to code up myself. Again, lots of stuff out there, but I want something simple and stable. I also probably want something directly in Python. FuzzyWuzzy looks good and popular, as do FuzzySet and the more standard SequenceMatcher from difflib. They all seem to rely on difflib or Levenshtein distance. Sounds good to me. There are probably some newer AI-based approaches but I want to keep it simple.
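
For a sense of how little code the standard-library route takes, here is a small sketch using difflib.get_close_matches, which is built on SequenceMatcher. The command phrases and the 0.6 cutoff are placeholder assumptions, not a tuned list:

```python
import difflib

# Hypothetical phrases the assistant should recognize.
COMMANDS = ["play radio", "stop music", "what time is it", "current temperature"]


def best_match(heard, choices=COMMANDS, cutoff=0.6):
    """Return the closest known command, or None if nothing is close enough."""
    matches = difflib.get_close_matches(heard.lower(), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(best_match("play the radio"))
print(best_match("stop the music"))
print(best_match("purple elephants"))
```

Raising the cutoff makes the assistant pickier; lowering it makes it more forgiving of a sloppy transcription, at the cost of more false matches.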

Wake Word: One problem is “waking up” the assistant. There is usually a “wake word”. This more or less means listening 24 / 7 until this word is spoken. Will have to see how this works. A simple loop will probably be good for starters, but maybe a bit wasteful. But that might not matter.
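
The simple-loop version can be sketched without any audio at all: treat every transcribed phrase as a string and only act on the ones that start with the wake word. The wake word "computer" and the function names are made up for illustration, and the list of strings stands in for whatever the speech-to-text step produces:

```python
WAKE_WORD = "computer"   # hypothetical wake word


def extract_command(transcript, wake_word=WAKE_WORD):
    """Return the text after the wake word, or None if it was not spoken first."""
    words = transcript.lower().strip().split()
    if not words or words[0].strip(",.!?") != wake_word:
        return None
    return " ".join(words[1:])


def listen_loop(transcripts):
    """Walk a stream of transcribed phrases and yield only the woken commands."""
    for transcript in transcripts:
        command = extract_command(transcript)
        if command:
            yield command


heard = ["some background chatter", "Computer, play radio", "computer"]
print(list(listen_loop(heard)))
```

The wasteful part is that everything still gets transcribed just to be thrown away, which is the cost of not having a dedicated wake-word detector.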

Commands: I figure a table of expected voice commands mapped to actual Linux command line commands is the easiest. Need to figure out how best to get to things like Spotify, but that is another problem. I assume running some streams will be set up using things like ffmpeg and will require a little knowledge, but it should also be pretty simple.
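
A table like that could be a plain dict from phrase to command line. A minimal sketch, assuming a Linux system; the phrases and the commands in the table are placeholders, not a real command list:

```python
import shlex
import subprocess

# Hypothetical table: spoken phrase -> Linux command line.
COMMAND_TABLE = {
    "what time is it": "date",
    "disk space": "df -h",
}


def run_voice_command(phrase, table=COMMAND_TABLE):
    """Look up a phrase and run its command; return stdout, or None if unknown."""
    command_line = table.get(phrase.lower().strip())
    if command_line is None:
        return None
    # shlex.split turns the command line into an argv list, so no shell is involved.
    result = subprocess.run(shlex.split(command_line), capture_output=True, text=True)
    return result.stdout


print(run_voice_command("What time is it"))
```

Keeping the commands in a table rather than building them from the spoken text means a garbled transcription can only ever run something that was put there on purpose.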

Music Search: I expect to have my MP3 library either locally or remotely hosted. Getting from the text of a song or album name to the location of the file(s) might require a scan and some sort of simple database (maybe just a table). Haven’t thought this one out much yet.
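
The scan-into-a-table idea might look something like this sketch, which walks a directory tree and maps each MP3's file name (minus the extension) to its path. It assumes titles can be read off the file names rather than the MP3 tags, and the demo directory and file names are made up:

```python
import os
import tempfile
from pathlib import Path


def scan_library(root):
    """Walk root and map a lowercase title (file name without .mp3) to its path."""
    table = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".mp3"):
                title = Path(name).stem.lower()
                table[title] = os.path.join(dirpath, name)
    return table


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as root:
    album = Path(root) / "Some Artist" / "Some Album"
    album.mkdir(parents=True)
    (album / "First Song.mp3").touch()
    (album / "cover.jpg").touch()
    library = scan_library(root)
    print(sorted(library))
```

Keying on title alone means two songs with the same name would collide, so a real table would probably need artist or album in the key as well.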

That’s about it for now. Will probably do this in pieces on my desktop and then eventually deploy on a smaller dedicated system.