Python Google Speech to Text API implementation

Using Google's Speech to Text API with Python to transcribe audio files

This seems to be a constant request on Stack Overflow, and since documentation for Google’s Speech API is practically non-existent, I have decided to share an implementation of it with everyone.  If you just want the source code, here you go.

Google Speech API Supported File Types

First off, your audio must be encoded in the FLAC audio format for Google’s Speech API to accept it.  We will not be transcoding audio in the Python script, so you will have to do it beforehand. If you need an easy-to-use tool to convert your audio files, give fre:ac a try.  It is a free, open-source converter for Windows, Mac OS X, Linux, and FreeBSD.  Alright, now on to the good stuff.

FLAC Basics

It is really useful to be able to pull some information out of our FLAC files with the Python script itself, so that we don’t have to depend on a third-party library or application.  Luckily for us, FLAC is an open-source format with a really clear specification.  All of the information we need to extract can be found in the STREAMINFO METADATA_BLOCK.  The basic diagram of what we need out of the FLAC file looks like this:

fLaC file overview

The first METADATA_BLOCK is ALWAYS the STREAMINFO block.  The first bit in the METADATA_BLOCK_HEADER simply marks whether this METADATA_BLOCK is the last one.  The next 7 bits mark the BLOCK_TYPE, and the last 24 bits give the length of the METADATA that follows the header.

METADATA_BLOCK_HEADER

So that should be easy enough to at least get started with confirming that the file we send to our script is actually a FLAC file.  Let’s see if we can start reading some info from FLAC files
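As a rough sketch of that check (written for Python 3; the function name and the return shape are my own placeholders, not taken from the original script), we can read the 4-byte fLaC marker and then pick apart the 4-byte METADATA_BLOCK_HEADER that follows it. This version only insists that the BLOCK_TYPE bits are 0 and reports the last-block flag separately:

```python
import io


def read_first_block_header(stream):
    """Return (is_last, block_type, length) for the first metadata block.

    Raises ValueError if the stream does not look like a FLAC file.
    """
    if stream.read(4) != b"fLaC":
        raise ValueError("missing fLaC stream marker")
    header = stream.read(4)
    if len(header) != 4:
        raise ValueError("file ends before the first METADATA_BLOCK_HEADER")
    first_byte = ord(header[0:1])                # ord() also accepts a 1-byte bytes object
    is_last = bool(first_byte >> 7)              # top bit: last-metadata-block flag
    block_type = first_byte & 0x7F               # low 7 bits: BLOCK_TYPE (0 = STREAMINFO)
    length = int.from_bytes(header[1:4], "big")  # 24-bit big-endian length
    if block_type != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    return is_last, block_type, length


# Hypothetical in-memory example: marker + header announcing a 34-byte block
print(read_first_block_header(io.BytesIO(b"fLaC" + bytes([0, 0, 0, 34]))))
# (False, 0, 34)
```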

Pretty easy, right?  The Python ord() function returns the integer value of a one-character (or one-byte) string; it is the inverse of chr(), which turns a number into a character.  Since STREAMINFO is almost never the last block in practice (the format does allow it, in which case the first bit would be 1), the first 8 bits of the METADATA_BLOCK_HEADER should be 0’s for the STREAMINFO block: 0 for the first bit, and 0’s for the next 7 bits.

STREAMINFO BLOCK

So now that we can determine what is and is not a FLAC file, let’s go ahead and start pulling out the information we need.  We only need to tell Google the sample rate of the file; however, since we are already here, why don’t we get as much information as possible?

The STREAMINFO METADATA_BLOCK looks like this:

STREAMINFO METADATA_BLOCK_HEADER

So going through and parsing this block of data is pretty straightforward. Let’s go ahead and just store the information as we go through it.  When it comes to handling raw binary data in Python, struct is a very useful library!  We should use it! The most useful feature will be the unpack() method.  Make sure you consult the Format Characters chart if you’re not sure what they mean.  All numbers are big-endian, so we are using the ‘>’ character before our Format Characters.
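Here is one way the parsing could look, as a sketch rather than the original listing. It assumes the 34-byte STREAMINFO layout from the FLAC specification (16-bit minimum and maximum block size, 24-bit minimum and maximum frame size, then 20 bits of sample rate, 3 bits of channels minus one, 5 bits of bits-per-sample minus one, 36 bits of total samples, and a 16-byte MD5). The function name and dictionary keys are made up for illustration. struct has no 3-byte format character, so the 24-bit fields use int.from_bytes() instead:

```python
import struct


def parse_streaminfo(body):
    """Parse the 34-byte body of a STREAMINFO metadata block into a dict."""
    if len(body) != 34:
        raise ValueError("STREAMINFO body must be exactly 34 bytes")
    # '>HH' = two big-endian unsigned 16-bit integers
    min_block, max_block = struct.unpack(">HH", body[0:4])
    min_frame = int.from_bytes(body[4:7], "big")
    max_frame = int.from_bytes(body[7:10], "big")
    # The next 64 bits pack four fields; unpack them as one '>Q' and shift.
    (packed,) = struct.unpack(">Q", body[10:18])
    return {
        "min_block_size": min_block,
        "max_block_size": max_block,
        "min_frame_size": min_frame,
        "max_frame_size": max_frame,
        "sample_rate": packed >> 44,                      # top 20 bits
        "channels": ((packed >> 41) & 0x7) + 1,           # next 3 bits, stored minus 1
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,   # next 5 bits, stored minus 1
        "total_samples": packed & 0xFFFFFFFFF,            # low 36 bits
        "md5": body[18:34],
    }
```

Only sample_rate is strictly needed for the request to Google; the other fields come along for free.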

That’s all the information we need from the FLAC file.  How about we turn this into a class so that it is easier to use?  We will rename some things and add some error messages so we can catch problems before trying to send the file to Google.
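A compact sketch of such a class might look like the following. FlacReader, FlacError, and the attribute names are placeholders of mine, and the error messages are only examples; the class in the full source code may be organised differently.

```python
import struct


class FlacError(ValueError):
    """Raised when the input does not look like a usable FLAC file."""


class FlacReader:
    """Reads the STREAMINFO block from a binary file-like object."""

    def __init__(self, stream):
        if stream.read(4) != b"fLaC":
            raise FlacError("missing fLaC marker: not a FLAC file")
        header = stream.read(4)
        if len(header) != 4 or header[0] & 0x7F != 0:
            raise FlacError("first metadata block is not STREAMINFO")
        length = int.from_bytes(header[1:4], "big")
        body = stream.read(length)
        if length != 34 or len(body) != 34:
            raise FlacError("STREAMINFO block is truncated or has an unexpected length")
        self.min_block_size, self.max_block_size = struct.unpack(">HH", body[0:4])
        (packed,) = struct.unpack(">Q", body[10:18])
        self.sample_rate = packed >> 44
        self.channels = ((packed >> 41) & 0x7) + 1
        self.bits_per_sample = ((packed >> 36) & 0x1F) + 1
        self.total_samples = packed & 0xFFFFFFFFF

    @property
    def duration(self):
        """Length of the audio in seconds (0.0 if the sample rate is unknown)."""
        if not self.sample_rate:
            return 0.0
        return self.total_samples / self.sample_rate
```

With a real file this would be used along the lines of `reader = FlacReader(open("audio.flac", "rb"))` followed by `reader.sample_rate`, where "audio.flac" is just an example path.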

Perfect, now to get all of the information about a FLAC file we can call the class like this on the file’s object

Now we can pull out the information we need like

Google Speech to Text API Basics

Now that we can get the information we need out of a FLAC file, we can send it to Google for transcription.  There are a couple of endpoints for the Google Speech to Text API; we will be using Google’s full-duplex API.  The full-duplex version has no limit on file size or length, and is what Chrome uses for its fancy Web Speech API.  The only problems with the full-duplex endpoint are that it requires an API key and that it can be a little tricky to use, because Google has not made ANY documentation available for it.  For more details about the other API endpoints, or how to get an API key, see my earlier post about it.

The steps to call Google’s Speech to Text API are

  • Connect the Download stream – https://www.google.com/speech-api/full-duplex/v1/down
    • Parameters:
      • pair
  • Connect the Upload stream –  https://www.google.com/speech-api/full-duplex/v1/up
    • Parameters:
      • key: API Key
      • pair: A random string of letters and numbers used to join the Upload stream to the Download stream
      • lang: What language, e.g. “en-US”
      • continuous: keep the connections open
      • interim: send back information as it becomes available (before final: true)
      • pFilter*: Profanity filter (0: none, 1: some, 2: strict)
      • There are a lot of other options like grammar for specifying a particular grammar engine.  To be honest, I have no idea what grammars are available.
  • Upon successful connection, Google will send back an empty result: {"result":[]}
  • Keep the Download & Upload stream open until Google Finishes responding
  • Google will signal the transcription is finished with a final: true tag in the JSON object
    • Files with long silences in them will have multiple final: true sections returned!
    • Google will also send a response to the Upload stream connection to signal there is nothing else to process
  • Close the Upload Stream
  • Close the Download Stream
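To make the parameter list above concrete, here is a small sketch that builds the two URLs with the standard library. The function name and the default values are my own; continuous and interim are left out because the list does not say what values they take, so treat this as an illustration of the shape of the request rather than a complete one.

```python
from urllib.parse import urlencode

DOWN_URL = "https://www.google.com/speech-api/full-duplex/v1/down"
UP_URL = "https://www.google.com/speech-api/full-duplex/v1/up"


def build_stream_urls(api_key, pair, lang="en-US", profanity_filter=0):
    """Return (download_url, upload_url) sharing the same pair value."""
    # The Download stream only needs the pair value.
    down = DOWN_URL + "?" + urlencode({"pair": pair})
    # The Upload stream carries the key, the same pair, and the options.
    up = UP_URL + "?" + urlencode({
        "key": api_key,
        "pair": pair,
        "lang": lang,
        "pFilter": profanity_filter,  # 0: none, 1: some, 2: strict
    })
    return down, up
```

The important detail is that both URLs carry the same pair value, since that is how the two connections get matched up.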

The tricky part is getting the Download and Upload streams to function simultaneously and asynchronously. Sounds like a perfect job for Python Threads and the incredibly useful Requests library!

Let’s go ahead and start with putting this into a Class as well.  We will need to store the result in an array in case our audio file has a large gap of silence between audio, as Google will send multiple transcriptions for each part.  We can use our fLaC_Reader class to pull out the information we need as well.  For the Upstream, we also need to set a ‘Content-Type’ header with the value of ‘audio/x-flac; rate=OUR SAMPLE RATE’.  If the rate does not match the file, Google will send back some very strange results, as it will not process the audio correctly.
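As a tiny illustration of that header (the helper name is hypothetical), the sample rate read from STREAMINFO is simply formatted into the Content-Type value:

```python
def upstream_headers(sample_rate):
    """Build the headers for the Upload stream from the file's sample rate."""
    return {"Content-Type": "audio/x-flac; rate=%d" % sample_rate}


print(upstream_headers(44100))
# {'Content-Type': 'audio/x-flac; rate=44100'}
```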

Some other functions that we need in the class to make it work are
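A pair generator can be very small. In this sketch the name make_pair and the length of 16 characters are my assumptions; all that is described here is a random string of letters and numbers.

```python
import random
import string


def make_pair(length=16):
    """Return a random string of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))
```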

This generates a random pair value for us to use.  The next one we need is the one that yields the data for the Upstream
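A sketch of such a generator is below. It assumes the caller supplies the chunk size and an is_idle() callable that reports whether the download side has been quiet for long enough; those names, the keep-alive payload as a bytes string of eight ASCII zeros, and the short pause between keep-alives are all placeholders of mine.

```python
import time


def upstream_chunks(audio, chunk_size, is_idle, keepalive=b"00000000", pause=0.1):
    """Yield the audio file in chunks, then keep-alive data until is_idle()."""
    # First, stream the real audio data.
    while True:
        chunk = audio.read(chunk_size)
        if not chunk:
            break
        yield chunk
    # Then keep the connection open with dummy data until the caller
    # reports that no more results are arriving.
    while not is_idle():
        yield keepalive
        time.sleep(pause)
```

Because it is a generator, it can be handed to Requests as the request body, which sends it with chunked transfer encoding.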

It goes through the file in manageable blocks based on the BlockSize and bitsperSample, making it a multiple of 8.  Once the upload is done, we can’t easily check the Upstream for a response from Google, so we use the interim results coming back on the Downstream to gauge when Google is done with the file. If we don’t hear anything for 2 seconds after Google has started sending results, it’s pretty safe to assume we’re done.  In the meantime, we need to keep the Upstream open by sending Google dummy data. In this case it’s just a string of 8 zeros.

The last helper function we need is one to help us check if a response is a final:true response.
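One possible shape for that helper, written as a standalone function rather than a method (is_final is a placeholder name), parses a single response line as JSON and looks for a result marked final:

```python
import json


def is_final(response_line):
    """Return True if a response line contains a result with final: true."""
    if isinstance(response_line, bytes):
        response_line = response_line.decode("utf-8")
    try:
        payload = json.loads(response_line)
    except ValueError:
        return False  # not JSON (for example an empty keep-alive line)
    if not isinstance(payload, dict):
        return False
    return any(
        isinstance(result, dict) and result.get("final") is True
        for result in payload.get("result", [])
    )
```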

Okay.  Now on to controlling the threading and the Upload and Download streams.  How about a function aptly named start()? This will kick off the threads for the Upload and Download streams. It will also close the threads once the streams are finished.  We use a separate Requests Session for each thread, as a Session is not thread-safe, although sharing one Session between the Upstream and Downstream usually works.  We can also skip the whole hassle of sending the file to Google if it’s not a valid FLAC file.

The stop() function is simple.  We call join() on both of the threads.  join() simply blocks until the thread has finished.  The rest of the Python script waits for join() to return before executing the next line, so we can actually time how long Google takes to transcribe an audio file.
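The start/join pattern can be sketched in isolation like this. The two callables stand in for the real Upload and Download workers, and run_streams is a hypothetical name that folds start() and stop() into one function for brevity:

```python
import threading
import time


def run_streams(upstream, downstream):
    """Run both workers in their own threads and return the elapsed seconds."""
    down_thread = threading.Thread(target=downstream, name="downstream")
    up_thread = threading.Thread(target=upstream, name="upstream")
    started = time.monotonic()
    # Open the Download side first so it is listening when the upload begins.
    down_thread.start()
    up_thread.start()
    # join() blocks until each worker function has returned.
    up_thread.join()
    down_thread.join()
    return time.monotonic() - started
```

Calling join() on a thread that was never started raises RuntimeError, so both start() calls have to happen before either join().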

Okay, now for the upstream process.  Using requests, it’s pretty simple.   We don’t need to capture any of the results of the upstream connection, but we should automatically retry if the connection is unsuccessful:
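Here is a hedged sketch of the retry idea. To keep it self-contained, the function takes the post callable as an argument; with Requests you would pass a Session's post method. The function name, the number of attempts, and the delay are my own choices. It catches OSError, which Requests' own exceptions derive from, and it takes a make_body factory because a generator body cannot be replayed after a failed attempt.

```python
import time


def upload_with_retry(post, url, make_body, headers, attempts=3, delay=1.0):
    """POST a streamed body, retrying when the connection fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # make_body() must return a fresh iterable for every attempt.
            response = post(url, data=make_body(), headers=headers)
            if response.status_code == 200:
                return response
            last_error = RuntimeError("unexpected status %s" % response.status_code)
        except OSError as exc:
            last_error = exc
        if attempt < attempts:
            time.sleep(delay)
    raise last_error
```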

The Downstream is a little more complicated. We are going to mimic the deterministic finite automaton that Chrome uses for interacting with the Speech API, but in a much more basic and crude way.  We need to know which state we are in: 1. not connected, 2. successfully connected, 3. received a second empty response after a successful connection, 4. started receiving responses, or 5. received a final result.  With the stream=True parameter we can access the responses as they arrive with the iter_lines() method.  We store each response in the self.response variable, as that is what the self.final() function checks. If it’s a final result we add it to our results and keep going. The Downstream keeps listening for responses until the Upstream is closed, and the Upstream only closes once the Downstream stops receiving responses.  So every time we get a result back from Google, we reset the timer to keep the Upload stream going.  If we don’t hear anything for 2 seconds, the Upstream is closed and the Downstream finishes. Pretty cool, huh?  Just like the Upstream, we also restart the connection if it was unable to connect.
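The state tracking can be sketched without any networking by feeding it an iterable of lines, which is what iter_lines() would provide. The State names, the consume_downstream function, and the on_activity callback (standing in for the timer reset) are all placeholders of mine, and the reconnect logic is left out:

```python
import json
from enum import Enum


class State(Enum):
    NOT_CONNECTED = 1
    CONNECTED = 2      # first empty result received
    SECOND_EMPTY = 3   # second empty result after connecting
    RECEIVING = 4      # interim results are arriving
    FINAL = 5          # at least one final result received


def consume_downstream(lines, on_activity=lambda: None):
    """Walk the response lines and return (last_state, final_results)."""
    state = State.NOT_CONNECTED
    finals = []
    for raw in lines:
        if not raw:
            continue  # skip empty keep-alive lines
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        results = json.loads(text).get("result", [])
        if not results:
            if state is State.NOT_CONNECTED:
                state = State.CONNECTED
            elif state is State.CONNECTED:
                state = State.SECOND_EMPTY
            continue
        on_activity()  # e.g. reset the 2-second idle timer
        if any(result.get("final") for result in results):
            finals.append(text)
            state = State.FINAL
        else:
            state = State.RECEIVING
    return state, finals
```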

 

Well, that is all you need to have audio files transcribed by Google!  Let’s write a quick script that utilizes our two new classes!

I want to time how long it takes so I’m going to write a quick Class for timing code:
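A minimal version of such a timing class, written here as a context manager (the class name and the elapsed attribute are my own choices), could be:

```python
import time


class Timer:
    """Context manager that records how long its block took, in seconds."""

    def __enter__(self):
        self.elapsed = None
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        return False  # do not swallow exceptions raised inside the block
```

Wrapping the transcription call in `with Timer() as t:` and reading `t.elapsed` afterwards gives the running time.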

I will also need to add a couple more imports to make everything work at the top.

Now I can simply open a file, start the transcription, and time it!

The output looks like this:

 

Well, I hope you found this tutorial enlightening.  Special thanks go out to the Chromium Project for making all of their code available 🙂

 

Here is the complete source code / working example (minus an API key).

 

Happy Transcribing!

Travis Payton

Travis Payton is a computer scientist and Japanese scholar who enjoys programming, video games, and living life. He currently works at the University of Alaska Fairbanks and does freelance programming and translation work on the side.

  15 comments for “Python Google Speech to Text API implementation”

  1. searchingfortao
    November 11, 2016 at 6:30 am

    This is a cool project, but you didn’t post a license of the code. Are you cool with me including it in a GPL project?

    • November 11, 2016 at 9:47 am

      I basically just examined Chromium source code and the FLAC documentation. So I didn’t even think about licensing it. Feel free to use it! If anything I’m thinking this would fall under the MIT license as I have no plans to maintain it or develop it passed the proof of concept. I would be flattered with just a mention in a comment if this helped you in any significant way. Thanks for asking!

  2. October 7, 2015 at 12:53 am

    Hi Travis Payton
    Below are the issues which i faced while transcripting an audio file of 124 sec.
    -No handlers could be found for logger “main
    -RuntimeError: cannot join thread before it is started
    -ValueError: I/O operation on closed file
    -NameError: global name ‘RuntimeException’ is not defined

    I could get some transcripts but not complete..its only the half and found many repeated transcriptions

    • Shub
      December 25, 2015 at 6:17 am

      Hey Raghuvaran , I too am getting the same errors . Were you able to find any solution for the same . I am unable to get any STT operation done .
      Travis Please help.

  3. October 6, 2015 at 10:39 pm

    Excellent work..! its awesome ! Its working well…. Thanks!!!

  4. July 18, 2015 at 3:11 am

    You have a wonderful article! But, for some reason, the script does not want translate files over a minute long 🙁

    What do you think, why?

  5. Anonymous
    May 14, 2015 at 11:42 am

    Good tutorial. But I have one problem: your code don’t recognize words uft-8 from flac files. How I solve this problem? I MUST solution for academic project. Best Regards.

    • romao
      May 15, 2015 at 2:17 am

      I added # -*- coding: utf-8 -*- on first line your code. I tried print “accents:áéíóúãõç” and worked right. But results from Google with words utf-8 didn’t fix.

      My flac I had: “o meu carro é amarelo” (pt-PT) equals “my car is yellow” (en-US)
      Results Google Speech: “[‘{“result”:[{“alternative”:[{“transcript”:”o meu carro \xc3\xa9 amarelo”,”confidence”:0.71846169}],”final”:true}],”result_index”:0}’]”

      I saw this table: http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&utf8=string-literal&unicodeinhtml=hex and \xc3\xa9 corresponds “é”. But this method isn’t good practice. It is possible fix this problem?

      Best Regards.

      • May 16, 2015 at 1:16 pm

        Hi romao,

        You shouldn’t need the # -*- coding: utf-8 -*- on the first line. The library is handling all of the unicode correctly; it’s just that the final output function I wrote wasn’t really set up for Unicode or for anything other than a quick glance at the contents of the audio. The reason it is printing the hex values of the unicode characters is that it is actually printing a list / array of strings. Google will send back multiple results if there are large gaps of silence in your audio; these separate results are each stored in the result array. If you print a single element from the list then you’ll see that the unicode is displayed correctly, i.e. print result.result[0]. Here is a cleaner way to get the final output as a JSON object in case you still need to manipulate it later, or would like to be able to read it because it contains unicode:

        Simply add / update this at the bottom of the main function

        • romao
          May 19, 2015 at 10:03 am

          Hi again. I tried your code but didn’t work in me. I deleted # -*- coding: utf-8 -*- on the first line and your code didn’t work. I MUST use # -*- coding: utf-8 -*- on the first line and I fixed Unicode words in script PHP I created.

          • romao
            May 19, 2015 at 10:37 am

            Sorry. I tried again and your code fixed Unicode words, but ONLY works with adding # -*- coding: utf-8 -*- on the first line. Thanks a lot.

  6. alexrbigelow
    March 19, 2015 at 11:00 am

    Holy crap! Small world! I just finished skimming this and had no idea who had written it!

    I was so blown away by the truly rare balance of being thorough and yet easy-to-read, I had to check out who the author was… lo and behold, not only do I know this dude, it’s my father!

    (strange background on what led me here: I’m trying to transcribe the audio from all the episodes of The Joy of Painting for a data mining project…)

    • Tom
      March 17, 2016 at 4:27 am

      I’m curious if he is your biological father, because his last name is different from your’s and looking at a photo of him he looks younger than you.

      • March 17, 2016 at 4:49 am

        Sorry, my comment is probably confusing… we were LDS missionaries together in Otaru, Japan. He was my first companion (missionaries are always in companionships of two or three), and, as they are largely responsible for training you, your first companion always makes a big impression on how you serve as a missionary. We would always joke about your trainer being your “father” because the influence is so profound.

        I was really, really lucky to have Travis as a trainer—he was really well-known in the mission for loving the people he served unconditionally, and I did my best to emulate that.

        It’s pure coincidence (hence my shock) that we both happen to work on the same sort of problems professionally.

        • sutekidayo
          March 17, 2016 at 4:00 pm

          Awww Thanks! That really made my day this morning.
