Python Google Speech to Text API implementation

Using Google's Speech to Text API with Python to transcribe audio files

This seems to be a constant request on Stack Overflow, and since documentation for Google’s Speech API is practically non-existent, I have decided to share an implementation of it with everyone. If you just want the source code, here you go.

Google Speech API Supported File Types

First off, your audio must be encoded in the FLAC audio format for Google’s Speech API to accept it. We will not be transcoding audio in the Python script, so you will have to do it beforehand. If you need an easy-to-use tool to convert your audio files, give fre:ac a try. It is a free, open-source converter for Windows, Mac OS X, Linux, and FreeBSD. Alright, now on to the good stuff.

FLAC Basics

It is really useful to be able to pull some information out of our FLAC files with the Python script so that we don’t have to worry about a third-party library or application. Luckily for us, FLAC is an open-source format with really clear specifications. The only information we really need to extract can all be found in the STREAMINFO METADATA_BLOCK. The basic diagram of what we need out of the FLAC file looks like this:

fLaC file overview

The first METADATA_BLOCK is ALWAYS the STREAMINFO block. The first bit in the METADATA_BLOCK_HEADER simply marks whether this METADATA_BLOCK is the last one. The next 7 bits mark the BLOCK_TYPE, and the last 24 bits give the length of the metadata that follows the header.

METADATA_BLOCK_HEADER

So that should be easy enough to at least get started with confirming that the file we send to our script is actually a FLAC file. Let’s see if we can start reading some info from FLAC files:
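Something along these lines works (the function name is_flac is just for illustration, not the post’s exact code):

def is_flac(path):
    """Rough check that a file is a FLAC file whose first metadata
    block is STREAMINFO, per the layout described above."""
    with open(path, 'rb') as f:
        # The stream always starts with the 4-byte marker "fLaC".
        if f.read(4) != b'fLaC':
            return False
        # First byte of the METADATA_BLOCK_HEADER:
        # bit 0 = last-metadata-block flag, bits 1-7 = BLOCK_TYPE.
        # For the mandatory first STREAMINFO block this whole byte is 0.
        return ord(f.read(1)) == 0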

Pretty easy, right? The Python ord() function returns the integer value of a one-character byte string; it is the inverse of chr(), which turns a number into an ASCII character. Since STREAMINFO is never the last block, the first 8 bits of the METADATA_BLOCK_HEADER should all be 0 for the STREAMINFO block: 0 for the last-block flag, and 0s for the 7 BLOCK_TYPE bits.

STREAMINFO BLOCK

So now that we can determine what is and is not a FLAC file, let’s go ahead and start pulling out the information we need. We only need to tell Google the sample rate of the file; however, since we are already here, why don’t we grab as much information as possible?

The STREAMINFO METADATA_BLOCK looks like this:

STREAMINFO METADATA_BLOCK_HEADER

So going through and parsing this block of data is pretty straightforward. Let’s go ahead and just store the information as we work through it. When it comes to handling raw binary data in Python, struct is a very useful library! We should use it! The most useful feature will be the unpack() method. Make sure you consult the Format Characters chart if you’re not sure what the format strings mean. All numbers are big-endian, so we are using the '>' character before our format characters.
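Something along these lines does the job (I’m assuming a file called test.flac, and the variable names are just illustrative):

import struct

with open('test.flac', 'rb') as f:
    f.seek(8)  # skip the 4-byte "fLaC" marker and the 4-byte METADATA_BLOCK_HEADER

    # STREAMINFO starts with four block/frame-size fields.
    minblocksize, maxblocksize = struct.unpack('>HH', f.read(4))
    minframesize = struct.unpack('>I', b'\x00' + f.read(3))[0]  # 24-bit value, padded to 32 bits
    maxframesize = struct.unpack('>I', b'\x00' + f.read(3))[0]

    # The next 8 bytes pack: sample rate (20 bits), channels-1 (3 bits),
    # bits-per-sample-1 (5 bits), and total samples (36 bits).
    packed, = struct.unpack('>Q', f.read(8))
    samplerate    = packed >> 44
    channels      = ((packed >> 41) & 0x7) + 1
    bitspersample = ((packed >> 36) & 0x1F) + 1
    totalsamples  = packed & 0xFFFFFFFFF

    md5 = f.read(16)  # MD5 of the unencoded audio data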

That’s all the information we need from the FLAC file, so how about we turn this into a class to make it easier to use? We will rename some things and add some error messages so we can catch problems before trying to send the file to Google.
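Here is a rough sketch of such a class. I’m using the name fLaC_Reader, which we will refer to later; the attribute names and error messages are my own choices:

import struct

class fLaC_Reader(object):
    """Reads the STREAMINFO block of a FLAC file and exposes its fields as attributes."""

    def __init__(self, file):
        self.file = file
        self.error = None
        self.read_header()

    def read_header(self):
        self.file.seek(0)
        if self.file.read(4) != b'fLaC':
            self.error = 'File is not a valid FLAC file'
            return
        if ord(self.file.read(1)) != 0:
            self.error = 'First metadata block is not STREAMINFO'
            return
        self.file.read(3)  # 24-bit block length; always 34 for STREAMINFO

        # Same field parsing as the snippet above.
        self.minblocksize, self.maxblocksize = struct.unpack('>HH', self.file.read(4))
        self.minframesize = struct.unpack('>I', b'\x00' + self.file.read(3))[0]
        self.maxframesize = struct.unpack('>I', b'\x00' + self.file.read(3))[0]

        packed, = struct.unpack('>Q', self.file.read(8))
        self.samplerate    = packed >> 44
        self.channels      = ((packed >> 41) & 0x7) + 1
        self.bitspersample = ((packed >> 36) & 0x1F) + 1
        self.totalsamples  = packed & 0xFFFFFFFFF
        self.md5 = self.file.read(16)

        self.file.seek(0)  # rewind so the file can be uploaded afterwards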

Perfect. Now, to get all of the information about a FLAC file, we can call the class on the file’s object like this:
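For example (again assuming a file called test.flac):

file = open('test.flac', 'rb')
flac = fLaC_Reader(file)
if flac.error:
    print(flac.error)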

Now we can pull out the information we need like so:
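A quick sketch of reading the attributes (the duration calculation is just a bonus):

print('Sample Rate:     %d Hz' % flac.samplerate)
print('Channels:        %d' % flac.channels)
print('Bits per Sample: %d' % flac.bitspersample)
print('Total Samples:   %d' % flac.totalsamples)
print('Duration:        %0.2f seconds' % (float(flac.totalsamples) / flac.samplerate))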

Google Speech to Text API Basics

Now that we can get the information we need out of a FLAC file, we can send it to Google for transcription. There are a couple of endpoints for the Google Speech to Text API; we will be using Google’s full-duplex API. The full-duplex version does not have a limit on file size or length, and it is what Chrome uses for its fancy Web Speech API. The only catches with the full-duplex endpoint are that it requires an API key and that it can be a little tricky to use, since Google has not made ANY documentation available for it. For more details about the other API endpoints, or how to get an API key, see my earlier post about it.

The steps to call Google’s Speech to Text API are as follows (a rough sketch of the resulting URLs appears after the list):

  • Connect the Download stream – https://www.google.com/speech-api/full-duplex/v1/down
    • Parameters:
      • pair: The same random string sent with the Upload stream (see below)
  • Connect the Upload stream –  https://www.google.com/speech-api/full-duplex/v1/up
    • Parameters:
      • key: API Key
      • pair: A random string of letters and numbers used to pair this connection with the Download stream
      • lang: The language to transcribe, e.g. "en-US"
      • continuous: Keep the connection open
      • interim: Send back results as they become available (before final: true)
      • pFilter: Profanity filter (0: none, 1: some, 2: strict)
      • There are a lot of other options like grammar for specifying a particular grammar engine.  To be honest, I have no idea what grammars are available.
  • Upon successful connection, Google will send back an empty result: {"result":[]}
  • Keep the Download & Upload streams open until Google finishes responding
  • Google will signal the transcription is finished with a final: true tag in the JSON object
    • Files with long silences in them will have multiple final: true sections returned!
    • Google will also send a response to the Upload stream connection to signal there is nothing else to process
  • Close the Upload Stream
  • Close the Download Stream
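To make the list above concrete, here is roughly how the two URLs can be built. API_KEY is a placeholder, the 16-character pair length is a guess at what Chrome sends, and continuous and interim are passed as bare flags, so treat the exact query string as an approximation:

import random
import string

API_KEY = 'YOUR-API-KEY-HERE'   # placeholder; see the earlier post on getting a key

# The pair value just has to be the same on both connections.
pair = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16))

down_url = 'https://www.google.com/speech-api/full-duplex/v1/down?pair=' + pair
up_url = ('https://www.google.com/speech-api/full-duplex/v1/up?'
          'key=%s&pair=%s&lang=en-US&continuous&interim&pFilter=0'
          % (API_KEY, pair))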

The tricky part is getting the Download and Upload streams to function simultaneously and asynchronously. Sounds like a perfect job for Python Threads and the incredibly useful Requests library!

Let’s go ahead and start by putting this into a class as well. We will need to store the results in an array in case our audio file has a large gap of silence in it, as Google will send a separate transcription for each part. We can use our fLaC_Reader class to pull out the information we need as well. For the Upstream, we also need to set a ‘Content-Type’ header with the value ‘audio/x-flac; rate=OUR SAMPLE RATE’. If the rate does not match the file, Google will send back some very strange results, as the audio is not decoded correctly.
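Here is a sketch of what the class setup might look like. The class name Transcriber, the attribute names, and the lang default are my own choices, not necessarily what the original code uses:

import json
import random
import string
import threading
import time

import requests

class Transcriber(object):
    """Streams a FLAC file to Google's full-duplex Speech API and collects the final results."""

    def __init__(self, file, api_key, lang='en-US'):
        self.file = file
        self.api_key = api_key
        self.lang = lang
        self.flac = fLaC_Reader(file)        # sample rate, bits per sample, etc.
        self.results = []                    # one entry per final: true response
        self.response = None                 # most recent line from the Downstream
        self.last_response_time = None       # used for the 2-second silence timeout
        self.upstream_headers = {
            # The rate MUST match the file, or Google decodes the audio incorrectly.
            'Content-Type': 'audio/x-flac; rate=%d' % self.flac.samplerate,
        }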

Some other functions that we need in the class to make it work are:
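First, a sketch of the pair generator (the method name gen_pair is my own):

    # (method of the Transcriber class sketched above)
    def gen_pair(self):
        # A random pair token of upper-case letters and digits.
        return ''.join(random.choice(string.ascii_uppercase + string.digits)
                       for _ in range(16))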

This generates a random pair value for us to use. The next one we need is the one that yields the data for the Upstream:
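A sketch of that generator, under the assumptions laid out in the next paragraph (a real implementation would also want an overall timeout in case Google never responds):

    # (method of the Transcriber class sketched above)
    def gen_data(self):
        # Chunk size derived from the FLAC block size and sample width,
        # rounded down to a multiple of 8 bytes.
        chunk = self.flac.maxblocksize * (self.flac.bitspersample // 8)
        chunk -= chunk % 8
        self.file.seek(0)
        while True:
            data = self.file.read(chunk)
            if not data:
                break
            yield data
        # File fully uploaded: keep the connection alive with dummy data
        # until Google has been quiet for about 2 seconds.
        while True:
            if self.last_response_time and time.time() - self.last_response_time > 2:
                return
            yield b'00000000'
            time.sleep(0.25)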

It goes through the file in manageable chunks sized from the BlockSize and bitsPerSample, rounded to a multiple of 8. Once it is done uploading, since we can’t easily check the Upstream for a response from Google, we use the interim results coming back on the Downstream to gauge when Google is done with the file. If we don’t hear anything for 2 seconds after Google has started sending results, it’s pretty safe to assume we’re done. In the meantime, we need to keep the Upstream open by sending Google dummy data; in this case it’s just a string of 8 zeros.

The last helper function we need is one to help us check if a response is a final:true response.
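A minimal sketch of that check, assuming responses look like {"result":[{"final":true, ...}]}:

    # (method of the Transcriber class sketched above)
    def final(self):
        """True if the most recent Downstream line is a final: true result."""
        if not self.response:
            return False
        try:
            result = json.loads(self.response)['result']
            return bool(result) and result[0].get('final', False)
        except (ValueError, KeyError, IndexError):
            return False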

Okay. Now to controlling the threading and the Upload and Download streams. How about a function aptly named start()? This will kick off the threads for the Upload and Download streams, and it will also close the threads once the streams are finished. We are using a separate Requests Session for each thread, as a Session is not thread-safe (although it usually works if you use only one Session for both the Upstream and Downstream). We can also skip the whole hassle of sending the file to Google if it’s not a valid FLAC file.
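A sketch of start(), using the attributes and method names from the earlier sketches:

    # (method of the Transcriber class sketched above)
    def start(self):
        # Skip contacting Google if the file isn't valid FLAC.
        if self.flac.error:
            print(self.flac.error)
            return
        self.pair = self.gen_pair()
        # A separate Session per thread, since Sessions are not thread-safe.
        self.up_session = requests.Session()
        self.down_session = requests.Session()
        self.up_thread = threading.Thread(target=self.upstream)
        self.down_thread = threading.Thread(target=self.downstream)
        self.down_thread.start()   # listen first so we don't miss the empty result
        self.up_thread.start()
        self.stop()                # wait for both streams to finish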

The stop() function is simple: we call join() on both of the threads. join() simply waits for each thread to finish, and the rest of the Python script will wait for the join() calls to return before executing the next line. So we can actually time how long Google takes to transcribe an audio file.
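Which is just:

    # (method of the Transcriber class sketched above)
    def stop(self):
        # join() blocks until each thread has finished, so anything after
        # stop() only runs once the transcription is complete.
        self.up_thread.join()
        self.down_thread.join()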

Okay, now for the upstream process.  Using requests, it’s pretty simple.   We don’t need to capture any of the results of the upstream connection, but we should automatically retry if the connection is unsuccessful:
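A sketch of the Upstream, retrying on connection errors; the exact query string is the same approximation as before:

    # (method of the Transcriber class sketched above)
    def upstream(self):
        url = ('https://www.google.com/speech-api/full-duplex/v1/up?'
               'key=%s&pair=%s&lang=%s&continuous&interim'
               % (self.api_key, self.pair, self.lang))
        while True:
            try:
                # gen_data() is a generator, so requests streams the body
                # with chunked transfer encoding.
                self.up_session.post(url, headers=self.upstream_headers,
                                     data=self.gen_data())
                return
            except requests.exceptions.ConnectionError:
                continue   # retry until the connection succeeds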

The Downstream is a little more complicated. We are going to be mimicking the deterministic finite automaton that Chrome uses for interacting with the Speech API, but in a much more basic and crude way. We need to know what state we are in: 1. not connected, 2. successfully connected, 3. a second empty response received after a successful connection, 4. receiving responses, and 5. a final result received. With the stream=True parameter we can access the responses as they arrive using the iter_lines() method. We store each response in the self.response variable, as that is what the self.final() function checks; if it’s a final result, we add it to our results and keep going. Our Downstream will keep listening for responses until the Upstream is closed, and the Upstream will only close when the Downstream quits receiving responses. So every time we get a result back from Google, we reset the timer to keep the Upload stream going. If we don’t hear anything for 2 seconds, the Upstream is closed and the Downstream finishes. Pretty cool, huh? Just like the Upstream, we are also going to restart the connection if it was unable to connect.
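Here is a much cruder sketch of that loop. It only really tracks “got a response” and “got a final result” (plus the reconnect), rather than all five states, and the JSON handling is an assumption:

    # (method of the Transcriber class sketched above)
    def downstream(self):
        url = ('https://www.google.com/speech-api/full-duplex/v1/down?pair=%s'
               % self.pair)
        while True:
            try:
                r = self.down_session.get(url, stream=True)
                for line in r.iter_lines():
                    if not line:
                        continue
                    self.response = line.decode('utf-8')
                    self.last_response_time = time.time()  # reset the 2-second timer
                    if self.final():
                        self.results.append(json.loads(self.response))
                # iter_lines() ends when Google closes the stream, which
                # happens once the Upstream connection has been closed.
                return
            except requests.exceptions.ConnectionError:
                continue   # reconnect, just like the Upstream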

 

Well, that is all you need to have audio files transcribed by Google!  Let’s write a quick script that utilizes our two new classes!

I want to time how long it takes, so I’m going to write a quick class for timing code:
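Something like this minimal context manager does the trick (the original post’s timer may look different):

import time

class Timer(object):
    """Minimal context manager for timing a block of code."""
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *exc):
        self.interval = time.time() - self.start
        print('Finished in %0.3f seconds' % self.interval)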

I will also need to add a couple more imports at the top to make everything work.

Now I can simply open a file, start the transcription, and time it!
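Putting it together, assuming everything above lives in one script, and assuming the final results carry "alternative" and "transcript" keys in the JSON (print whatever your results actually contain):

if __name__ == '__main__':
    with open('test.flac', 'rb') as f:
        speech = Transcriber(f, api_key=API_KEY, lang='en-US')
        with Timer():
            speech.start()   # blocks until the transcription is finished
        for result in speech.results:
            # Each final result holds a list of alternatives; print the top one.
            print(result['result'][0]['alternative'][0]['transcript'])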

The output looks like this:

 

Well, I hope you found this tutorial enlightening.  Special thanks go out to the Chromium Project for making all of their code available 🙂

 

Here is the complete source code / working example (minus an API key).

 

Happy Transcribing!

Travis Payton