
Google Speech to Text API: A Step-by-Step Tutorial

9 Minute Read · December 3, 2018 · by Derek Pankaew

Google has one of the most accurate speech-to-text APIs. It sports an accuracy rate of approximately 94%, at a cost of $1.44 per hour of audio.

Using Google’s API can be a bit tricky, as there’s a lot of moving parts, and the documentation can be confusing. This tutorial will show you how to use Google Speech from start to finish.

Introduction to Google Speech

Google Speech is the de facto leader in the speech-to-text API space. Out of all the APIs on the market, Google has the highest accuracy rating. In addition, Google also has most of the features developers want:

  • Human readable punctuation,
  • Timestamps on a per-word basis,
  • Confidence intervals on a per-word basis,
  • Differentiation for multiple speakers,
  • Support for multiple languages,
  • Streaming and real-time transcription support.

The primary downside to Google Speech is its cost: $1.44 per hour, which is higher than IBM Watson ($0.75) and Microsoft ($1). Basically, Google Speech is for developers who are willing to pay a little more for better functionality.

Without further ado, let’s jump into the tutorial.

Note – If you already have a Google Cloud account, feel free to skip to Step 4.

Setup Step 1: Create a Google Cloud account

Google’s Speech API falls under the Google Cloud umbrella. You cannot use Google Speech without a Google Cloud account. In a way, Google views its suite of AI products as a way to “hook” developers into using Google Cloud.

Even if you just want to use their speech-to-text technology, you still need to use Google Cloud.

You can create an account here:

http://cloud.google.com


Setup Step 2: Create a Project

Google Cloud accounts are segmented into Projects. A project is essentially a container that stores everything related to one application.
A project contains things like:
  • Billing information,
  • Authentication information,
  • Storage buckets,
  • Uploaded files,
  • API keys.

These are all connected to the project rather than to your account. An account can have multiple projects, each with its own API keys and billing information.

This makes it easier to assign users and permissions, depending on the project. A developer for Project A doesn’t need access to Project B.

So, to get started, create a project from Google Cloud Console:

http://console.cloud.google.com

Setup Step 3: Generate an API Key

You’ll need an API key to access Google Speech. Once you’ve created a project, go to APIs & Services > Credentials:

Then select the type of API key you want to create. For this tutorial, we’ll use the Service Account Key option. 

With a Service Account Key, Google will create a JSON file with your credentials, which you can reference once your code is up and running.

This is what the exported key looks like:
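The field names below are standard for Google service account keys; all values here are redacted placeholders, so don’t copy them literally:

```json
{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "REDACTED",
  "private_key": "-----BEGIN PRIVATE KEY-----\nREDACTED\n-----END PRIVATE KEY-----\n",
  "client_email": "your-service-account@your-project-id.iam.gserviceaccount.com",
  "client_id": "REDACTED",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token"
}
```

Keep this file out of version control; anyone holding it can bill your project.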

Setup Step 4: Create a Google Bucket to Store Audio Files

One thing that isn’t well documented is that, if you want to transcribe longer files, you must store them in Google Cloud Storage. This tripped me up for a while.

Google Cloud Storage references files with its own “gs://” URI scheme. If your audio is longer than 1 minute, you must use Google Cloud Storage to store your files. (Documentation).

Navigate to “Storage” in your Google Cloud Console:

Then, create a bucket. A bucket is essentially a group of files. Think of it as a folder.

Important Note: Storing files in Buckets costs money. Remember to delete your files after running your speech-to-text.

Setup Step 5: Convert Your Audio to Mono Wav 16k+

Google Speech to Text only accepts sound files of a specific format. Your file must be:

  • A .wav file,
  • A single channel of audio (mono),
  • Encoded at a sample rate of 16,000+ hertz.

You can convert your sound file using whatever software you want. Audacity has a nice user interface, or you can do it via ffmpeg through command line. To convert your file via ffmpeg, use:

$ ffmpeg -i input.mp3 -vn -ar 16000 -ac 1 -f wav output.wav

Note: you must have ffmpeg installed to use it via command line.

Setup Step 6: Upload Your File to a Google Bucket

Upload your file to your Google Bucket. You can do this programmatically using the Google Bucket API. To keep things simple, for this guide we’ll just upload a file manually.
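If you do later want to script the upload, here’s a hedged sketch using the @google-cloud/storage package (installed separately with `npm install @google-cloud/storage`); the bucket name and key path below are placeholders, not real values:

```javascript
// Construct the gs:// URI for an uploaded object, in the format the
// Speech API expects.
function gcsUri(bucketName, fileName) {
  return `gs://${bucketName}/${fileName}`;
}

// Upload a local file to a bucket. Remember that stored files cost money,
// so delete the object once transcription is done.
async function uploadAudio(bucketName, localPath, remoteName) {
  const { Storage } = require('@google-cloud/storage');
  const storage = new Storage({ keyFilename: './your-key-file.json' }); // JSON key from Step 3
  await storage.bucket(bucketName).upload(localPath, { destination: remoteName });
  console.log(`Uploaded: ${gcsUri(bucketName, remoteName)}`);
  // Clean up when you're finished transcribing:
  // await storage.bucket(bucketName).file(remoteName).delete();
}
```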

Once your file is uploaded, you’ll need to set the file’s permissions to be publicly accessible (documentation). To do this, click the three dots next to your file, and select “Edit Permissions”:

In the following screen, select “User” for Entity, add allUsers for name, and “Reader” for access:

Once your file permissions are set, you’ll need to construct your uri path. Your uri path is in the following format:

gs://BUCKET_NAME/FILENAME.wav

For example, if your bucket is “speechtotext” and your file is “file.wav”, your uri would be:

gs://speechtotext/file.wav


Setup Step 7: Enable the Speech to Text API

By default, the speech-to-text API is disabled. If you attempt to run a speech recognition instance, you’ll receive an error.

To enable this API, go to “APIs & Services” > Library > Cloud Speech API. Click “Enable”.

API Step 1: Download the Bindings

Locate the Google Speech client library for your language; Google provides official libraries for NodeJS, Python, Java, Go, and several others.

For the rest of this tutorial, we’ll use NodeJS. Most of the concepts will be similar across languages.

To install our bindings, we’ll use:

npm install --save @google-cloud/speech

API Step 2: Adapt the Template Code

Finally, we’re ready to get into the code. Download one of the example code templates from Google.

For this tutorial, we’ll adapt the sample NodeJS code. This is our starting point:

// Imports the Google Cloud client library
const speech = require('@google-cloud/speech');
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

// The name of the audio file to transcribe
const fileName = './resources/audio.raw';

// Reads a local audio file and converts it to base64
const file = fs.readFileSync(fileName);
const audioBytes = file.toString('base64');

// The audio file's encoding, sample rate in hertz, and BCP-47 language code
const audio = {
  content: audioBytes,
};
const config = {
  encoding: 'LINEAR16',
  sampleRateHertz: 16000,
  languageCode: 'en-US',
};
const request = {
  audio: audio,
  config: config,
};

// Detects speech in the audio file
client
  .recognize(request)
  .then(data => {
    const response = data[0];
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log(`Transcription: ${transcription}`);
  })
  .catch(err => {
    console.error('ERROR:', err);
  });

This code won't work yet: it isn't set up with your credentials, and the default code only handles local files of 1 minute or less.

Let's go ahead and customize the code to transcribe files of any length.

API Step 3: Add Your Credentials

Move your JSON credentials from Step 3 to your project folder.

Next, you’ll need to tell Google where this JSON file is. Documentation here.

There are two ways to do it:

  • 1) In the command line, specify the path of the JSON file. You can do this using:
    • export GOOGLE_APPLICATION_CREDENTIALS="[PATH]"
  • 2) Alternatively, you can pass the path of the JSON file as an object within your code.

I couldn’t get Method #1 to work. I’m still not sure why. Method #2 works, though, like this:

// Creates a client
const client = new speech.SpeechClient({
  projectId: 'SpeechToText2',
  keyFilename: '/Users/derekpankaew/Dropbox/Javascript Programming/tutorials_GoogleSTT/SpeechToText2-a9e5a5e334ab.json'
});


Pass in your projectId and the location of the JSON key file we saved in Step 3.

API Step 4: Customize Your Settings

In the example code, the client parses a local file into a string with Base64 encoding. Essentially, this turns an audio file into plain text, which is then sent as a stream to Google’s server.

This only works for audio files of 1 minute or less. If you try to pass larger files this way, the Google client will throw an error.

So to start, we remove this:

// Reads a local audio file and converts it to base64
const file = fs.readFileSync(fileName);
const audioBytes = file.toString('base64');

// The audio file's encoding, sample rate in hertz, and BCP-47 language code
const audio = {
  content: audioBytes,
};

And replace it with:

// Specify the file location in Google Bucket
const gcsUri = 'gs://speechtotextdemo1209381/5mins.wav';
const audio = {
  uri: gcsUri,
};

Finally, we set up our options to match our sound file:

const config = {
  encoding: 'LINEAR16',
  sampleRateHertz: 16000,
  languageCode: 'en-US',
  enableAutomaticPunctuation: true,
  enableSpeakerDiarization: true
};

Once this is done, we can run our file and get our result:

The result that Google returns is, by default, a wall of text. Adding “enableAutomaticPunctuation” causes Google to put in periods and question marks.

Speaker Diarization is used for multiple speakers, to differentiate between who’s speaking.
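As a hedged sketch of reading the diarization output: with enableSpeakerDiarization, the words array of the last result carried a speakerTag per word in the v1p1beta1 API at the time of writing; check the current response shape before relying on this. Something like:

```javascript
// Group consecutive words by speakerTag into readable per-speaker lines.
function speakerLines(response) {
  const last = response.results[response.results.length - 1];
  const words = last.alternatives[0].words || [];
  const lines = [];
  for (const w of words) {
    const prev = lines[lines.length - 1];
    if (prev && prev.speaker === w.speakerTag) {
      prev.text += ' ' + w.word; // same speaker: extend the current line
    } else {
      lines.push({ speaker: w.speakerTag, text: w.word }); // new speaker turn
    }
  }
  return lines.map(l => `Speaker ${l.speaker}: ${l.text}`);
}
```

Calling `speakerLines(response)` inside the final `.then()` would print one line per speaker turn instead of a single merged transcript.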

Here's our final completed code, after all our modifications. You should be able to run this and get a transcript back:

// Imports the Google Cloud client library
const speech = require('@google-cloud/speech').v1p1beta1;

// Creates a client
const client = new speech.SpeechClient({
  projectId: 'SpeechToText2',
  keyFilename: '/Users/derekpankaew/Dropbox/Javascript Programming/tutorials_GoogleSTT/SpeechToText2-a9e5a5e334ab.json'
});

const gcsUri = 'gs://speechtotextdemo1209381/5mins.wav';
const encoding = 'LINEAR16';
const sampleRateHertz = 16000;
const languageCode = 'en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  enableAutomaticPunctuation: true,
  enableSpeakerDiarization: true
};

const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file. This creates a recognition job that you
// can wait for now, or get its result later.
client
  .longRunningRecognize(request)
  .then(data => {
    const operation = data[0];
    // Get a Promise representation of the final result of the job
    return operation.promise();
  })
  .then(data => {
    const response = data[0];
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log(`Response: ${JSON.stringify(response)}`);
    console.log(`Transcription: ${transcription}`);
  })
  .catch(err => {
    console.error('ERROR:', err);
  });

Wrapping Up

That's Google's Speech to Text API in a nutshell. If you're transcribing short clips (under a minute), you can keep the Base64-encoded content approach; for longer files, upload to a Google Bucket and pass the uri.
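There's also a real-time streaming path via the client's streamingRecognize method. As a hedged sketch (this assumes a short LINEAR16 file and the same config as above; the official samples typically stream from a microphone instead):

```javascript
// Stream a short local LINEAR16 audio file to the API and print transcripts
// as they arrive.
function streamFile(path) {
  const fs = require('fs');
  const speech = require('@google-cloud/speech');
  const client = new speech.SpeechClient();

  const recognizeStream = client
    .streamingRecognize({
      config: {
        encoding: 'LINEAR16',
        sampleRateHertz: 16000,
        languageCode: 'en-US',
      },
      interimResults: false, // set true to see partial results as they form
    })
    .on('error', console.error)
    .on('data', data =>
      console.log(`Transcript: ${data.results[0].alternatives[0].transcript}`)
    );

  fs.createReadStream(path).pipe(recognizeStream);
}
```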

Questions? Comments? Thoughts? Just post in the comments below!