
IBM Watson Speech-to-Text API: a Step by Step Tutorial

8 Minute Read · October 21, 2018 · by Derek Pankaew

IBM Watson was the first AI to dazzle the world by defeating Jeopardy champion Ken Jennings in 2011. After his spectacular defeat, Jennings said: “I for one welcome our new computer overlords.”

Of course, the Watson of today is much more advanced. Today’s IBM Watson runs on a variety of neural networks, and is able to complete a wide range of complex intelligent tasks.

Speech-to-Text is one of those core services. This tutorial will show you how to use IBM’s speech recognition technology from start to finish.

Introduction to IBM Watson Speech-to-Text

Watson’s API is a good choice for those wanting a lower-cost alternative to Google’s or Amazon’s speech recognition APIs.

IBM’s accuracy is slightly lower than its competitors’, and IBM doesn’t support punctuation. Transcripts without punctuation are difficult to read, so most customers are likely using it not for human reading but for programmatic purposes (e.g. data mining, subtitles, etc.)

IBM Watson has one big advantage over Google and Amazon, though: it costs $0.75 per hour, versus Google and Amazon’s $1.44 per hour.

Without further ado, let’s get started.

Setup Step 1: Sign up for IBM Cloud

Using IBM Watson requires that you have an IBM Cloud account. Sign up or log in here:

https://www.ibm.com/watson/services/speech-to-text/

Once you’re registered, navigate to Watson’s cloud interface.

Note: Why did IBM create both “Bluemix” and IBM Cloud? I have no idea. They seem to jump back and forth between the two domains randomly.

Setup Step 2: Select Your Pricing Plan

Select your Speech-to-Text pricing plan. Keep in mind that the region you select will influence your price; use the region dropdown to check pricing in different regions. At the time of writing, US-East is the lowest cost.

Once you’ve selected your plan, hit “Create”. For this tutorial, we’ll use the Free plan.

Setup Step 3: Validate Your Credentials

Once you select your plan, you’ll need to create a project. This process is pretty self-explanatory.

After creating the project, you’ll receive your credentials: a username, a password, and a service URL.

To validate your credentials, use IBM’s test file:

curl -X POST \
  -u "{username}":"{password}" \
  --header "Content-Type: audio/flac" \
  --data-binary @{path_to_file}audio-file.flac \
  "{url}/v1/recognize"

Replace {username}, {password}, {url}, and {path_to_file} with your own values.

You should receive a response with your transcription. Congratulations, you’ve just transcribed your first audio via Watson!
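If your credentials are valid, the JSON response will look something like this (the transcript and confidence values here are just illustrative):

{
   "results": [
      {
         "alternatives": [
            {
               "confidence": 0.96,
               "transcript": "several tornadoes touch down as a line of severe thunderstorms swept through Colorado "
            }
         ],
         "final": true
      }
   ],
   "result_index": 0
}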

Now, let’s get to using Watson programmatically.

Setup Step 4: Find Your Bindings

Locate the bindings for your specific language; IBM provides SDKs for Node.js, Java, Python, and several other languages.

For the rest of this tutorial, we’ll use NodeJS. Most of the concepts will be similar across languages.

To install our bindings, we’ll use: 

npm install watson-developer-cloud

Create a New Transcription Job

The “IBM Developer Cloud” authentication section is a bit intimidating. Fortunately, Speech to Text seems to have its own authentication, which works with just a username and password.

So all we need to do is replace the username, password, and file path in the template code below:

var SpeechToTextV1 = require('watson-developer-cloud/speech-to-text/v1');
var fs = require('fs');

// Authenticate with the credentials from Step 3
var speechToText = new SpeechToTextV1({
  username: '<username>',
  password: '<password>',
  url: 'https://stream.watsonplatform.net/speech-to-text/api/'
});

// Stream the audio from a local file
var params = {
  audio: fs.createReadStream('./resources/speech.wav'),
  content_type: 'audio/l16; rate=44100'
};

// Send the audio off for transcription
speechToText.recognize(params, function(err, res) {
  if (err)
    console.log(err);
  else
    console.log(JSON.stringify(res, null, 2));
});

Once you’re ready, just run the script and you should get a response with your transcription.
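For example, if you saved the script as transcribe.js (any file name works), run it from the terminal:

node transcribe.js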

By default, the Speech Recognition API returns an object containing a results array. Each result holds an alternatives array, and each alternative contains a confidence score and the transcript text.
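For example, here’s a minimal sketch of pulling the plain transcript text out of the res object from the script above:

// Each result holds an alternatives array; take the top alternative's transcript
var transcript = res.results
  .map(function(result) {
    return result.alternatives[0].transcript;
  })
  .join('');
console.log(transcript);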

Key Parameters and Options

Here are the most important options you should know about.

Language Model

The language model determines your language and how the speech API interprets your audio. The most important distinction is between broadband and narrowband models: broadband is for high-quality audio sampled at 16kHz or higher (usually microphone or smartphone recordings), while narrowband is for 8kHz audio (usually phone calls).

More information here.
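For example, to transcribe a telephone recording, you’d pair an 8kHz sample rate with the narrowband model (the file name here is just illustrative):

var params = {
  // 8kHz telephone audio, so use the narrowband model
  audio: fs.createReadStream('./phone-call.wav'),
  content_type: 'audio/l16; rate=8000',
  model: 'en-US_NarrowbandModel'
};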

Multiple Speakers

If there are multiple speakers in the audio, Watson can try to differentiate them for you. To enable it, set speaker_labels to true.
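When speaker_labels is enabled, the response includes a speaker_labels array alongside the transcript, tagging each stretch of audio with a speaker number. It looks something like this (values illustrative):

"speaker_labels": [
   {
      "from": 0.68,
      "to": 1.19,
      "speaker": 0,
      "confidence": 0.45,
      "final": false
   }
]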

Formatting Numbers, Dates, Time, and Currency

By default, numbers are not formatted. For example, a grocery store interaction might be transcribed as:

“Your order will be fifteen dollars and twenty cents, with a ten percent discount.”

To have these numbers formatted, enable Smart Formatting by setting smart_formatting to true. The above sentence will now be formatted to:

“Your order will be $15.20, with a 10% discount.”

Timestamps

The timestamps option returns each word with its start and stop time in seconds:

"timestamps": [
   [
      "hi",
      0.68,
      1.19
   ],
   [
      "everyone",
      1.33,
      2.52
   ],
   [
      "welcome",
      2.98,
      3.86
   ]
]

To receive timestamps, set timestamps to true.

These were the most common options I needed to configure. For a full list of parameters, see IBM’s API documentation. To pass them in, just add them to the params object:

var params = {
  // From file
  audio: fs.createReadStream('./5mins.wav'),
  content_type: 'audio/l16; rate=16000',
  model: "en-US_NarrowbandModel",
  speaker_labels: true,
  timestamps: true,
  smart_formatting: true
};

Questions or Comments?

Now you know how to use IBM Watson to turn audio into text.

Have questions? Comments? Leave a comment below – we respond to every comment!