Hands-on with Google Cloud Text-to-Speech

The 1902 cloud release of SAP Enable Now introduced the ability to use Google Cloud Text-to-Speech to generate narrative audio. Prior to this, Enable Now relied on the underlying Microsoft Windows Text-to-Speech functionality, which itself does a more-than-adequate job.

Using Google Cloud Text-to-Speech is relatively easy, but it is nothing like using the previous Text-to-Speech functionality in Enable Now (which is still there, and which we’ll now refer to as “Windows Text-to-Speech” for easier differentiation). In fact, it looks and feels like an entirely separate piece of functionality, even hanging off a different menu option. This is probably wise, as it really is a separate piece of functionality, and you probably won’t want to confuse the two, for reasons I’ll explain below.

First, it is important to note that Google Cloud Text-to-Speech is a paid-for service. When you subscribe, you get an API Key, and you need that to be able to use the service. It is relatively cheap, at (currently) US$4.00 per million characters (which works out at around 300 pages of text) for ‘Standard’ voices and US$16.00 per million characters for ‘WaveNet’ voices, and you are only billed in increments of $100 as you reach that threshold. In practice this is a complete bargain. Google’s pricing model probably assumes that you are using the cloud service to convert text in real time, whenever the content is played, so if you have a lot of users all playing this content, the cost to you (the API key owner) adds up with every play. Enable Now works differently: it generates a .WAV file which is stored in the project, and this static file is used when the content is played, so you only pay once, at generation time. And how often do you really need to (re-)generate your content, which is fairly light anyway, being just bubble text?
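To put rough numbers on the “pay once” argument, here is a minimal Python sketch. It uses the per-million-character prices quoted above (which may well have changed since), and the function names, step count, text length and play count are all made up for illustration; it also ignores any free allowance and the billing threshold.

```python
# Prices as quoted in this post (US$ per million characters); check current pricing.
PRICE_PER_MILLION = {"standard": 4.00, "wavenet": 16.00}


def generation_cost(bubble_texts, voice_type="standard"):
    """Estimate the one-off cost of generating audio for a list of bubble texts."""
    total_chars = sum(len(text) for text in bubble_texts)
    return total_chars / 1_000_000 * PRICE_PER_MILLION[voice_type]


def per_play_cost(bubble_texts, plays, voice_type="standard"):
    """Hypothetical cost if the audio were synthesized again on every playback."""
    return generation_cost(bubble_texts, voice_type) * plays


# Illustrative project: 40 Steps of roughly 250 characters each, played 5,000 times.
steps = ["x" * 250] * 40
print(f"Generate once (standard): ${generation_cost(steps):.2f}")
print(f"Generate once (WaveNet):  ${generation_cost(steps, 'wavenet'):.2f}")
print(f"Synthesize on every play: ${per_play_cost(steps, 5000):.2f}")
```

For a project of that size the one-off generation cost is a matter of cents, whereas synthesizing on every play would scale with the number of plays.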

But if you are interested in trying it and don’t want to commit funds just yet, you can sign up for a free demo (which is good for a year, with a limit of 4 million characters a month). Although you have to give your credit card details, there is no ‘auto sign-up’ at the end of the demo period; you have to actively confirm that you do want to continue, so kudos to Google for sticking by their original motto of “Don’t be evil” on this one.

So, assuming you have an API key (paid or free trial), how do you use it? Within Enable Now Producer, you can invoke the service from the main Producer screen (where you can generate Text-to-Speech for multiple Projects at the same time) or from within the Project Editor, where you can generate Text-to-Speech for one or more specific Steps within the simulation. Here, we’ll look at the second method (although the dialog box and results are exactly the same in both methods).

  1. If you only want to generate Text-to-Speech for specific Steps, select these Steps. (Note that you don’t need to convert the project to an ‘audio’ project first, like you have to for Windows Text-to-Speech – Enable Now will automatically do this for you.)
  2. Select menu option Tools | Text-to-Speech | Google Text-to-Speech. The following dialog box is displayed:
  3. In the Mode field, select either All takes (this should really read All Steps) to generate Text-to-Speech for all selected Steps (or all Steps in the project if no Steps were selected in Step 1), or Takes without audio to generate Text-to-Speech only for those Steps that don’t already have audio (this is a nice option, as it stops you accidentally re-generating audio, and re-paying for it, when you don’t need to).
  4. The API URL field defaults to the location of the Google Cloud Text-to-Speech service, and should not need to be changed (there is an option to switch to the beta version if you know what you’re doing there).
  5. Enter your Google Cloud Text-to-Speech API Key in the API Key field. Note that this key is ‘remembered’ once you enter it, so you don’t need to enter it every time you want to convert text (it might have been nice to have it as a one-time central Setting, but…).
  6. In the Voice for xxx field, select the voice that you want to be used. There are several to choose from. Note that the ‘WaveNet’ voices are more natural sounding, but they are also more expensive. (I’ll include comparative examples later in this post.)
  7. If necessary, you can tweak the pitch and speed of the playback of the voice, via the sliders at the bottom of the dialog box.
  8. Click OK.

The audio is generated, and a confirmation message is displayed. Nice and simple. As with Windows Text-to-Speech, the Step now has an Audio icon on the right of the Step bar, and you can click on this to play the audio. The project is also automatically converted into an ‘Audio project’, which means that you can edit the audio via the Audio Editor if you need to do so. (In my testing you need to exit from the simulation project and then open it up again to get the Audio menu to appear, but this may be an ‘early release feature’.)
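If you are curious how the dialog fields might map onto the service itself, the Python sketch below builds a request body for the public Cloud Text-to-Speech v1 REST method text:synthesize. This is an illustration only: I don’t know exactly what Enable Now sends, and the helper names build_request and save_audio, the default values and the sample text are my own assumptions. Nothing is actually sent here, so no API key is needed to run it.

```python
import base64
import json

# Public REST endpoint of the Cloud Text-to-Speech v1 API.
API_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_request(text, voice_name="en-US-Wavenet-C", pitch=0.0, speaking_rate=1.0):
    """Build the JSON body for a text:synthesize call (hypothetical helper)."""
    # Derive the language from the voice name: "en-US-Wavenet-C" -> "en-US".
    language_code = "-".join(voice_name.split("-")[:2])
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {
            "audioEncoding": "LINEAR16",  # uncompressed PCM, returned with a WAV header
            "pitch": pitch,               # corresponds to the pitch slider
            "speakingRate": speaking_rate,  # corresponds to the speed slider
        },
    }


def save_audio(response_body, path):
    """Decode the base64 'audioContent' field of a response and write it to disk."""
    audio = base64.b64decode(response_body["audioContent"])
    with open(path, "wb") as handle:
        handle.write(audio)
    return len(audio)


body = build_request("Click the Save button.", voice_name="en-US-Standard-C")
print(json.dumps(body, indent=2))
# To send it, POST `body` as JSON to f"{API_URL}?key=YOUR_API_KEY",
# then pass the parsed JSON response to save_audio().
```

The response carries the audio as base64 text, which is why the sketch decodes it before writing the file.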

However, there is one large caveat with Google Cloud Text-to-Speech. It will always take the Bubble Text (and from the Demo mode, as this is the only mode in which audio is played back), and will completely ignore any text that you may enter via a Text-to-Speech Override macro. If you’re in the habit of providing your own Step-level audio as an alternative to a straight reading of each Bubble text in turn, you may not like this limitation.

So how does it sound? Below I’ve included three sample files (which only seem to show in Chrome – sorry). These were all created from within an Enable Now project, and I used exactly the same text in each of them to make comparison easier (I don’t know why the Microsoft one is a second shorter…).

Microsoft Anna
Google Cloud Standard Female C
Google Cloud WaveNet Female C

All of the above examples are in ‘American English’ (en-US). This is because the project itself uses this language, and there is no option to change the language at Text-to-Speech creation time (see the dialog box above). This is probably a sensible design decision, but I did discover, when playing around with the demo box on the Google Text-to-Speech site, that if you leave the text in English but change the ‘conversion language’ to something else, the result is English speech but with a foreign accent! And for some reason, I find training content read to me in English with a strong French accent to be much more pleasurable 😉.
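The accent trick can be reproduced against the API directly (not from within Enable Now, which offers no language option). The sketch below is a hedged illustration: the function name accent_request and the sample sentence are invented, and it relies on the fact that a text:synthesize request can name only a language code and leave the choice of voice to the service.

```python
def accent_request(english_text, language_code):
    """Request body that reads English text with the default voice of another language."""
    return {
        "input": {"text": english_text},
        # Only languageCode is set; the service picks a voice for that language.
        "voice": {"languageCode": language_code},
        "audioConfig": {"audioEncoding": "LINEAR16"},
    }


text = "Enter the purchase order number and press Enter."
for code in ("en-US", "fr-FR", "de-DE"):
    # Same English text each time; only the voice language changes.
    print(code, accent_request(text, code)["voice"])
```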

There are a couple of other small oddities that I have noticed. For example, if you generate Google Cloud Text-to-Speech for a Step, and then decide to go back to Windows Text-to-Speech, using the ‘old’ Generate Text-to-Speech option won’t work and the Google audio is retained. You need to first delete the Google audio, and then generate Windows Text-to-Speech. But most users are likely to use one method or the other, so this probably isn’t much of an issue. These kinds of ‘limitations’ are often teething problems, and SAP seem to be doing a good job of ironing them out over time.

So, that’s Text-to-Speech using the Google Cloud service, available in the Enable Now 1902 release. I’m sure you’ll agree that there is a marked difference between the Microsoft-generated voice and even the Google Cloud ‘Standard’ voice, and at the current pricing structure, this seems like a no-brainer. The Google WaveNet voice is another significant step up, but at 4x the price of the Standard voice, it’s more of a judgment call. That said, it is certainly cheaper (and easier) than using live ‘voice talent’, and maintenance is simplified (you don’t need to find the same person), so maybe it’s not that hard to justify.

Caveat:
The above was based on a release preview, and things may change slightly (like fixing “takes” to “Steps” in the dialog box, or maybe taking into account override text) by the official release.