Introducing the Speech Synthesis API: Enhancing Web Interfaces with Voice

The Speech Synthesis API, part of the Web Speech API, lets a web page speak text aloud to the user, which opens the door to new kinds of interfaces. Introduced in 2014, it is now supported in Chrome, Firefox, Safari, and Edge, but not in Internet Explorer.

With the Speech Synthesis API, developers can make their web pages talk by providing speech requests. To get started, simply use the speechSynthesis.speak() function with a new SpeechSynthesisUtterance object as the parameter. For example, you can try the following one-liner in your browser console:

speechSynthesis.speak(new SpeechSynthesisUtterance('Hey'));
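
Since the API is missing in Internet Explorer, it can help to check for support before calling speak(). The following is only a minimal sketch: the helper names canSpeak and sayIfSupported are hypothetical and not part of the API, and the scope parameter exists purely so the check can be tried against a stand-in object.

```javascript
// Hypothetical helper: checks whether `scope` (normally `window`) exposes
// both globals used by the one-liner above.
const canSpeak = scope =>
  scope != null &&
  typeof scope === 'object' &&
  'speechSynthesis' in scope &&
  'SpeechSynthesisUtterance' in scope;

// Hypothetical wrapper: speaks `text` when supported and reports whether it did.
const sayIfSupported = (text, scope = globalThis) => {
  if (!canSpeak(scope)) {
    return false;
  }
  scope.speechSynthesis.speak(new scope.SpeechSynthesisUtterance(text));
  return true;
};

// Speaks in a supporting browser; returns false elsewhere.
sayIfSupported('Hey');
```

Returning a boolean lets the caller decide on a fallback, such as showing the text on screen.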

The API provides several objects that can be accessed through the window object. One of the main objects is the SpeechSynthesisUtterance, which represents a speech request. You can customize the speech properties of the utterance object by adjusting parameters such as rate, pitch, volume, lang, text, and voice. For instance, you can change the pitch, volume, and rate using the following code:

const utterance = new SpeechSynthesisUtterance('Hey');
utterance.pitch = 1.5;  // 0 to 2, default 1
utterance.volume = 0.5; // 0 to 1, default 1
utterance.rate = 8;     // 0.1 to 10, default 1; 8 is very fast
speechSynthesis.speak(utterance);
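
Because these properties only accept a limited range of values, a small helper can normalize input before it reaches the utterance. The sketch below is an illustration under assumptions: clampSpeechOptions and SPEECH_LIMITS are hypothetical names, and the ranges are the ones documented for the API.

```javascript
// Documented ranges and defaults for the numeric utterance properties.
const SPEECH_LIMITS = {
  rate: { min: 0.1, max: 10, fallback: 1 },
  pitch: { min: 0, max: 2, fallback: 1 },
  volume: { min: 0, max: 1, fallback: 1 },
};

// Hypothetical helper: returns rate, pitch, and volume clamped to their ranges.
// Missing or non-numeric values fall back to the default.
const clampSpeechOptions = (options = {}) => {
  const clamped = {};
  for (const [name, { min, max, fallback }] of Object.entries(SPEECH_LIMITS)) {
    const raw = options[name];
    clamped[name] = typeof raw === 'number' && Number.isFinite(raw)
      ? Math.min(max, Math.max(min, raw))
      : fallback;
  }
  return clamped;
};

// Browser-only usage; skipped where the API is unavailable.
if (typeof speechSynthesis !== 'undefined') {
  const clampedUtterance = new SpeechSynthesisUtterance('Hey');
  Object.assign(clampedUtterance, clampSpeechOptions({ pitch: 1.5, volume: 0.5, rate: 8 }));
  speechSynthesis.speak(clampedUtterance);
}
```

Keeping the helper free of browser calls means it can be reused for validating settings stored elsewhere, for example in a preferences form.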

The Speech Synthesis API also lets you choose among different voices for a more personalized experience. The speechSynthesis.getVoices() function returns the list of available voices, but browsers differ in when that list is ready. In Firefox and Safari, calling getVoices() directly typically returns the list. In Chrome, the voices are loaded asynchronously, so the first call may return an empty array; there you need to listen for the voiceschanged event, which fires once the voices have been loaded. Here’s an example of how to handle this difference:

// Runs once the browser has loaded (or updated) its list of voices.
const voiceschanged = () => {
  const voices = speechSynthesis.getVoices();
  console.log(`Voices #: ${voices.length}`);
  voices.forEach(voice => {
    console.log(voice.name, voice.lang);
  });
};

speechSynthesis.onvoiceschanged = voiceschanged;

If you require an abstraction layer to handle this cross-browser difference, you can use the getVoices function provided in the code example below:

const getVoices = () => {
  return new Promise(resolve => {
    let voices = speechSynthesis.getVoices();
    if (voices.length) {
      resolve(voices);
      return;
    }
    speechSynthesis.onvoiceschanged = () => {
      voices = speechSynthesis.getVoices();
      resolve(voices);
    };
  });
};

const printVoicesList = async () => {
  (await getVoices()).forEach(voice => {
    console.log(voice.name, voice.lang);
  });
};

printVoicesList();
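
One caveat with the promise above is that it never settles if the initial list is empty and voiceschanged does not fire. A hedged variant is sketched below: it adds a timeout and takes the synthesizer as a parameter so it can be exercised with a stub. The name loadVoices, its synth parameter, and the 2000 ms default are assumptions made for illustration.

```javascript
// Hypothetical variant of getVoices: `synth` is any object shaped like
// window.speechSynthesis; `timeoutMs` is an arbitrary default.
const loadVoices = (synth, timeoutMs = 2000) =>
  new Promise(resolve => {
    const initial = synth.getVoices();
    if (initial.length) {
      resolve(initial);
      return;
    }

    let timer;
    const canListen = typeof synth.addEventListener === 'function';
    // Resolve with whatever is available, whether the event or the timeout fired.
    const finish = () => {
      clearTimeout(timer);
      if (canListen) {
        synth.removeEventListener('voiceschanged', finish);
      }
      resolve(synth.getVoices());
    };

    if (canListen) {
      synth.addEventListener('voiceschanged', finish);
    }
    timer = setTimeout(finish, timeoutMs);
  });

// Browser-only usage; skipped where the API is unavailable.
if (typeof speechSynthesis !== 'undefined') {
  loadVoices(speechSynthesis).then(voices => {
    console.log(`Loaded ${voices.length} voices`);
  });
}
```

Using addEventListener instead of assigning onvoiceschanged also avoids overwriting a handler that other code may have registered.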

To use a specific language for the speech, you can set the lang property of the SpeechSynthesisUtterance object. For example, to use the Italian language, you can do the following:

const utterance = new SpeechSynthesisUtterance('Ciao');
utterance.lang = 'it-IT';
speechSynthesis.speak(utterance);

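When several voices are available for a language, you can select a specific one from the list. The API does not expose a voice's gender, so the code below simply picks the Italian voice at a fixed index (voiceIndex); which voice that is, male or female, depends on the browser and operating system: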

const lang = 'it-IT';
const voiceIndex = 1;

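const speak = async text => {
  // Bail out in browsers without speech synthesis support.
  // (A bare `!speechSynthesis` would throw a ReferenceError there.)
  if (!('speechSynthesis' in window)) {
    return;
  }
  const message = new SpeechSynthesisUtterance(text);
  // Set the language too, so the browser can pick a suitable default voice.
  message.lang = lang;
  message.voice = await chooseVoice();
  speechSynthesis.speak(message);
};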

const getVoices = () => {
  return new Promise(resolve => {
    let voices = speechSynthesis.getVoices();
    if (voices.length) {
      resolve(voices);
      return;
    }
    speechSynthesis.onvoiceschanged = () => {
      voices = speechSynthesis.getVoices();
      resolve(voices);
    };
  });
};

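const chooseVoice = async () => {
  // Keep only the voices for the requested language.
  const voices = (await getVoices()).filter(voice => voice.lang === lang);

  // An async function already returns a promise, so no wrapper is needed.
  // The result is undefined when fewer than voiceIndex + 1 voices match.
  return voices[voiceIndex];
};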

speak('Ciao');

The snippet above retrieves the available voices with the getVoices function and keeps those whose lang matches the specified language (lang). It then takes the voice at the specified index (voiceIndex) and assigns it to the voice property of the SpeechSynthesisUtterance object. If no voice exists at that index, no specific voice is set and the browser uses a default one.
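
Selecting by index depends on the order in which a browser lists its voices, which can differ between platforms. As an alternative sketch, the hypothetical pickVoice helper below works on a plain array of voices and prefers an exact language match, then a match on the primary language subtag, then the voice flagged as default. Normalizing underscores to hyphens is an assumption, made because some platforms are reported to use tags like it_IT.

```javascript
// Normalizes a language tag for comparison: 'it_IT' becomes 'it-it'.
const normalizeLang = tag => String(tag || '').replace(/_/g, '-').toLowerCase();

// Hypothetical helper: picks a voice from `voices` for `wantedLang`,
// or returns null when nothing suitable is found.
const pickVoice = (voices, wantedLang) => {
  const wanted = normalizeLang(wantedLang);
  const primary = wanted.split('-')[0];

  return (
    voices.find(voice => normalizeLang(voice.lang) === wanted) ||
    voices.find(voice => normalizeLang(voice.lang).split('-')[0] === primary) ||
    voices.find(voice => voice.default) ||
    null
  );
};

// Plain objects stand in for SpeechSynthesisVoice instances here.
const sampleVoices = [
  { name: 'Voice A', lang: 'en-US', default: true },
  { name: 'Voice B', lang: 'it_IT', default: false },
  { name: 'Voice C', lang: 'fr-FR', default: false },
];

console.log(pickVoice(sampleVoices, 'it-IT').name); // Voice B
console.log(pickVoice(sampleVoices, 'fr-CA').name); // Voice C
console.log(pickVoice(sampleVoices, 'ja-JP').name); // Voice A
```

In the browser, the array passed in would be the result of getVoices, and the returned voice (if any) would be assigned to the utterance's voice property.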

The languages actually available depend on the voices installed in the browser and operating system. Between them, current implementations cover a wide range, including Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, and Turkish.
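
To see which languages a particular device can speak, one option is to collect the distinct lang values from its voice list. The sketch below uses a hypothetical helper named listLanguages and sample data in place of real voices.

```javascript
// Hypothetical helper: returns the distinct language tags in `voices`, sorted.
const listLanguages = voices =>
  [...new Set(voices.map(voice => voice.lang))].sort();

// Plain objects stand in for SpeechSynthesisVoice instances here.
const demoVoices = [
  { name: 'Voice A', lang: 'it-IT' },
  { name: 'Voice B', lang: 'en-US' },
  { name: 'Voice C', lang: 'it-IT' },
];

console.log(listLanguages(demoVoices)); // ['en-US', 'it-IT']
```

In the browser, the input would be the array returned by speechSynthesis.getVoices() once the voices have loaded.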

Note that on iOS devices, speech must be triggered by a user action, such as a tap. This restriction prevents pages from playing unexpected sounds on the device.
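
A common way to satisfy this requirement is to call speak() directly inside the handler for a tap or click. The sketch below assumes a hypothetical helper named speakOnTap and a button with the id speak-button; neither comes from the API.

```javascript
// Hypothetical helper: runs `speakFn` synchronously inside the click handler,
// so the speech request happens as part of the user's gesture.
// Returns a function that removes the handler again.
const speakOnTap = (element, speakFn) => {
  const handler = () => speakFn();
  element.addEventListener('click', handler);
  return () => element.removeEventListener('click', handler);
};

// Browser-only usage; '#speak-button' is an assumed element id.
if (typeof document !== 'undefined' && typeof speechSynthesis !== 'undefined') {
  const button = document.querySelector('#speak-button');
  if (button) {
    speakOnTap(button, () => {
      speechSynthesis.speak(new SpeechSynthesisUtterance('Ciao'));
    });
  }
}
```

The handler calls speak() without awaiting anything first, since work deferred past the gesture may no longer count as user-initiated.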
