
Adding speech recognition to an Ionic App

Published: 18. December 2017  •  Updated: 17. March 2026  •  ionic, javascript, java, spring

In this blog post, I present two different approaches for incorporating speech recognition into a web app, specifically an Ionic app.

Important update: this article now focuses entirely on browser-based approaches, and browser support for speech APIs is still fragmented enough that you should choose the approach based on your target platforms.

As an example, I created a simple movie search database with an Ionic front end and a Java/Spring Boot back end. The user can search for movie titles by speaking into the microphone, which a speech recognition library will transcribe into text. The web app then sends a search request to the Spring Boot application, where it searches for matching movies stored in a MongoDB database.


Test data

Before I started with the app, I needed some data about movies to insert into the database. The Internet Movie Database (IMDb) provides a set of raw data files for free. You can find all the information about the data files at https://developer.imdb.com/. Note that these data files are only available for personal and non-commercial use.

To download and import the data files, I wrote a Java application. The program parses the files with the Univocity library (the data in the files are stored as tab-separated values) and then inserts the data into a MongoDB database.

You can find the complete code of the importer on GitHub: src/main/java/ch/rasc/speechsearch/ImportImdbData.java
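The importer itself is not reproduced in this post. As a rough illustration of the parse-and-map step, here is a stdlib-only sketch; the real importer uses the Univocity TsvParser instead of `String.split`, and the class and field names below are invented for this example. The column layout follows IMDb's title.basics.tsv file, where `\N` marks a missing value.

```java
// Hypothetical minimal model; the real importer maps more fields
// and writes the result into MongoDB with the Java driver.
class ImdbTitle {
  String id;
  String primaryTitle;
  boolean adult;
  Integer runtimeMinutes;
  String genres;

  // Parses one line of title.basics.tsv. Columns: tconst, titleType,
  // primaryTitle, originalTitle, isAdult, startYear, endYear,
  // runtimeMinutes, genres. IMDb uses \N for missing values.
  static ImdbTitle parseLine(String line) {
    String[] f = line.split("\t", -1);
    ImdbTitle t = new ImdbTitle();
    t.id = f[0];                       // tconst, e.g. tt0000001
    t.primaryTitle = f[2];
    t.adult = "1".equals(f[4]);        // isAdult flag
    t.runtimeMinutes = "\\N".equals(f[7]) ? null : Integer.valueOf(f[7]);
    t.genres = f[8];
    return t;
  }

  public static void main(String[] args) {
    ImdbTitle t = ImdbTitle.parseLine(
        "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short");
    System.out.println(t.primaryTitle + " (" + t.runtimeMinutes + " min)");
  }
}
```

The real importer additionally streams the file instead of reading it line by line into memory, which matters because the IMDb dumps contain millions of rows.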


Client

The client is written with the Ionic framework and based on the blank starter template. The app displays the movies as a list of cards and provides buttons for the two speech-search mechanisms covered in this post.


Server

The server is written in Java with Spring and Spring Boot. The search controller is annotated with @RestController and handles the search requests from the client.

  @GetMapping("/search")
  public List<Movie> search(@RequestParam("term") List<String> searchTerms) {

    Set<Movie> results = new HashSet<>();
    MongoCollection<Document> moviesCollection = this.mongoDatabase
        .getCollection("movies");

    MongoCollection<Document> actorCollection = this.mongoDatabase
        .getCollection("actors");

    List<Bson> orQueries = new ArrayList<>();
    for (String term : searchTerms) {
      orQueries.add(Filters.regex("primaryTitle", term + ".*", "i"));
    }

    try (MongoCursor<Document> cursor = moviesCollection.find(Filters.or(orQueries))
        .limit(20).iterator()) {
      while (cursor.hasNext()) {
        Document doc = cursor.next();
        Movie movie = new Movie();
        movie.id = doc.getString("_id");
        movie.title = doc.getString("primaryTitle");
        movie.adult = doc.getBoolean("adultMovie", false);
        movie.genres = doc.getString("genres");
        movie.runtimeMinutes = doc.getInteger("runtimeMinutes", 0);
        movie.actors = getActors(actorCollection, (List<String>) doc.get("actors"));
        results.add(movie);
      }
    }

    return new ArrayList<>(results);
  }

SearchController.java

The search method takes a list of search terms and starts a regular expression search over the movie title in the movies collection. The application limits the results to 20 entries and uses a cursor to iterate over the returned entries. It then converts the matching documents to POJOs (Movie) and returns them in a list to the client.
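One detail worth noting: MongoDB's $regex performs an unanchored match, so `term + ".*"` behaves like a case-insensitive substring search rather than a strict prefix search. The following stdlib sketch mirrors that matching behavior (the actual scan happens inside MongoDB; unlike the controller code above, this sketch also quotes the term so regex metacharacters in user input are treated literally):

```java
import java.util.regex.Pattern;

class RegexFilterSketch {
  // Mirrors Filters.regex("primaryTitle", term + ".*", "i"):
  // MongoDB regex matching is unanchored, so the pattern can match
  // anywhere in the title, not only at the start.
  static boolean matchesTitle(String term, String title) {
    Pattern p = Pattern.compile(Pattern.quote(term) + ".*",
        Pattern.CASE_INSENSITIVE);
    return p.matcher(title).find();
  }

  public static void main(String[] args) {
    System.out.println(matchesTitle("star", "Star Wars")); // true
    System.out.println(matchesTitle("star", "Dark Star")); // true, unanchored
    System.out.println(matchesTitle("trek", "Star Wars")); // false
  }
}
```

If you want a true prefix search, anchor the pattern with a leading `^`; an anchored prefix regex can also use an index on the title field, while an unanchored one forces a collection scan.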

1. Web Speech API

Next, I looked for a way that works without any native plugins. The good news is there is a specification that provides this functionality, the Web Speech API. The bad news is that support is still fragmented. In practice, speech recognition support remains best in Chromium-based browsers, while Firefox still does not implement it and support in other engines remains uneven. Check the current support table on Can I Use before choosing this approach.

The API is callback-based, and an app has to implement a few handlers. The searchWebSpeech method first checks whether a speech-recognition constructor is available on the window object. It then instantiates either SpeechRecognition or the still-widespread prefixed webkitSpeechRecognition variant to handle the speech recognition.

  private getSpeechRecognitionCtor(): any {
    return this.speechWindow.SpeechRecognition ?? this.speechWindow.webkitSpeechRecognition;
  }

home.page.ts

  searchWebSpeech(): void {
    const SpeechRecognitionCtor = this.getSpeechRecognitionCtor();
    if (!SpeechRecognitionCtor) {
      return;
    }

    const recognition = new SpeechRecognitionCtor();
    recognition.continuous = false;

    recognition.onstart = () => {
      this.isWebSpeechRecording = true;
      this.changeDetectorRef.detectChanges();
    };

    recognition.onerror = (event: any) => console.log('error', event);
    recognition.onend = () => {
      this.isWebSpeechRecording = false;
      this.changeDetectorRef.detectChanges();
    };

    recognition.onresult = (event: any) => {
      const terms: string[] = [];
      if (event.results) {
        for (const result of event.results) {
          for (const ra of result) {
            terms.push(ra.transcript);
          }
        }
      }

      this.movieSearch(terms);
    };

    recognition.start();
  }

home.page.ts

Because I also wanted to disable the button when the recording is running, the onstart and onend handlers set and reset a flag.

The API automatically recognizes when the user stops speaking. It then transcribes the speech into text. This is what the onresult handler receives as a parameter. In this handler, the code collects all the transcriptions into one array and calls the movieSearch method that sends the request to the server.

I tested this on Chrome on a Windows desktop, and it worked very well. Today, this approach is still attractive when you control the browser choice and can rely on Chromium-based browsers.

2. Recording with WebRTC and sending it to the Google Cloud Speech API

The first approach works well but is limited to Chromium-based browsers, and I was more interested in a solution that works in most modern browsers without any additional plugin.

With WebRTC, it's not that complicated to record an audio stream in almost all modern browsers. There is an excellent library available that smooths out the different WebRTC implementations and can record audio: RecordRTC

The example I present here runs on Edge, Firefox, and Chrome on a Windows computer. I haven't tested Safari, but it has shipped a WebRTC implementation since version 11, so the example should work on Apple's browser too.

After the speech is recorded, the app transfers it to our server and from there to the Google Cloud Speech-to-Text API, a service that transcribes spoken words into text. This service is not free; check the pricing page for details.

Unlike the Web Speech approach, the application has to handle starting and stopping the recording manually. For that, the method uses a boolean instance variable and sets it to true when the recording is running. The user then needs to click the Stop button when they are finished speaking.

  async searchGoogleCloudSpeech(): Promise<void> {
    if (this.isRecording) {
      if (this.recorder) {
        // eslint-disable-next-line @typescript-eslint/no-unused-vars
        this.recorder.stopRecording(async (_: any) => {
          const recordedBlob = this.recorder.getBlob();

          const headers = new Headers();
          headers.append('Content-Type', 'application/octet-stream');

          const requestParams = {
            headers,
            method: 'POST',
            body: recordedBlob,
          };
          const response = await fetch(`${environment.serverUrl}/uploadSpeech`, requestParams);
          const searchTerms = await response.json();
          this.mediaStream?.getTracks().forEach((track) => track.stop());
          this.mediaStream = null;
          this.movieSearch(searchTerms);
        });
      }
      this.isRecording = false;
    } else {
      this.isRecording = true;
      this.mediaStream = await navigator.mediaDevices.getUserMedia({ video: false, audio: true });
      const options = {
        mimeType: 'audio/wav',
        recorderType: RecordRTC.StereoAudioRecorder,
      };
      this.recorder = RecordRTC(this.mediaStream, options);
      this.recorder.startRecording();
    }
  }

home.page.ts

When the user starts the recording, the method accesses the audio stream with getUserMedia and calls the RecordRTC object with the stream as a source. In this example, I set the audio format to wav and use the stereo recorder. This works fine on all three browsers I tested. When the recording stops, the method receives a blob from RecordRTC that contains the recorded audio in wav format. Then it uploads the binary data to Spring Boot (/uploadSpeech) and waits for the transcription to return. After that, it calls the movieSearch method.

On the server, I use the google-cloud-speech Java library to connect the application with the Google Cloud. The project needs this dependency in the pom.xml

    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-speech</artifactId>
      <version>4.81.0</version>
    </dependency>

pom.xml

Before you can access a service in Google Cloud, you need a credentials file. To get that, log in to your Google Account and open the Google Cloud Console. There you either create a new project or select an existing one and add the Google Cloud Speech API to the project. Then open the credentials menu and create a new service account. You can then download a JSON file that you can add to the project. Don't commit this file into your git repository; it contains sensitive information that allows anybody who has the key to access the API. In this application, I externalize the path to this credential file with a configuration property class (AppConfig).
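The AppConfig class is not shown in this post. A minimal sketch of what such a configuration property class could look like follows; the property names `app.credentials-path` and `app.ffmpeg-path` are assumptions for this example.

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

// Hypothetical sketch: binds app.credentials-path and app.ffmpeg-path
// from application.properties (or environment variables), so the
// credentials file never has to live inside the repository.
@ConfigurationProperties(prefix = "app")
public class AppConfig {

  private String credentialsPath;
  private String ffmpegPath;

  public String getCredentialsPath() {
    return this.credentialsPath;
  }

  public void setCredentialsPath(String credentialsPath) {
    this.credentialsPath = credentialsPath;
  }

  public String getFfmpegPath() {
    return this.ffmpegPath;
  }

  public void setFfmpegPath(String ffmpegPath) {
    this.ffmpegPath = ffmpegPath;
  }
}
```

For Spring Boot to pick the class up, it has to be registered, for example with @ConfigurationPropertiesScan on the application class or @EnableConfigurationProperties(AppConfig.class).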

The SearchController needs to create a SpeechClient instance that allows it to send requests to the Google Cloud Speech API. The code first reads the credential file and then creates an instance of this class.

  public SearchController(AppConfig appConfig) throws IOException {
    MongoClientSettings mongoClientSettings = MongoClientSettings.builder()
        .writeConcern(WriteConcern.UNACKNOWLEDGED).build();
    this.mongoClient = MongoClients.create(mongoClientSettings);
    this.mongoDatabase = this.mongoClient.getDatabase("imdb");
    this.ffmpegPath = appConfig.getFfmpegPath();

    ServiceAccountCredentials credentials = ServiceAccountCredentials
        .fromStream(Files.newInputStream(Path.of(appConfig.getCredentialsPath())));
    SpeechSettings settings = SpeechSettings.newBuilder()
        .setCredentialsProvider(FixedCredentialsProvider.create(credentials)).build();
    this.speech = SpeechClient.create(settings);
  }

SearchController.java

The last piece of the puzzle is the handler for the /uploadSpeech endpoint. This method receives the bytes of the recorded speech sample in wav format and stores them in a file.

  @PostMapping("/uploadSpeech")
  public List<String> uploadSpeech(@RequestBody byte[] payloadFromWeb) throws Exception {
    String id = UUID.randomUUID().toString();
    Path inFile = Files.createTempFile("speechsearch-" + id, ".wav");
    Path outFile = Files.createTempFile("speechsearch-" + id, ".flac");

    try {
      Files.write(inFile, payloadFromWeb, StandardOpenOption.TRUNCATE_EXISTING);

      FFmpeg ffmpeg = new FFmpeg(this.ffmpegPath);
      FFmpegBuilder builder = new FFmpegBuilder().setInput(inFile.toString())
          .overrideOutputFiles(true).addOutput(outFile.toString())
          .setAudioSampleRate(44_100).setAudioChannels(1)
          .setAudioSampleFormat(FFmpeg.AUDIO_FORMAT_S16).setAudioCodec("flac").done();

      FFmpegExecutor executor = new FFmpegExecutor(ffmpeg);
      executor.createJob(builder).run();

      byte[] payload = Files.readAllBytes(outFile);

      ByteString audioBytes = ByteString.copyFrom(payload);

      RecognitionConfig config = RecognitionConfig.newBuilder()
          .setEncoding(AudioEncoding.FLAC).setLanguageCode("en-US").build();
      RecognitionAudio audio = RecognitionAudio.newBuilder().setContent(audioBytes).build();

      RecognizeResponse response = this.speech.recognize(config, audio);
      List<SpeechRecognitionResult> results = response.getResultsList();

      List<String> searchTerms = new ArrayList<>();
      for (SpeechRecognitionResult result : results) {
        SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
        searchTerms.add(alternative.getTranscript());
      }

      return searchTerms;
    }
    finally {
      Files.deleteIfExists(inFile);
      Files.deleteIfExists(outFile);
    }
  }

SearchController.java

The problem I had here was that the Cloud Speech API could not handle the wav file it gets from the web app. One problem is the unsupported format; the other is that the recording is in stereo, while the Speech API requires mono recordings. Unfortunately, I haven't found a pure Java library that can convert sound files. However, there is a way to do that with a native application and still support multiple operating systems.

ffmpeg is a program for handling multimedia files. Among other things, it can convert audio files into other formats. On the download page, you can find builds for many operating systems.
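The wrapper call shown later in this post boils down to a single ffmpeg invocation. As an illustration, this sketch assembles the equivalent command line with plain Java; the paths are placeholders, and you could hand the list to a ProcessBuilder instead of using the wrapper library.

```java
import java.util.List;

class FfmpegCommandSketch {
  // Builds the ffmpeg command that the wrapper library issues later in
  // this post: resample to 44.1 kHz, downmix to one channel (mono),
  // 16-bit samples, FLAC codec. Paths are placeholders.
  static List<String> convertCommand(String ffmpegPath, String in, String out) {
    return List.of(ffmpegPath,
        "-y",                 // overwrite the output file if it exists
        "-i", in,             // input wav file
        "-ar", "44100",       // audio sample rate
        "-ac", "1",           // audio channels: mono
        "-sample_fmt", "s16", // 16-bit samples
        "-c:a", "flac",       // FLAC audio codec
        out);
  }

  public static void main(String[] args) {
    List<String> cmd = convertCommand("ffmpeg", "in.wav", "out.flac");
    System.out.println(String.join(" ", cmd));
    // To actually run it:
    // new ProcessBuilder(cmd).inheritIO().start().waitFor();
  }
}
```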

Today, I recommend installing ffmpeg so it is available on your PATH, or configuring the executable path explicitly with the app.ffmpegPath property. To call the executable from the Java code, I found a Java wrapper library that simplifies setting parameters and calling the program.

    <dependency>
      <groupId>net.bramp.ffmpeg</groupId>
      <artifactId>ffmpeg</artifactId>
      <version>0.8.0</version>
    </dependency>

pom.xml

In the code above, you can see how I use this library to specify the configuration parameters that convert the wav file into a mono FLAC file. FLAC is one of the supported audio formats of the Cloud Speech API. Calling the API itself is straightforward once you have the recording in a supported format. All it needs is a call to the recognize method with the binary data of the recording and a few configuration parameters, such as the description of the format.

RecognizeResponse response = this.speech.recognize(config, audio);

The method returns the text transcription when the service is able to understand some words in the recording. The uploadSpeech method then sends back these strings to the Ionic app.

Wrapping up

This concludes our journey into speech recognition land.

The server-backed solution is more complicated because it requires an additional server-side component, but it supports a much broader range of browsers and does not depend on a native plugin.

The Web Speech API remains attractive when you can standardize on Chromium-based browsers. If you need broader browser coverage, a server-backed transcription flow is still the more predictable option.