Presented by

  • Kathy Reid

    @KathyReid@aus.social https://datacollective.mozillafoundation.org

    Kathy Reid works at the intersection of open source, emerging technologies and technical communities.

    Over the last 20 years, she has held several technical leadership positions, including Digital Platforms and Operations Manager at Deakin University, where she managed platforms such as WordPress, Drupal, Squiz Matrix and Atlassian Confluence, and technical lead on projects involving digital signage and videoconferencing. She has also worked as a web and application developer.

    More recently, she has run her own technical consulting micro-business, and been engaged on a variety of projects involving data visualisation, certification applications and emerging technologies workshops.

    She was previously Director of Developer Relations at Mycroft.AI, an open source voice assistant startup, and President of Linux Australia, Inc., a not-for-profit organisation which advocates for the use of open source technologies and runs technical events such as linux.conf.au. She brought GovHack – the open data hackathon – to Geelong in 2015 and 2016, and in 2011 ran Geelong’s first unconference, BarCampGeelong. Most recently, she worked as a voice open source specialist for Mozilla.

    Kathy holds Arts and Science undergraduate degrees from Deakin University, an MBA (Computing) from Charles Sturt University and a Master of Applied Cybernetics (MAppCyber) from the Australian National University, as well as several ITIL qualifications.

    In 2019, she was one of 16 people from across the world chosen to undertake a Masters Program in a brand new branch of engineering at the Australian National University's 3A Institute, where she is now a PhD candidate researching voice data and ways to prevent and respond to bias in machine learning systems that use voice and speech, like speech recognition.

    Kathy works for the Mozilla Foundation as an engineer on linguistic data and the Mozilla Common Voice and Mozilla Data Collective platforms.

Abstract

Who is this tutorial for and what problem does it solve?

Many folks in the Everything Open community run open voice assistants of some description, most likely Home Assistant with the Voice Preview Edition device. If you run Home Assistant fully locally for privacy, then under the hood you’re using faster-whisper, a reimplementation of OpenAI’s Whisper speech recognition model in C++ (via CTranslate2) for speed and efficiency.
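For context, transcribing a clip locally with faster-whisper looks roughly like the sketch below (the model size, audio file name and compute type are illustrative placeholders, not anything Home Assistant ships as-is):

    # Minimal sketch of local transcription with faster-whisper.
    # The model size, audio file and compute type are placeholders.
    from faster_whisper import WhisperModel

    # int8 on CPU keeps the memory footprint small, much like embedded setups
    model = WhisperModel("small", device="cpu", compute_type="int8")

    segments, info = model.transcribe("kitchen_command.wav", language="en")
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")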

However, faster-whisper doesn’t always get it right, especially if you speak Australian - or with another accent.

How might we make faster-whisper work better for voices like yours?

Fine-tuning

The answer is fine-tuning. In machine learning, fine-tuning is the process of taking a trained model - like Whisper - and adjusting its weights and biases - its internal mathematical representation - using data whose distribution is closer to the task you want the model to perform. In simple terms, it means teaching Whisper how to speak Australian!

Bonza. Or if you’re from Radelaide, heaps good.

Datasets for fine-tuning Whisper

But where do you get data to fine-tune Whisper from?

Enter the Mozilla Data Collective. The new home of the Mozilla Common Voice datasets, the Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Built by and for the community in a transparent and ethical way - unlike datasets collected via less scrupulous means - the Mozilla Data Collective houses one of the largest open English speech datasets in the world. MDC allows data contributors to create data (through Common Voice), curate that data, and then control who has access to it and for what purposes.

And since 2022, Mozilla Common Voice has allowed data contributors to specify the accents they speak with - and, luckily, there are about 700 unique speakers in the Common Voice dataset who’ve indicated they speak with an Australian accent.

And we can use that data for fine-tuning Whisper, and help faster-whisper - and Home Assistant - work better for our voices.

Tutorial specifics (100 mins)

Pre-requisites

  • The pre-prepared Australian-accented speech dataset
  • Python on your workstation
  • Some exposure to Python code, although the tutorial will go step by step

Optional

  • A Hugging Face account
  • Ideally, some exposure to Hugging Face Transformers
0 mins - 15 mins: Introduction and context setting (15 mins)
  • Kathy will provide an introduction to speech recognition models, and briefly outline why fine-tuning is often needed to make a speech recognition model work for particular voices.
  • She will cover the Whisper models and give an overview of the data used to train them, explaining why that data has several shortcomings for uses such as Home Assistant. She will also cover the trade-off between model size and accuracy, and why smaller speech recognition models are used on embedded hardware, such as the Home Assistant Voice Preview Edition.
  • She will provide an overview of the Common Voice dataset, and how accents are represented in the dataset.
  • She will show how accent data can be extracted from the Common Voice dataset, but rather than spend tutorial time on this, will provide a pre-extracted dataset for people to use (a rough sketch of this kind of filtering follows below).
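As a rough illustration of that extraction step (not the exact process used to build the pre-prepared dataset; the directory name and the accent column name are assumptions based on recent Common Voice releases):

    # Illustrative sketch: filter a Common Voice English release down to clips
    # from contributors who report an Australian accent. The directory name and
    # the "accents" column name are assumptions - check your downloaded release.
    import pandas as pd

    cv_dir = "cv-corpus-en"  # placeholder: an extracted Common Voice English release
    df = pd.read_csv(f"{cv_dir}/validated.tsv", sep="\t")

    # Keep clips whose self-reported accent mentions Australia
    aussie = df[df["accents"].fillna("").str.contains("australia", case=False)]

    print(f"{aussie['client_id'].nunique()} unique Australian-accented speakers")
    aussie[["path", "sentence"]].to_csv("australian_subset.tsv", sep="\t", index=False)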
15 mins - 35 mins: Environment setup (20 mins)
  • Drawing from the Mozilla.AI blueprint for fine-tuning ASR models using Mozilla Common Voice datasets, Kathy will help people set up the environment for the tutorial on their laptops, using Google Colab.
  • https://blueprints.mozilla.ai/all-blueprints/finetune-an-asr-model-using-common-voice-data
  • Additional time is allowed here because environment setup can be fiddly and some participants may not have used Google Colab before (a sketch of the setup follows this list). Participants who have successfully set up their environment can move ahead with the tutorial.
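As a rough indication of what that setup involves (the authoritative package list is in the blueprint itself, so treat this as an approximation):

    # Rough sketch of a Colab setup cell. The exact package list comes from the
    # Mozilla.ai blueprint, so treat this as an approximation.
    # In a Colab cell you would run something like:
    #   %pip install -q transformers datasets evaluate jiwer librosa soundfile accelerate

    import torch

    # Fine-tuning really wants a GPU runtime (Runtime -> Change runtime type -> GPU)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))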
35 mins - 55 mins: Data preparation steps (20 mins)
  • The most time-consuming part of the tutorial will be the data loading and preparation steps. This requires, for example, resampling audio files to the 16 kHz sampling rate Whisper expects, and conversion of the dataset into the structure the training code requires (a sketch of these steps follows below).
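A sketch of what those preparation steps can look like with Hugging Face datasets and the Whisper processor (the dataset location, column names and model size are assumptions; the tutorial's pre-prepared dataset may be packaged differently):

    # Illustrative data-preparation sketch: resample audio to the 16 kHz Whisper
    # expects and turn each clip into encoder features plus decoder labels.
    # Dataset location, column names and model size are assumptions.
    from datasets import load_dataset, Audio
    from transformers import WhisperProcessor

    processor = WhisperProcessor.from_pretrained(
        "openai/whisper-small", language="english", task="transcribe"
    )

    # "audiofolder" assumes audio files plus a metadata file; adjust to how the
    # pre-extracted dataset is actually packaged
    dataset = load_dataset("audiofolder", data_dir="australian_subset")
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
    dataset = dataset["train"].train_test_split(test_size=0.1)

    def prepare(batch):
        audio = batch["audio"]
        # log-mel spectrogram features for the encoder
        batch["input_features"] = processor(
            audio["array"], sampling_rate=audio["sampling_rate"]
        ).input_features[0]
        # token ids of the transcript for the decoder labels ("sentence" follows
        # Common Voice's column name - rename if your metadata differs)
        batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
        return batch

    dataset = dataset.map(prepare, remove_columns=dataset["train"].column_names)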
55 mins - 75 mins: Fine-tuning using a GPU (20 mins)
  • In this step of the tutorial, the model is fine-tuned on the Australian-accented Common Voice data (a sketch of this step follows below).
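A sketch of this step, following the standard Hugging Face Seq2SeqTrainer recipe for Whisper (hyperparameters, directory names and the short step count are illustrative, chosen to fit the session rather than to match the blueprint exactly; the dataset and processor carry over from the preparation sketch):

    # Illustrative fine-tuning sketch using Hugging Face Seq2SeqTrainer.
    # Hyperparameters and directory names are illustrative only.
    from dataclasses import dataclass

    from transformers import (
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
        WhisperForConditionalGeneration,
        WhisperProcessor,
    )

    @dataclass
    class SpeechCollator:
        """Pads log-mel features and label token ids into a training batch."""
        processor: WhisperProcessor

        def __call__(self, features):
            inputs = [{"input_features": f["input_features"]} for f in features]
            batch = self.processor.feature_extractor.pad(inputs, return_tensors="pt")
            labels = [{"input_ids": f["labels"]} for f in features]
            labels_batch = self.processor.tokenizer.pad(labels, return_tensors="pt")
            # Replace padding token ids with -100 so they are ignored by the loss
            batch["labels"] = labels_batch["input_ids"].masked_fill(
                labels_batch["attention_mask"].ne(1), -100
            )
            return batch

    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    model.generation_config.language = "english"
    model.generation_config.task = "transcribe"

    training_args = Seq2SeqTrainingArguments(
        output_dir="whisper-small-en-au",
        per_device_train_batch_size=16,
        learning_rate=1e-5,
        warmup_steps=50,
        max_steps=500,   # deliberately short so it fits the tutorial time budget
        fp16=True,       # requires the Colab GPU runtime
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        data_collator=SpeechCollator(processor),
    )
    trainer.train()
    trainer.save_model("whisper-small-en-au")
    # Save the tokenizer and feature-extractor configs next to the model,
    # which makes the later conversion step simpler
    processor.save_pretrained("whisper-small-en-au")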

75 mins - 85 mins: Evaluating the fine-tuned model (10 mins)
  • In this step of the tutorial, the fine-tuned model is evaluated to see how well it works with participants’ voices (a sketch of this step follows below).
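A sketch of what this evaluation can look like, scoring word error rate (WER) with the evaluate library on a handful of held-out clips (the split name, sample count and variable names carry over from the earlier sketches and are illustrative):

    # Illustrative evaluation sketch: measure word error rate (WER) on a few
    # held-out clips from the prepared dataset.
    import evaluate
    import torch

    wer_metric = evaluate.load("wer")

    def transcribe(sample):
        # The prepared samples hold log-mel features and label token ids
        features = torch.tensor(sample["input_features"]).unsqueeze(0).to(model.device)
        predicted_ids = model.generate(input_features=features)
        return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

    references, predictions = [], []
    for sample in dataset["test"].select(range(20)):
        references.append(processor.tokenizer.decode(sample["labels"], skip_special_tokens=True))
        predictions.append(transcribe(sample))

    print("WER:", wer_metric.compute(predictions=predictions, references=references))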
85 mins - 90 mins: Discussion on what worked well and what didn’t, and the need for additional training data for fine-tuning (5 mins)
  • In this step of the tutorial, Kathy will lead a discussion on what worked well and what didn’t for fine-tuning. She will explore with participants what additional data would be useful for fine-tuning Whisper for use with Home Assistant, and some avenues for collecting this data, such as through Mozilla Common Voice and the Mozilla Data Collective.
90 mins - 95 mins: Converting the fine-tuned model to faster-whisper format and replacing it in Home Assistant (5 mins)
  • Using the faster-whisper repo, Kathy will demonstrate how to convert the fine-tuned model into the faster-whisper (CTranslate2) format for use in Home Assistant (a sketch of the conversion follows below).
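One way to script that conversion from Python is via CTranslate2's Transformers converter (the faster-whisper README documents an equivalent ct2-transformers-converter command line; directory names and the quantisation level here are illustrative):

    # Illustrative conversion sketch: turn the fine-tuned Hugging Face checkpoint
    # into a CTranslate2 directory that faster-whisper can load. Directory names
    # and the quantisation level are illustrative. The faster-whisper README also
    # suggests copying tokenizer.json and preprocessor_config.json into the output
    # directory if they are present in the checkpoint.
    from ctranslate2.converters import TransformersConverter
    from faster_whisper import WhisperModel

    converter = TransformersConverter("whisper-small-en-au")  # fine-tuned checkpoint dir
    converter.convert("whisper-small-en-au-ct2", quantization="float16")

    # Sanity check: load the converted model with faster-whisper and transcribe a clip
    model = WhisperModel("whisper-small-en-au-ct2", device="cpu", compute_type="int8")
    segments, _ = model.transcribe("test_phrase.wav", language="en")
    print(" ".join(segment.text for segment in segments))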
95 mins - 100 mins: Wrap-up and close (5 mins)
  • Kathy will wrap up by leading a discussion on how well the fine-tuned model worked, and what additional data could make a Home Assistant model better for Australian English speakers.