Implement Shion (詩音) from Sing a Bit of Harmony (讓我聽見愛的歌聲) with Python

nikkie

Nikkie began his career as a software engineer in 2016. He started Python as a hobby in 2017 and fell in love with it. Since 2019, he has worked on Natural Language Processing as a data scientist at Uzabase, Inc. in Tokyo, Japan.

He contributes to the Python community in Japan as a staff member of the following event:

- [PyCon Japan](https://www.pycon.jp/organizer/index.html): the largest PyCon in Japan
  - staff in 2019 and 2020 (program committee; lead in 2020)
  - [chair](https://pyconjp.blogspot.com/2020/10/pyconjp-2021-chair.html) in 2021

He has given talks (and lightning talks) at many PyCons in Japan and abroad:

- EuroPython 2020, [PyCon APAC 2020](https://youtu.be/JiXnEA7pM7U) (English)

He loves anime (Japanese animation) as much as Python, and implements anime-related ideas in Python. In 2022, he wrote code related to "Sing a Bit of Harmony" (e.g. a Twitter bot and a prototype AI character).

Abstract

How can we create a program that can speak (not write) with a human? I love anime and fell in love with the movie "Sing a Bit of Harmony" (讓我聽見愛的歌聲). One of its characters, the AI (robot) Shion, is very attractive from an engineer's point of view, and I wanted to implement at least some of its functions. I implemented shion.py, a script that takes text from a human by voice and responds by voice. In short, it is like a smart speaker that parrots: the program reads the recognized text back aloud. I started with an easy implementation (a Web API and an OS command) to check the idea, and then reworked it with pre-trained machine learning models to get closer to Shion. I will share those implementations with you. I would be happy to provide a little inspiration for your Maker project.

Keywords (as hashtags): #TTS, #ASR, #subprocess, #SpeechRecognition, #ttslearn, #ESPnet, #soundfile, #HuggingFace

Description

Background

Sing a Bit of Harmony

The movie: Sing a Bit of Harmony

In my opinion, this is an awesome film.
It has won awards at several film festivals:
https://en.wikipedia.org/wiki/Sing_a_Bit_of_Harmony#Reception

In October 2021, Sing a Bit of Harmony won the Audience Award at the Scotland Loves Animation film festival.

Motivation

In Japan, fans support their favorite animated films by drawing illustrations.
I cannot draw illustrations, but I wanted to support this movie somehow.

In this film, an AI (robot, android) named Shion plays a key role.
Since Shion is an AI, some parts of it can be reproduced by writing a program.
So, instead of illustrations, I decided to support this film by implementing some of Shion's features.

Technical Details

Define: Shion v0.0.1

I started with a small implementation of Shion.
I implement Shion as software. (The hardware is future work.)

I defined Shion v0.0.1 as a program that enables the following:

  1. We input our voice into Shion (the program)
  2. Shion transcribes the speech into text
  3. Shion processes the text
  4. Shion reads the text out loud as a response

Text processing also deserves careful design, but this time the focus is on handling speech.
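The four steps above can be sketched as one function with pluggable parts. This is only an illustrative outline, not the actual shion.py: the names `run_shion_once`, `parrot`, `transcribe`, and `speak` are hypothetical, and the ASR and TTS parts are passed in as plain callables so that they can be swapped later.

```python
from typing import Callable


def parrot(text: str) -> str:
    """Step 3 placeholder: return the recognized text unchanged."""
    return text


def run_shion_once(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    speak: Callable[[str], None],
    process: Callable[[str], str] = parrot,
) -> str:
    """Run one voice-in, voice-out turn and return the text that was spoken."""
    text = transcribe(audio)   # step 2: speech -> text (ASR)
    response = process(text)   # step 3: text processing (parrot by default)
    speak(response)            # step 4: text -> speech (TTS)
    return response


if __name__ == "__main__":
    # Stand-ins so the sketch runs without a microphone or a speaker:
    # the "audio" is just UTF-8 bytes, and "speaking" is printing.
    def fake_transcribe(audio: bytes) -> str:
        return audio.decode("utf-8")

    run_shion_once("こんにちは".encode("utf-8"), fake_transcribe, print)
```

Keeping `transcribe` and `speak` as parameters means the Web API / OS command version and the pre-trained model version can share the same loop.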

Techniques to implement Shion

  • Text-To-Speech enables a program to read text out loud
  • Automatic Speech Recognition enables us to input voice into a program

Text-To-Speech (a.k.a. TTS)

  • call OS command [1] (easy to implement, but depends on the environment)
  • use a pre-trained machine learning model

[1] Call the say command (macOS), as in https://docs.python.org/3/howto/logging-cookbook.html#speaking-logging-messages
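A minimal sketch of the OS command approach, assuming macOS where the `say` command is available. The helper names `build_say_command` and `say` are hypothetical, and the voice name in the usage example is an assumption that depends on which voices are installed.

```python
import shutil
import subprocess
from typing import List, Optional


def build_say_command(text: str, voice: Optional[str] = None) -> List[str]:
    """Build the argument list for the macOS `say` command."""
    command = ["say"]
    if voice is not None:
        command += ["-v", voice]  # -v selects the voice
    command.append(text)
    return command


def say(text: str, voice: Optional[str] = None) -> bool:
    """Read text aloud; return False when `say` is not on this machine."""
    if shutil.which("say") is None:
        return False  # not macOS (or `say` is not on PATH)
    # A list of arguments (no shell) avoids quoting problems in the text.
    subprocess.run(build_say_command(text, voice), check=True)
    return True


if __name__ == "__main__":
    # "Kyoko" is assumed to be an installed Japanese voice.
    if not say("こんにちは", voice="Kyoko"):
        print("The say command is not available in this environment.")
```

This is easy to write, but as noted above it only works where the command exists, which is one reason to move to a pre-trained model later.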

Automatic Speech Recognition (a.k.a. ASR)

  • call API like Cloud Speech-to-Text API [2]
  • use a pre-trained machine learning model (this provides the feature without Internet access, since Shion is standalone)

[2] https://cloud.google.com/speech-to-text
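A rough sketch of the Web API approach, assuming the third-party SpeechRecognition package (`pip install SpeechRecognition`) and an existing WAV file. Note that `recognize_google` calls the Google Web Speech API, which is a different endpoint from the Cloud Speech-to-Text API in [2]; the Cloud API needs credentials and is not shown here. The function name `transcribe_wav` and the file name `input.wav` are hypothetical.

```python
def transcribe_wav(path: str, language: str = "ja-JP") -> str:
    """Transcribe a WAV file; return "" when the speech is unintelligible."""
    # Imported lazily so this module loads even where the package is absent.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    try:
        # Sends the audio over the Internet; sr.RequestError is raised
        # (and not caught here) when the API cannot be reached.
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return ""


if __name__ == "__main__":
    print(transcribe_wav("input.wav"))
```

Because this needs Internet access, it only serves to check the idea; a standalone Shion needs the pre-trained model option.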

Caveats

⚠️ This talk mainly deals with ASR and TTS in Japanese; ASR and TTS in English are best effort.
⚠️ I am a beginner at ASR and TTS, so the focus will be on what implementations are possible. (I will not deal with the theory.)

Video