[Javascript] 20. Speech Detection

winCow 2021. 5. 4. 00:01

1. Background

This exercise recognizes speech picked up by the microphone and prints it on the page. The recognition accuracy itself does not seem very good.

2. HTML

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Speech Detection</title>
</head>
<body>

  <div class="words" contenteditable>
  </div>

</body>
</html>

3. CSS

    html {
      font-size: 10px;
    }

    body {
      background: #ffc600;
      font-family: 'helvetica neue';
      font-weight: 200;
      font-size: 20px;
    }

    .words {
      max-width: 500px;
      margin: 50px auto;
      background: white;
      border-radius: 5px;
      box-shadow: 10px 10px 0 rgba(0,0,0,0.1);
      padding: 1rem 2rem 1rem 5rem;
      background: -webkit-gradient(linear, 0 0, 0 100%, from(#d9eaf3), color-stop(4%, #fff)) 0 4px;
      background-size: 100% 3rem;
      position: relative;
      line-height: 3rem;
    }
    
    p {
      margin: 0 0 3rem;
    }

    .words:before {
      content: '';
      position: absolute;
      width: 4px;
      top: 0;
      left: 30px;
      bottom: 0;
      border: 1px solid;
      border-color: transparent #efe4e4;
    }

4. Javascript

  window.SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  
  const recognition = new SpeechRecognition();
  recognition.interimResults = true;
  let p = document.createElement('p');
  const words = document.querySelector('.words');
  words.appendChild(p);

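SpeechRecognition is the controller interface of the browser's speech recognition service, and it is the object that emits the recognition events. The first line picks whichever constructor the browser exposes, the standard SpeechRecognition or the prefixed webkitSpeechRecognition. We then create an instance, store it in the recognition variable, and set its interimResults property to true. interimResults decides whether interim (not yet final) results are reported; when it is true, the result event also fires while a phrase is still being recognized, rather than only once the result is final.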

Next, we create a p element and keep it in the variable p, and we select the empty div with the class words into the variable words. appendChild then adds p as a child of words.
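
Because the constructor may be exposed under either name, or not at all, it can help to check for it before creating a recognizer. The following is a minimal sketch and not part of the original exercise; getRecognitionConstructor is a hypothetical helper name, and it takes the global object as a parameter so that it can be tried outside a browser.

```javascript
// Hypothetical helper: returns the SpeechRecognition constructor exposed by
// the given global object, or null when the API is unavailable.
function getRecognitionConstructor(scope) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// In a browser you would pass `window`; plain objects stand in for it here.
const prefixedOnly = { webkitSpeechRecognition: function FakeRecognition() {} };
console.log(getRecognitionConstructor(prefixedOnly) !== null); // true
console.log(getRecognitionConstructor({}));                    // null
```

When the helper returns null, the page could show a message instead of failing on new SpeechRecognition().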

  recognition.addEventListener('result', e => {
    // Take the first alternative of each result and join the pieces into one string
    const transcript = Array.from(e.results)
      .map(result => result[0])
      .map(result => result.transcript)
      .join('');

    p.textContent = transcript;
    // Once the result is final, start a new paragraph for the next utterance
    if (e.results[0].isFinal) {
      p = document.createElement('p');
      words.appendChild(p);
    }
    if (transcript.includes('유니콘')) {
      console.log('🦄🦄🦄🦄🦄');
    }
  });

  // Restart recognition whenever a session ends
  recognition.addEventListener('end', recognition.start);
  recognition.start();

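Now we attach an event listener to recognition so that the callback runs whenever a result event arrives. result is a SpeechRecognition event that fires when the recognition service returns a result. The event's results property is a SpeechRecognitionResultList, an array-like list of SpeechRecognitionResult objects, each of which looks like {0: SpeechRecognitionAlternative, length: 1, isFinal: false}. A SpeechRecognitionAlternative is one candidate interpretation of what was said; with the default maxAlternatives of 1 there is only the one at index 0.

The callback converts e.results into a real array with Array.from, then uses map to keep only the first SpeechRecognitionAlternative of each result. A SpeechRecognitionAlternative is an object of the form {transcript: "음성", confidence: 0.9236595630645752}, where transcript is the property that holds the recognized text. A second map collects the transcript values into an array, and join('') concatenates them with no separator.

The resulting string is assigned to the variable transcript. Note that the variable transcript and the transcript property of SpeechRecognitionAlternative are different things that merely share a name. Finally, the string is assigned to p.textContent, which replaces whatever text the paragraph currently shows.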
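
The map/map/join chain can be tried without a microphone. The sketch below wraps it in a function and feeds it mock data shaped like the result objects shown in the post; transcriptFrom and the sample values are made up for illustration.

```javascript
// Builds one string from a list of results, taking the first alternative
// of each result. Works on any array-like, so mock data is enough.
function transcriptFrom(results) {
  return Array.from(results)
    .map(result => result[0])
    .map(alternative => alternative.transcript)
    .join('');
}

// Mock data in the shape of SpeechRecognitionResult (values are invented).
const mockResults = [
  { 0: { transcript: 'hello', confidence: 0.92 }, length: 1, isFinal: true },
  { 0: { transcript: ' world', confidence: 0.88 }, length: 1, isFinal: false },
];

console.log(transcriptFrom(mockResults)); // prints: hello world
```

Naming the second map parameter alternative instead of result also makes it easier to see which object each step is working on.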

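The isFinal property tells whether a result is final (true) or still interim (false). When the delivered result is final, a new p is created and appended to words. Without this code the output stays on a single line whose content keeps being overwritten; with it, each finalized utterance gets its own paragraph.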
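
The effect of isFinal can be modeled without the DOM. In this sketch, which is an illustration rather than the post's code, an array of strings stands in for the paragraphs and the last entry plays the role of the current p; applyResult is a hypothetical name.

```javascript
// Models the paragraphs as an array of strings; the last entry is the
// paragraph that is currently being overwritten.
function applyResult(lines, transcript, isFinal) {
  const next = lines.slice(0, -1).concat(transcript); // overwrite current line
  return isFinal ? next.concat('') : next;            // final: open a new line
}

let lines = [''];
lines = applyResult(lines, 'hel', false);  // interim result
lines = applyResult(lines, 'hello', true); // final result, new line opened
lines = applyResult(lines, 'wor', false);  // interim result on the new line
console.log(lines); // lines is now ['hello', 'wor']
```

Interim results only replace the last entry, while a final result freezes it and opens an empty one, which is what creating a new p does on the page.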

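If the recognized speech contains 유니콘 ("unicorn"), unicorn emoji are logged to the console. The same pattern can be used for voice commands, for example answering with the weather when asked about it.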
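
One way to extend the keyword check to several commands is a lookup table. This is only a sketch: the commands table, the 날씨 ("weather") entry, and runCommands are hypothetical, and a real weather command would need its own data source.

```javascript
// Hypothetical command table: each key is a phrase to look for in the
// transcript, each value is the action to run when the phrase is heard.
const commands = {
  '유니콘': () => '🦄🦄🦄🦄🦄',                 // "unicorn"
  '날씨': () => 'Looking up the weather...',   // "weather"
};

// Returns the outputs of every command whose phrase appears in the transcript.
function runCommands(transcript, table) {
  return Object.keys(table)
    .filter(phrase => transcript.includes(phrase))
    .map(phrase => table[phrase]());
}

console.log(runCommands('오늘 날씨 어때', commands)); // "how is the weather today"
```

Inside the result listener, the call would take the place of the single includes check, ideally only when the result is final so that a command does not fire on every interim update.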

Finally, restarting recognition in response to its end event lets it pick up speech again after the speaker pauses and resumes. Without this line, nothing more is recognized once a recognition session has ended.
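
The restart wiring can be checked with a stand-in object. The sketch below assumes a runtime that provides the EventTarget and Event globals (browsers and recent Node.js versions); keepListening and FakeRecognition are hypothetical names, and the fake only counts calls to start().

```javascript
// Restarts the recognizer every time it ends. The arrow function keeps the
// event object from being passed to start() as an argument.
function keepListening(recognition) {
  recognition.addEventListener('end', () => recognition.start());
  recognition.start();
}

// Stand-in for a real recognizer so the wiring can be tried without a microphone.
class FakeRecognition extends EventTarget {
  constructor() {
    super();
    this.startCount = 0;
  }
  start() {
    this.startCount += 1;
  }
}

const fake = new FakeRecognition();
keepListening(fake);                  // first start
fake.dispatchEvent(new Event('end')); // simulated end of a session
console.log(fake.startCount);         // 2
```

With a real recognizer, keepListening(recognition) would replace the last two lines of the script.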