Technical information about the GVC Emotion Recognition
GVC Neuro – Emo Analysis
Introduction
The GVC – Neuro & Emo Analysis Service detects emotions directly from the voice and from speech samples.
The Emotion Recognition software of Good Vibrations Company is web based. This enables a straightforward implementation in your own software. We use standard protocols like WS, WSS and JSON.
License
GVC provides its’ software on the basis of license agreements. After the license agreement has been concluded, you receive a Token that authorizes you to use the web service until the agreed expiration date.
Delivery of speech data
Speech samples must be delivered using the WS or WSS protocol. We currently support primarily PCM. Other audio formats, like WAV and MP3, are foreseen to be supported in the near future. A server connection is for a mono channel. If you want to handle a stereo channel you could combine the stereo channels to a mono channel or use two separate connections.
Emotion probability responses
The analysis of the speech results in an assessment of the probability of specific emotions in the individuals’ voice. The emotion probabilities of the signal processed via the voice channel will be reported in a fixed interval. The length of the interval in seconds can be configured with the URL key “ResponseTime”. The server will then send a response in JSON format.
Processing of the speech sample
The speech sample is processed in 2 phases. In the first phase we calculate some essential indicators that characterise the voice. These indicators are for example pitch, intensity, spectral slope, dynamics and jitter.
Neural Network
In the second phase we predict the emotions from the indicators using a neural network. The neural network is trained, among others, with the scientifically annotated speech emotion databases of several leading universities.
Connection Setup
To setup a connection you must open a standard WebSocket connection.
Command
Streaming Commands
When the connection is established you must send commands as text frames to the server. The sound data must be send as data frames. The server sends all the responses as text frames.
Audio
The service supports currently only pure PCM which must be coded between 8000 Hz and 48000 Hz. The sample size can be 2 or 3 bytes (16 or 24 bits). The samples can not be interleaved.
Output response format
The service returns all output in JSON.
The response contains the following standard fields.