Advanced Dynamics Silence Analyzer

SALSA Suite v2.5.0+

The SALSA Advanced Dynamics Silence Analyzer is a small add-on module which aims to increase the perceived accuracy of SALSA's automated lipsync processing. Currently, the silence analyzer is considered experimental as it is in its initial release. For this reason, the functionality is separated from the SALSA core system and must be separately added to any character where the feature is desired to be implemented. The goal of this feature is to add more nuance to automated lipsync by increasing timed mouth closings where otherwise the triggered viseme may keep the mouth open.

A Solution for (At Least) Two Problems

While SALSA is great at producing real-time, dynamic lip-sync, it isn't a phoneme mapping/baking solution. However, it does have some tricks up its sleeves for increasing the perceived accuracy of the results. There are a couple of problem areas the Silence Analyzer aims to assist with.

Increasing Smoothness of Animations

Depending on the look-and-feel the designer is after, smoothing out (slowing down) the animation timings may be desirable; however, the way to smooth an animation is to implement more frames for the animation in a given time slice. SALSA was developed from the ground up to be very lightweight and processor friendly. But, lips move fast when speaking...very fast for animation purposes. So how can we get more frames and/or more time? The easiest way is to increase the animation ON timing...this allows the animation to take longer to complete and is visibly smoother. The downside is however, the analysis timing pulse. SALSA periodically looks at a slice of audio and provides analysis and trigger selection. A balance must be implemented to analyze audio quickly enough to capture the dynamics of speech but slowly enough to let an animation get to the ON position. If we fire this pulse very frequently and give the animation ON timing sufficient time to make a smooth transition, the animation rarely has time to get to its fully ON state, resulting in a much diminished animation dynamic...a mumbling or whispering effect. We can compensate for this by shortening the animation ON timing and while we get a more responsive analysis, the short ON timings create a jittery/chattery look and feel. If we increase the analysis pulse timing and increase the ON timing, we get a much smoother look, but begin to lose dynamic response to the audio and loss of perceived accuracy.

Using the Advanced Dynamics Silence Analyzer, we can increase the animation ON timing and increase the processing delay pulse time and still respond more quickly to audio dynamics. The Silence Analyzer system attempts to more accurately detect gaps and micro gaps in audio and implements dynamics processing outside of the normal SALSA audio processing delay (pulse) cycle. The main feature of the Analyzer in this instance is early detection of silence and the ability to shut triggered visemes down when it is detected. This allows for a longer general pulse time to allow slower animation ON timings and a fall-back system to look for silence and preemptively take action. This creates more instances where the mouth animations respond better to the dynamics timings of the audio file -- quicker shutdown of triggered visemes when silence is detected. Keep in mind, as with most things, there is a state of diminishing returns if any of these values begin to exceed certain nuance-related boundaries.

Better Response to Closed-Mouth Phonemes

Generally speaking, the dynamic presentation of lip synchronization via SALSAv2's processing is very good and there are many technologies implemented to increase the perceived accuracy of SALSA lip-synchronized speech. Since SALSA utilizes a very lightweight mechanism to process audio, it can miss occasions where the audio has dropped or the nuance of lip-sync requires a closed-mouthed viseme (such as P, M, B sounds). Since SALSA does not process or map phonemes, it cannot accurately display these visemes, especially at precisely required times.

Instead, the Silence Analyzer tries to identify more scenarios where the sound of a voice briefly stops (or drops out) to create a closed-mouth sound. Of course, this isn't always the case of what happens during speech, so it isn't an all-encompassing, completely accurate solution. However, it does create more instances of open-to-closed-to-open viseme representation where closed-mouth sound dynamics exist. Does it catch them all? No. But we are aiming to keep the goal of ultra-fast audio processing and dynamic lip-synchronization with a very light-weight, real-time implementation. By creating more time-responsive nuance, we can further increase the perceived accuracy of SALSA's lip-sync visualization -- Advanced Dynamics Silence Analyzer can help gain a few more percentage points of perceived accuracy for the SALSA system.

Implementation

As mentioned in the overview, Advanced Dynamics Silence Analyzer is a new feature implementation and has been separated from the core SALSA code and functionality to ensure customers are aware and in control of the implemented features.

With the SALSA 2.5.2.110 update, we have added a configuration fix to the Silence Analyzer to be smarter about working with SALSA configurations. In instances where SALSA is driven by external analysis, the Silence Analyzer cannot be used and removes itself at runtime. If the SALSA configuration is set to wait for an AudioSource, Silence Analyzer will patiently wait for SALSA to find the AudioSource and then respond to SALSA's "found it" event by setting its salsaCanAnalyzeAudio flag to true. This flag has been made public if you require granular control of all configuration. I.e. in the instance SALSA is configured to wait for an AudioSource, the AudioSource is found, and then programatically removed and re-added later.

Add the Silence Analyzer to a GameObject

To add the Silence Analyzer to your character, you can do so from the Component menu option in the Unity Editor:

Component > Crazy Minnow Studio > SALSA LipSync > Add-ons > Silence Analyzer

Additionally, you can drag-and-drop the component on your object. Or select "Add Component" from your GameObject's Inspector tab.

## Configure the Silence Analyzer

Generally speaking, the default settings of the Silence Analyzer will be recommended for most situations. However, the Analyzer has 6 parameters to be aware of and adjust, depending on your audio quality and design requirements:

advanced dynamics silence analyzer settings

The reference to the GameObject's SALSA instance. This is needed to augment SALSA's processing and control. The component should find your SALSA instance automatically if it is placed on the same GameObject.

NOTE: If placed on an object that does not have a SALSA component, the settings will not be visible until a SALSA module is available. This is due to the realtime calculations made and displayed in the Advanced Silence Analyzer.
Silence Threshold: In SALSA's Settings > Advanced Dynamics section, there is a linear scaled cutoffs control. The low cutoff specifies the noise floor of the audio file/input. To use the Silence Analyzer, the low cutoff should be non-zero and likely in the 0.01 - 0.04 range (although this greatly depends on your audio quality). The silence threshold setting reads this cutoff value and establishes it as the high end of the threshold value. This is a normalized range [0.0 .. 1.0] and can be considered a percentage of the low-cutoff value. To the right of this setting, you will see a calculation of the effective value.
StartPoint: The starting position of silence analysis based on SALSA's audio processing delay time. For example, if the audio process delay time is 0.09f, the StartPoint can be set from 0.0 to 0.9f (effective) using a normalized scale [0.0 .. 1.0]. The computed effective value is displayed to the right of this setting.
EndPoint: The ending point of silence analysis is based on the StartPoint and SALSA's audio processing delay time. Effectively, this selects the end point of a range from the StartPoint to the delay time. It is also a normalized scale [0.0 .. 1.0] with 0.0 representing the StartPoint and 1.0 representing the remaining range, up to the delay time.
Silence Weight: This is another normalized value that controls the acceptable limits of samples exceeding the silence threshold. Think of this as: a percentage of samples required to meet (be under) the silence threshold limits.
Buffer Size: Pretty straight forward -- this is the number of audio sample points used to look for silence breaks in the audio. Fewer samples are more likely to find gaps, more samples are less likely. But, bare in mind, it isn't necessarily desirable to find every little, tiny piece of "silence". It is a balance and likely the default settings are going to work well.
There is also a counter indicator at the bottom of the panel and this indicates a running tally of detected and processed silence. This is purely an indication of how much (or little) the component is assisting SALSA in implementing viseme closures on silence detection.

NOTE: it is possible to substitute custom silence detection code for the built-in processing. See the API documentation below.

# API()

Salsa salsaInstance

Reference to the SALSA instance Silence Analyzer manipulates.

bool salsaCanAnalyzeAudio

This flag operates as a partial guard-clause for Silence Analyzer processing. If programatically removing the AudioSource from the SALSA instance, set this flag to false prior to removing the AudioSource or its reference.

salsaCanAnalyzeAudio = false;

[Range(0f,1f)] float silenceThreshold = 0.8f

A normalized representation of the maximum audio sample level which can be considered "silence".

[Range(0f,1f)] float timingStartPoint = 0.4f

A normalized representation of the start point during the audio analysis timing pulse where silence detection can be processed.

[Range(0f,1f)] float timingEndVariance = 1.0f

A normalized representation of the end point during the audio analysis timing pulse where silence detection can be processed. A value of 0.0f is zero variance from the start point. A value of 1.0f is the max time of the audio analysis timing pulse. This value (1.0f) is recommended to allow silence processing to occur from the start point all the way up to the full time of the audio analysis delay timing pulse.

[Range(0f,1f)] float silenceSampleWeight = 0.9f

A normalized representation of the minimum percentage of sample points that must fall below the silenceThreshold. The default value (0.9f) requires 90% of sample values to fall within the "silence" threshold limits.

int bufferSize = 512

The number of audio samples to be processed for silence detection. This value will vary depending on your audio sample rate/quality and your desired design implementation requirements. It is recommended to set bufferSize to a value representative of the audio frequency being analyzed. For example, if using a microphone at 11KHz, bufferSize of ~192 may be more appropriate.

SilenceAnalyzer delegate bool silenceAnalyzer: float[]

An optional delegate reference which can be used to substitute custom code for the analysis portion of the Silence Analyzer's processing. See the related SALSA information discussing processing delegates.

Substituting a Custom Silence Analyzer

Below is a sample implementation of a Silence Analyzer substitution.

using CrazyMinnow.SALSA;
using UnityEngine;

public class CustomSilenceAnalysisPlugin : MonoBehaviour {

    // In Start() or Awake() set the delegate (silenceAnalyzer) to your custom
    // method call.

    void Awake()
    {
        GetComponent<SalsaAdvancedDynamicsSilenceAnalyzer>().silenceAnalyzer = CustomSilenceAnalysis;
    }

    private bool CustomSilenceAnalysis(float[] audioData)
    {
        bool isSilenceDetected = false;

        // loop through the first channel of the data slice.
        var channels = salsaInstance.clipChannels();
        for (int i = 0; i < audioData.Length; i += channels)
        {
            // ...custom silence detecting code...
        }
        return isSilenceDetected;
    }
}