by Lawrence McGuire

NKOAPP is an application written in Python that makes use of open source programs such as Praat for sound analysis and the synthesis backend of VocalTractLab, allowing advanced control of vocal tract shapes and glottis properties over time.

The initial goals for this system were to produce both 'natural' and 'unnatural' sounding vocalizations and articulations, and to interpolate between mouth and glottis configurations, allowing continuous changes of invariant parameters (physical properties) and variant parameters (f0, intensity, filtering coefficients) so as to move from one 'speaker identity' to another. Using features from actual voice recordings then became an important way of working with the system: variant parameters such as f0 and loudness address the prosodic, intonation, jitter, and shimmer patterns found in the recordings.

The source code will be made available this summer.

In broad terms the synthesis procedure follows the source-filter paradigm: the vocal fold vibration serves as the excitation source, and the vocal tract filters the resulting glottal pulses.
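The paradigm can be sketched in a few lines of Python. This is a deliberately crude stand-in for VTL's actual aeroacoustic simulation: an impulse train approximates the glottal pulses, and two second-order resonators (with made-up, roughly /a/-like formant values) play the role of the vocal tract filter.

```python
import numpy as np
from scipy.signal import lfilter

sr = 44100                      # sample rate (Hz)
f0 = 120.0                      # fundamental frequency of the source (Hz)
n = int(sr * 0.5)               # half a second of audio

# Source: a naive stand-in for glottal pulses -- an impulse train at f0
source = np.zeros(n)
source[:: int(sr / f0)] = 1.0

# Filter: cascade of two second-order resonators acting as formants
# (illustrative /a/-like center frequencies and bandwidths)
signal = source
for freq, bw in [(700.0, 130.0), (1200.0, 90.0)]:
    r = np.exp(-np.pi * bw / sr)          # pole radius from bandwidth
    theta = 2.0 * np.pi * freq / sr       # pole angle from center frequency
    signal = lfilter([1.0 - r], [1.0, -2.0 * r * np.cos(theta), r * r], signal)
```

Swapping the impulse train for a real glottal flow model, or the resonators for a full tract area-function simulation, changes the voice quality and the vowel respectively, which is exactly the separation the source-filter view exploits.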

VocalTractLab is developed as software for high-precision articulatory speech synthesis and analysis. Its fully parametric human speech model was meant to be used by speech scientists and phoneticians to simulate and analyze, non-intrusively (unlike laryngoscopy), the articulator movements involved in various speech tasks. As a music-making tool, the vanilla VocalTractLab interface lets the composer work only within certain ranges, which are implemented to simulate natural speech as accurately as possible. The motivation for hacking this control system is to break free from these humanized constraints, which are especially present in the simulation of the system's inertia. Dissolving the idea of a fixed spatiotemporal resolution for each kind of vocal sound allows the synthesis of humanly impossible articulations. Changing between modes of phonation, i.e. modes of excitation of the glottis model (e.g. modal, breathy, open), can now also be done at very high modulation rates, up to 400 Hz.

The control system of NKOAPP works with interpolations between sequences of presets for the glottis and vocal tract models: one starts with a source preset and ends with a target preset. A preset is the list of physical parameters needed to synthesize a particular phone. To produce an /a/ in modal phonation, for example, one preset is required for the vocal tract shape (the shape of /a/) and one for the glottis excitation mode (modal phonation), forming a preset pair. These preset pairs are retrieved from what is called a speaker file (see figure 1), which defines a model speaker and comprises the speaker's anatomy, the vocal tract shapes used to produce individual phones, and the model properties for the glottal excitation. Adding and removing presets is done by a module that edits and reconfigures the entries of the speaker JSON file. Table 1 lists the variable names for both the vocal tract shape and the glottis excitation mode. For the glottis one has to choose the geometry of the vocal fold model, which cannot be changed during synthesis; see figure 2 (slowed down 47x).
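A minimal sketch of the speaker-file idea, with the caveat that the field names and values below are invented for illustration and do not reflect NKOAPP's actual schema:

```python
import json

# Minimal illustrative speaker structure; keys and values are assumptions,
# not NKOAPP's real speaker-file schema
speaker = {
    "vocal_tract_shapes": {"a": {"JX": -0.3, "JA": -4.0, "LP": 0.1}},
    "glottis_modes": {"modal": {"f0": 120.0, "subglottal_pressure": 800.0}},
}

def preset_pair(speaker, tract_name, glottis_name):
    """Retrieve one vocal-tract / glottis preset pair, e.g. ('a', 'modal')."""
    return (speaker["vocal_tract_shapes"][tract_name],
            speaker["glottis_modes"][glottis_name])

def add_preset(speaker, tract_name, params):
    """Add or reconfigure a vocal tract shape entry; the caller would
    write the returned JSON back to the speaker file."""
    speaker["vocal_tract_shapes"][tract_name] = params
    return json.dumps(speaker, indent=2)

tract, glottis = preset_pair(speaker, "a", "modal")
```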

The musician can generate any number of source-target pairs, either by manual input or by automated sequencing. Interpolating between these pairs is then possible with a choice of two spline interpolators: cubic spline and Akima spline interpolation. Each source-target pair can also be differentiated to achieve cubic, squared, linear, or stepped curves; the choice of derivative order has a significant impact on whether a human quality is present. See figure 3 for an example of the synthesis of the phrase \aoɢ\.
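Both interpolator families are available off the shelf in SciPy, which is a plausible way to realize them in a Python system like this one. The sketch below interpolates a single vocal tract parameter (jaw angle, JA) across a sequence of preset time points; the times and values are illustrative, not taken from a real speaker file.

```python
import numpy as np
from scipy.interpolate import Akima1DInterpolator, CubicSpline

# One vocal tract parameter (jaw angle, JA) across five presets;
# times and values are illustrative
times = np.array([0.0, 0.2, 0.4, 0.6, 0.8])    # preset time points (s)
ja = np.array([0.0, -4.0, -1.5, -6.0, -2.0])   # JA (degrees) per preset

frame_times = np.arange(0.0, 0.8, 0.0025)      # one frame every 2.5 ms

cubic = CubicSpline(times, ja)(frame_times)    # smooth, can overshoot
akima = Akima1DInterpolator(times, ja)(frame_times)  # local, less overshoot
```

The practical difference is that the cubic spline may overshoot between presets (extra articulator excursion), while the Akima spline stays closer to the preset values, which is audible as a stiffer trajectory.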
The individual duration between segments is then either chosen or generated; it is always related to the total duration and the number of source-target pairs (the ratio R). All the glottis and vocal tract frames (one frame per 110 samples at 44100 Hz, i.e. 2.5 ms time steps) are then combined into a single text file, called a tractSequence file, which serves as the score for the VTL synthesis engine. Interfacing and synthesis are done by changing variable values and running the program within a Python environment.
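The frame-assembly step can be sketched as follows. Note that this is a simplified serialization for illustration only: the real tractSequence file also carries a VTL-specific header and parameter ordering, which are omitted here.

```python
# Sketch of serializing paired glottis / vocal tract frames into a
# tractSequence-style text score (the real VTL file also carries a
# header and a fixed parameter order, omitted here)
def tract_sequence_text(glottis_frames, tract_frames):
    """One frame = one line of glottis values plus one line of tract values."""
    lines = []
    for g, t in zip(glottis_frames, tract_frames):
        lines.append(" ".join(f"{v:.6f}" for v in g))  # glottis parameters
        lines.append(" ".join(f"{v:.6f}" for v in t))  # vocal tract parameters
    return "\n".join(lines) + "\n"

# 110 samples per frame at 44100 Hz gives ~2.5 ms time steps
frame_dur = 110 / 44100
```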

The system can be used in multiple ways, but the two generalized usages are 'isolation' and 'combination'. In isolation, the system generates sequences of phonations and articulations based on a set of chosen presets, durations, interpolation types, and parameter modulations. It can be used both for the synthesis of accurate natural speech and for extreme articulations: depending on the ratio R of total duration to number of interpolations, it is possible to impose extremely fast (small R) or slow (large R) rates of articulator movement.
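As a small worked example of the ratio (with illustrative numbers, not values from the system):

```python
# The ratio R of total duration to number of source-target pairs governs
# how fast the articulators move; numbers here are illustrative
total_duration = 2.0   # seconds
n_pairs = 40           # source-target preset pairs

R = total_duration / n_pairs             # 0.05 s per transition: very fast
frames_per_transition = round(R / (110 / 44100))  # at 2.5 ms frame steps
```

At R = 0.05 s each transition spans only about 20 synthesis frames, far below the 50-125 ms a human tongue needs for a vowel-to-consonant transition.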
For example, the spatiotemporal resolution of tongue movements (vowel-to-consonant transitions) produced by a person lies between 50 and 125 ms at a spatial resolution of 2.8 to 3.8 mm² [ref. 1]. In the system, the temporal resolution for any kind of speech task is bounded only by the chosen sample rate: at 44.1 kHz the highest resolution is 2.5 ms, at a spatial resolution dependent on the initial physical constraints. For example, the constraint on jaw movement along the X axis (JX) is [-0.500 mm; 0.000 mm] and on the jaw angle (JA) is [-7.000°; 0.000°]. See figure 4 [ref. 1] for the spatiotemporal resolutions of certain speech tasks. Listen to audio experiments here:

In isolation

In combination, SV stands for Synthesized Voice and V for Voice. The MASTER files are stereo versions of the SV and V files together, panned left and right respectively.

Future development includes an API and GUI for interfacing; the implementation of an iterative optimization algorithm for parameter estimation (filtering coefficients and vocal tract shapes) to approach a given perceptual feature or representation; and moving the sound analysis into the Python environment, mainly using the package Librosa, so that Praat is no longer needed.
Figure 5 shows a detailed diagram of how I use this system. Keep in mind that the system uses offline synthesis rendering.
Ref. 1: J Magn Reson Imaging. 2016 Jan; 43(1): 28–44. Published online 2015 Jul 14. doi: 10.1002/jmri.24997

Figure 1: Speaker tree diagram of file structure

Vocal Tract            |                Glottis Excitation Mode
(positions/angles)     | Geometric           | 2-Mass               | Triangular
-----------------------+---------------------+----------------------+---------------------
Horz. hyoid pos. (HX)  | f0                  | f0                   | f0
Vert. hyoid pos. (HY)  | Subglottal pressure | Subglottal pressure  | Subglottal pressure
Horz. jaw pos. (JX)    | Lower displacement  | Lower rest displ.    | Lower rest displ.
Jaw angle (JA)         | Upper displacement  | Upper rest displ.    | Upper rest displ.
Lip protrusion (LP)    | Chink area          | Extra arytenoid area | Extra arytenoid area
Lip distance (LD)      | Phase lag           | Damping factor       | Aspiration strength
Velum shape            | Relative amplitude  |                      |
Velic opening          | Double pulsing      |                      |
Tongue body X          | Pulse skewness      |                      |
Tongue body Y          | Flutter             |                      |
Tongue tip X           | Aspiration strength |                      |
Tongue tip Y           |                     |                      |
Tongue blade X         |                     |                      |
Tongue blade Y         |                     |                      |
Tongue side elev. 1    |                     |                      |
Tongue side elev. 2    |                     |                      |
Tongue side elev. 3    |                     |                      |
Table 1: Control parameters for the glottis and vocal tract (presets contain these values)
Figure 2: Different glottis models/geometries for the same phonation of \a\.

Figure 3: Interpolations between glottis presets (geometric model) [modal, pressed, whisper] and vocal tract presets \a\, \o\, and \ɢ\

Figure 4: Spatiotemporal resolution of speech tasks

Figure 5: Step-by-step workflow with NKOAPP