System Overview - NK(O)APP

NKOAPP SYSTEM OVERVIEW
Technical Documentation

□ INTRODUCTION

NKOAPP is a specialized speech synthesis system that uses glottal and vocal tract parameters to generate realistic speech sounds. The system employs a pipeline architecture that processes parameters through various stages, from target configuration definition to final audio output, using the VocalTractLab API for the actual audio synthesis.

This document provides a comprehensive technical overview of the entire system, its architecture, primary components, and operational workflows. The system follows a modular architecture with distinct components handling specific parts of the speech synthesis pipeline.

□ SYSTEM ARCHITECTURE

The NKOAPP system follows a modular architecture with three main processing segments:

1. Configuration & Parameter Definition

Handles speaker data model (speakerJD2.json), glottal models, vocal tract shapes, and operational mode selection (manual/automatic).

2. Parameter Processing & Interpolation

Processes target sequences, applies spline interpolation, integrates f0 and intensity data, and generates tract sequences.

3. Audio Synthesis

Converts tract sequences to audio files using the VocalTractLab synthesis backend.

□ DATA PROCESSING WORKFLOW

The following diagram illustrates the data processing workflow within the NKOAPP system:

┌─────────────────────────┐

│ speakerJD2.json File │

└───────────┬─────────────┘

↓

┌───────────┴─────────────┐

│ JSON Reader Module │

│ (Extract Parameters) │

└───────────┬─────────────┘

↓

┌───────┴───────┐

│ Mode Check │

└───┬───────┬───┘

Manual │ │ Automatic

↓ ↓

┌──────────┐ ┌─────────────┐

│ User │ │ f0/Intensity│

│ Input │ │ Extractor │

└────┬─────┘ └──────┬──────┘

└────────┬────────┘

↓

┌─────────────────────────┐

│ Interpolation Engine │

│ (Spline/Akima Methods) │

└───────────┬─────────────┘

↓

┌───────────┴─────────────┐

│ Parameter Congealer │

│ (Glottis + Tract) │

└───────────┬─────────────┘

↓

┌───────────┴─────────────┐

│ Tract Sequence File │

└───────────┬─────────────┘

↓

┌───────────┴─────────────┐

│ VocalTractLab API │

└───────────┬─────────────┘

↓

┌───────────┴─────────────┐

│ Audio Output (.wav) │

└─────────────────────────┘

This pipeline highlights how the system processes data from initial configuration to final audio output, with different paths based on the selected mode (manual or automatic).

□ CORE COMPONENTS

1. NKOAPP.py (Main Orchestrator)

The main script that orchestrates the entire speech synthesis process. It:

Initializes system parameters and constants
Loads speaker configurations from JSON
Manages operational mode selection
Coordinates all processing modules
Generates final tract sequence files

Operational Modes:

manual = True  # Manual mode: user-defined sequences
manual = False # Automatic mode: f0/intensity data-driven
                        

Manual Mode: Users define target sequences explicitly through glOption and trOption arrays with desired glottal modes and tract configurations, along with durModulation arrays for timing control.

Automatic Mode: The system uses f0 and intensity data from external files to determine target sequences and transitions automatically.

2. Glottal Models

The system supports three different glottal models, each with different parameters:

Geometric Model

10 parameters controlling fundamental geometric properties of vocal fold vibration

Two-Mass Model

7 parameters modeling the vocal folds as a coupled oscillator system

Triangular Model

5 parameters using a simplified triangular glottal pulse shape

The selection of the glottal model affects the number of parameters to be interpolated and the overall speech synthesis process. Each model offers different trade-offs between computational complexity and physiological accuracy.

3. NKOAPP_JSON_Reader.py

This module handles extracting and parsing data from the speakerJD2.json file, which contains configurations for both vocal tract and glottis parameters.

Key Functions:

glotValues()  # Extracts glottis parameter values 
              # for different glottal modes

tractValues() # Extracts vocal tract parameter values 
              # for different vocal tract configurations

brownian()    # Generates brownian motion for 
              # randomized parameter trajectories
                        

4. NKOAPP_f0andIntensityExtractor.py

Provides functions for handling fundamental frequency (f0) and intensity data:

f0andIntensityReaderA() # Parses f0 and intensity values 
                        # from external files

f0andIntensityReaderB() # Generates synthetic f0 and intensity 
                        # values when external data is not available
                        

This module is essential for the automatic mode operation, providing the prosodic contour that drives the synthesis process.

5. NKOAPP_scripts.py

Contains the core interpolation and parameter processing functions:

splineInterpolatorAk()
    Performs cubic spline interpolation 
    for vocal tract parameters

splineInterpolatorAkSegmenter()
    Segments the interpolated parameter arrays

glotInterpolator()
    Interpolates glottal parameters and 
    integrates f0/intensity data

congealer()
    Combines glottal and vocal tract parameters 
    into a single sequence

targetNames()
    Maps target names to actual parameter values
                        

These functions form the computational core of the system, transforming discrete target configurations into continuous parameter trajectories suitable for audio synthesis.

□ SPEAKER DATA MODEL

The system relies on a structured JSON file (speakerJD2.json) that contains all necessary configurations for the vocal tract and glottis models:

Anatomical parameters: Physical dimensions of the vocal tract
Vocal tract shapes: Predefined articulatory configurations for different phones
Glottal configurations: Parameter sets for different phonation modes
Default values: Baseline settings for all synthesis parameters

This centralized data model allows for easy modification of speaker characteristics and ensures consistency across the synthesis pipeline.

□ AUDIO SYNTHESIS PIPELINE

The final step in the synthesis process is converting the tract sequence file to audio using the VocalTractLab API:

Current Implementation:

# Manual step:
Generate tract sequence file (.txt)
Open VocalTractLab GUI application
Load tract sequence file
Export as audio (.wav)
                        

Automated Implementation (Optional):

# Using tract2audio.py module:
import tract2audio
tract2audio.synthesize(
    tract_sequence_file='output.txt',
    output_audio='output.wav'
)
                        

The VocalTractLab backend uses articulatory synthesis principles to convert the continuous parameter trajectories into acoustic waveforms, providing high-quality, physiologically-grounded speech synthesis.

□ INTERPOLATION METHODS

NKOAPP supports multiple interpolation methods to control the smoothness and character of parameter transitions:

Cubic Spline Interpolation

Produces smooth, continuous trajectories with continuous first and second derivatives. Ideal for natural-sounding articulation.

Akima Spline Interpolation

Reduces overshoot and oscillation compared to cubic splines, providing more localized control over parameter trajectories.

Derivative Order Control

Each source-target pair can be differentiated to achieve cubic, squared, linear, or stepped curves. The choice significantly impacts whether a human quality is present in the output.

□ TECHNICAL SPECIFICATIONS

Sample Rate: 44.1 kHz (configurable)

Temporal Resolution: 2.5 ms per frame (110 frames at 44.1 kHz)

Vocal Tract Parameters: 17 parameters (see Table 1 on main NKOAPP page)

Glottis Parameters: 5-10 parameters (model-dependent)

Supported File Formats:

Input: JSON (speaker data), TXT (f0/intensity)
Output: TXT (tract sequences), WAV (audio)

□ TYPICAL USAGE WORKFLOW

1. Configuration Phase

Edit speakerJD2.json with desired vocal tract shapes and glottal configurations
Choose operational mode (manual or automatic)
Select glottal model type
Set interpolation method and temporal parameters

2. Target Definition Phase

Manual: Define glOption, trOption, and durModulation arrays
Automatic: Prepare f0 and intensity data files

3. Processing Phase

Run NKOAPP.py
System generates interpolated parameter trajectories
Tract sequence file is created

4. Synthesis Phase

Load tract sequence into VocalTractLab
Synthesize audio output
Export as WAV file

□ CONCLUSION

NKOAPP provides a flexible framework for speech synthesis based on physiological models of the vocal tract and glottis. Its modular design allows for both detailed manual control and automatic generation of speech parameters, making it suitable for research and experimental speech synthesis applications.

The system's strength lies in its ability to produce both highly naturalistic and deliberately unnatural vocalizations, making it particularly valuable for artistic and exploratory applications in voice synthesis and electroacoustic music composition.

Repository: github.com/hogobogobogo/NKOAPP