Last semester I did a project for my phonetics class that I never got around to writing about. The long-term goal is to detect sentence type, emotion, and other characteristics in human speech, but for the project I only dealt with intonation detection.

The project can be found at, and the report I wrote is here. The report focuses on the linguistics side of the project, but I wanted to describe more of the programming side.

The first step in analyzing intonation is finding the fundamental frequency (pitch) of each input sample. At first I tried implementing my own simple algorithm using numpy's FFT implementation. It gave me some useful data, but not much. There were several things I could have tried to get better results from it, but while researching I came across aubio. Aubio is "a tool designed for the extraction of annotations from audio signals", including pitch detection. The library is written in C, but it has Python bindings generated with SWIG, and it was perfect for what I needed. Essentially, I fed the data into aubio, filtered it, and generated graphs from it.
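For reference, here is a rough sketch of the kind of simple FFT approach described above. It is not the code from the project: the function name `estimate_pitch`, the 50-500 Hz search band, and the synthetic test tone are all assumptions made for illustration. It picks the strongest spectral peak in the band, which is one reason this kind of approach gives limited results on real speech, where a harmonic can be stronger than the fundamental.

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=50.0, fmax=500.0):
    """Return the frequency (Hz) of the strongest FFT peak between fmin and fmax.

    Hypothetical helper: the band limits are illustrative defaults,
    not values taken from the project.
    """
    # A Hann window reduces spectral leakage at the frame edges.
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # Only look at bins inside the plausible speech pitch range.
    band = (freqs >= fmin) & (freqs <= fmax)
    if not band.any():
        return 0.0
    return float(freqs[band][np.argmax(spectrum[band])])

# Synthetic 120 Hz tone standing in for one frame of recorded speech.
sample_rate = 8000
t = np.arange(4096) / sample_rate
frame = np.sin(2 * np.pi * 120.0 * t)
pitch = estimate_pitch(frame, sample_rate)
```

The estimate can only be as fine as the FFT bin spacing (sample rate divided by frame length, a little under 2 Hz here), so longer frames give better frequency resolution at the cost of time resolution.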

First I created an array of Pitch objects:

# Represents a single pitch at a single point in time
class Pitch:
    def __init__(self, pitch, time, intonation=0):
        self.pitch = pitch
        self.time = time
    # other fields and methods...

Then I filtered out the data points I didn't care about: points where aubio detected no pitch, and points where the pitch was above some limit (for my voice it was about 170Hz). After that I classified each pitch as increasing or decreasing based on the pitch before it, and summed consecutive pitch changes that had the same classification.
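The filtering and classification steps could look roughly like the sketch below. It is an illustration rather than the project's code: the `Pitch` dataclass is a stripped-down stand-in, the helper names `filter_pitches` and `merge_changes` are made up, and it assumes that a frame with no detected pitch is reported as 0.

```python
from dataclasses import dataclass

@dataclass
class Pitch:
    pitch: float  # frequency in Hz
    time: float   # seconds from the start of the recording

def filter_pitches(pitches, max_hz=170.0):
    """Drop frames with no detected pitch (assumed to be 0) or above max_hz."""
    return [p for p in pitches if 0 < p.pitch <= max_hz]

def merge_changes(pitches):
    """Sum consecutive pitch changes that share a direction.

    Returns a list of signed totals in Hz: positive for a rising run,
    negative for a falling run.
    """
    runs = []
    for prev, cur in zip(pitches, pitches[1:]):
        delta = cur.pitch - prev.pitch
        if delta == 0:
            continue  # no change, so nothing to classify
        if runs and (runs[-1] > 0) == (delta > 0):
            runs[-1] += delta   # same direction: extend the current run
        else:
            runs.append(delta)  # direction flipped: start a new run
    return runs

raw = [
    Pitch(100.0, 0.0), Pitch(0.0, 0.1), Pitch(110.0, 0.2), Pitch(300.0, 0.3),
    Pitch(125.0, 0.4), Pitch(115.0, 0.5), Pitch(105.0, 0.6),
]
kept = filter_pitches(raw)
runs = merge_changes(kept)  # [25.0, -20.0]: up 25 Hz, then down 20 Hz
```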

The real key to the process seems to be the next step: filtering the intonations. I decided to use a "vertical filter" based on a pitch change threshold. Any pitch changes smaller than the threshold were discarded before calculating the final intonation pattern. This is the part where a lot more work could be done, because a single fixed threshold is an overly simplistic way to treat the input. A better strategy would be to weight each pitch based on how much it changed and use the weights in determining the intonation pattern.
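As a minimal sketch of the threshold idea, assume the input is a list of signed pitch changes in Hz, one per run of same-direction movement. The function name `intonation_pattern` and the 10 Hz default are hypothetical. Collapsing neighbouring runs that end up with the same label once a small change between them is dropped is my own choice here, not something taken from the project.

```python
def intonation_pattern(runs, threshold=10.0):
    """Name the contour after discarding changes smaller than threshold (Hz)."""
    labels = []
    for change in runs:
        if abs(change) < threshold:
            continue  # too small to count as a real rise or fall
        label = "rising" if change > 0 else "falling"
        # Avoid repeating a label when a discarded blip separated two
        # movements in the same direction.
        if not labels or labels[-1] != label:
            labels.append(label)
    return "-".join(labels)

pattern = intonation_pattern([25.0, -3.0, 12.0, -20.0, 4.0, 18.0])
# pattern == "rising-falling-rising"
```

The weighted alternative would replace the hard cutoff with a contribution proportional to `abs(change)`, so that small changes count for less instead of being thrown away entirely.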

The plots were created with pylab:

from pylab import cla, clf, plot, savefig  
# ...
# generate lists from pitch objects
times = [p.time for p in pitches]  
values = [p.pitch for p in pitches]  
plot(times, values)  

The cla() and clf() functions clear the axes and figure respectively. They were necessary because pylab seemed to keep the old plot around in some global state and would just plot the new pitches on top of the old ones.

The end result was taking input like this:


and turning it into this:


and giving it the classification "rising-falling-rising."

Tags: linguistics, python