I recently wanted to know what the most frequent English alliterations are, so I figured it out. This is by no means a difficult programming task, but I thought it was at least interesting.

The first problem was where to get the data. After a bit of searching, I found wordfrequency.info. You can pay to download up to 155 million n-grams, but you can get the 1 million most frequent 2-, 3-, 4-, and 5-grams for free.

Along the way I learned about a useful Python module, fileinput, that iterates over lines from multiple input streams. Here's the simple script I wrote:

import argparse
import fileinput

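def find_alliteration(line, smallest_seq):
    """Return line if it has smallest_seq consecutive words sharing a first letter."""
    words = line.split()
    if not words:
        # Blank lines have no words to inspect
        return None
    seq = 1
    letter = words[0][0]
    for word in words[1:]:
        if word[0] == letter:
            seq += 1
            if seq >= smallest_seq:
                return line
        else:
            # Run broken: start counting again from this word
            letter = word[0]
            seq = 1
    return None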

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--number", type=int, default=3,
                        help="minimum number of words to count as alliteration")
    parser.add_argument("files", nargs=argparse.REMAINDER)
    args = parser.parse_args()

    count = 0
    # The with-block closes output.txt even if a read fails partway through
    with open('output.txt', 'w') as out:
        for line in fileinput.input(args.files):
            found = find_alliteration(line, args.number)
            if found:
                count += 1
                out.write(found)

    print("found " + str(count) + " alliterations")

I ran this on 3 million lines of frequency-tagged n-grams. Here are the five most common 3-, 4-, and 5-gram alliterations in English (with frequencies out of 450 million words):

3-gram alliterations

17708  to try to
13413  to talk to
10364  to take the
9670   the things that
7925   think that the

4-gram alliterations

1201  to talk to the
1090  to tell the truth
721   take the time to
545   to think that the
525   to talk to them

5-gram alliterations

327  again and again and again
163  to take the time to
116  ha ha ha ha ha
75   taking the time to talk
74   the time to talk to

Some other interesting ones were "the tears that threatened to", "what we want when we", "as an administrative assistant at", and "tempting to think that the."
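
The script above only filters; ranking the matches by frequency is a separate step that I haven't shown. Here is a minimal sketch of one way to do it, assuming each line starts with an integer frequency followed by the words, as in the lists above. The function name top_alliterations is made up for illustration and is not part of the original script.

```python
import heapq


def top_alliterations(lines, n=5):
    """Return the n (frequency, phrase) pairs with the highest frequency."""
    entries = []
    for line in lines:
        fields = line.split()
        # Assumption: the first field is an integer frequency.
        # Lines that don't fit that shape are skipped.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        entries.append((int(fields[0]), " ".join(fields[1:])))
    # nlargest returns the entries sorted from highest to lowest frequency
    return heapq.nlargest(n, entries, key=lambda entry: entry[0])


if __name__ == "__main__":
    # Hypothetical sample lines in the assumed "frequency, then words" layout
    sample = [
        "1201\tto\ttalk\tto\tthe\n",
        "17708\tto\ttry\tto\n",
        "13413\tto\ttalk\tto\n",
    ]
    for freq, phrase in top_alliterations(sample, n=2):
        print(freq, phrase)
```

To get a separate ranking for each n-gram size, one option is to run the filter on one n-gram file at a time and pass the resulting output.txt to this function, for example with open('output.txt') as the lines argument.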

Tags: linguistics, python, programming