I recently wanted to know what the most frequent English alliterations are, so I figured it out. This is by no means a difficult programming task, but I thought it was at least interesting.
The first problem is where to get the necessary data. After a bit of searching, I found wordfrequency.info. You can pay to download up to 155 million n-grams, but you can get the 1 million most frequent 2-, 3-, 4-, and 5-grams for free.
Along the way I learned about a useful Python module,
fileinput, which iterates over lines from multiple input streams. Here's the simple script I wrote:
import argparse
import fileinput

def find_alliteration(line, smallest_seq):
    """Return the line if it contains a run of at least smallest_seq
    consecutive words starting with the same letter, else None."""
    seq = 1
    words = line.split()
    if not words:
        return None
    letter = words[0][0]
    for word in words[1:]:
        if word[0] == letter:
            seq += 1
            if seq >= smallest_seq:
                return line
        else:
            letter = word[0]
            seq = 1
    return None

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--number", type=int, default=3,
                        help="minimum number of words to count as alliteration")
    parser.add_argument("files", nargs=argparse.REMAINDER)
    args = parser.parse_args()
    count = 0
    with open("output.txt", "w") as out:
        for line in fileinput.input(args.files):
            found = find_alliteration(line, args.number)
            if found:
                count += 1
                out.write(found)
    print("found " + str(count) + " alliterations")
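The matching logic is easy to sanity-check in isolation. Here is a compact copy of find_alliteration from the script, run on two sample lines (the lines and their leading counts are made up for illustration):

```python
def find_alliteration(line, smallest_seq):
    # Return the line if it contains a run of at least smallest_seq
    # consecutive words sharing a first letter, else None.
    seq = 1
    words = line.split()
    if not words:
        return None
    letter = words[0][0]
    for word in words[1:]:
        if word[0] == letter:
            seq += 1
            if seq >= smallest_seq:
                return line
        else:
            letter = word[0]
            seq = 1
    return None

# "the things that" is a run of three t-words, so the line is returned.
print(find_alliteration("9670 the things that", 3))
# No run of three here, so this returns None.
print(find_alliteration("1234 over the moon", 3))
```

Note that the leading frequency count participates in the scan, but since it resets the run at worst, it never produces a false positive.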
I ran this on 3 million lines of frequency-tagged n-grams. Here are the five most common 3, 4, and 5-gram alliterations in English (with frequencies out of 450 million words):
17708  to try to
13413  to talk to
10364  to take the
9670   the things that
7925   think that the

1201  to talk to the
1090  to tell the truth
721   take the time to
545   to think that the
525   to talk to them

327  again and again and again
163  to take the time to
116  ha ha ha ha ha
75   taking the time to talk
74   the time to talk to
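Since the free n-gram files carry the frequency as the first field of each line, pulling the top entries out of output.txt is just a numeric sort on that field. A minimal sketch, with a few made-up lines standing in for the real file:

```python
# Sample matched lines in the "count n-gram" format the script writes out.
lines = [
    "9670 the things that",
    "17708 to try to",
    "13413 to talk to",
]

# Sort by the leading frequency count, highest first.
top = sorted(lines, key=lambda l: int(l.split()[0]), reverse=True)
for entry in top[:5]:
    print(entry)
```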
Some other interesting ones were "the tears that threatened to", "what we want when we", "as an administrative assistant at", and "tempting to think that the."

Tags: linguistics, python, programming