Counting word frequencies in Shakespeare's sonnets

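I wrote this in Ruby. Shakespeare's complete works are archived here. For the sonnets, I used this part of the archive. The script prints every word that is used 10 or more times.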

Here are the results. It is interesting that love (160), sweet (55), time (53), beauty (52), eyes and eye (51 and 38), and heart (50) all rank near the top.

and: 489, the: 444, to: 409, of: 371, my: 364, i: 341, in: 323, that: 320, thy: 266, thou: 234, with: 181, for: 171, is: 169, not: 167, but: 164, me: 164, a: 163, thee: 162, love: 160, so: 145, be: 141, as: 121, all: 117, you: 110, which: 107, his: 107, when: 106, it: 104, this: 103, by: 92, your: 89, doth: 88, do: 84, from: 82, on: 80, or: 79, no: 78, have: 75, then: 75, what: 70, are: 69, if: 68, more: 64, their: 63, mine: 63, shall: 59, sweet: 55, will: 53, time: 53, they: 53, beauty: 52, nor: 52, her: 51, eyes: 51, art: 51, heart: 50, yet: 50, o: 48, than: 48, can: 45, now: 44, should: 44, thine: 44, make: 43, hath: 43, one: 43, where: 42, he: 42, still: 41, how: 40, fair: 40, eye: 38, him: 37, am: 35, see: 35, she: 34, like: 34, true: 33, those: 33, though: 33, being: 32, such: 31, some: 31, every: 31, own: 30, were: 30, may: 29, live: 29, dost: 29, was: 29, myself: 29, who: 29, upon: 29, say: 28, praise: 28, love's: 27, give: 27, world: 27, most: 27, at: 26, might: 26, let: 26, did: 26, day: 25, why: 25, even: 24, since: 24, life: 23, new: 23, show: 23, truth: 22, look: 22, well: 22, old: 22, night: 22, dear: 21, thyself: 21, best: 21, thus: 21, must: 21, would: 21, these: 20, part: 20, whose: 19, worth: 19, false: 19, face: 19, nothing: 19, made: 19, better: 19, alone: 19, our: 18, beauty's: 18, away: 18, too: 18, thoughts: 18, ill: 18, against: 18, them: 18, thought: 18, much: 18, there: 18, hast: 17, therefore: 17, days: 17, sight: 17, an: 17, hand: 17, both: 17, know: 17, name: 17, other: 16, muse: 16, mind: 16, time's: 16, dead: 16, out: 16, far: 16, find: 16, had: 16, tell: 15, poor: 15, good: 15, we: 15, up: 15, each: 15, youth: 15, men: 15, before: 15, verse: 15, come: 15, age: 15, never: 15, think: 15, death: 15, things: 14, wilt: 14, till: 14, gentle: 14, state: 14, lie: 13, take: 13, black: 13, friend: 13, prove: 13, use: 13, whilst: 13, hate: 13, heaven: 13, proud: 13, many: 13, hold: 13, mayst: 12, none: 12, bear: 12, lies: 12, whom: 12, change: 12, die: 12, first: 12, thing: 12, making: 12, full: 12, hours: 12, looks: 12, woe: 12, rich: 11, earth: 11, summer's: 11, seem: 11, pleasure: 11, shalt: 11, yourself: 11, grace: 11, sun: 11, pride: 11, bright: 11, desire: 11, tongue: 11, knows: 11, ever: 11, others: 11, seen: 11, long: 11, happy: 11, kind: 11, form: 11, within: 11, any: 11, 'will': 11, another: 10, self: 10, nature: 10, deeds: 10, great: 10, soul: 10, pen: 10, leave: 10, again: 10, glass: 10, after: 10, right: 10, could: 10, shame: 10, write: 10, words: 10, fire: 10, end: 10, 'tis: 10, once: 10, call: 10, spirit: 10, place: 10,

Here is the script. It requires the nokogiri gem.
The text is read directly from the site mentioned above.
Words are counted with a class called Word_count. The method that splits each poem into words is get_wd.

require 'open-uri'
require 'nokogiri'
require 'uri'

class String
  # Strip HTML tags; quoted attribute values may themselves contain ">"
  def remove_tag
    gsub(/<(".*?"|'.*?'|[^'"])*?>/, "")
  end
end

class Word_count
  W_MIN = 10    # minimum number of occurrences for a word to be printed
  def initialize
    @word = Hash.new(0)    # missing keys start at 0
  end

  def count(word)
    @word[word.to_sym] += 1
  end

  def output
    @word.sort {|(k1, v1), (k2, v2)| v2 <=> v1}.each do |x|
      next if x[1] < W_MIN     
      print "#{x[0]}: #{x[1]},  "
    end
  end
end

def geturl(url)
  url_ar = []
  # URI.open is needed here: since Ruby 3.0, Kernel#open no longer opens URLs
  Nokogiri::HTML(URI.open(url)).css('dl a').each do |node|
    url1 = node.attribute('href').value
    url_ar << URI::join(url, url1).to_s
  end
  url_ar
end

def get_wd(url)
  word_ar = []
  URI.open(url).each_line do |f|    # URI.open, not Kernel#open (Ruby 3.0+)
    f = f.remove_tag.chomp.downcase.gsub(/(,|\.|:|;|!|\?)/, "").gsub(/--/, " ")
    next if /sonnet/.match(f) or f.empty?
    f.split(/ /).each do |i|
      next if i.empty?
      word_ar << i
    end
  end
  word_ar
end


c = Word_count.new
geturl("http://shakespeare.mit.edu/Poetry/sonnets.html").each do |url|
  puts url
  get_wd(url).each {|word| c.count(word)}
  sleep(1)
end
print "\n----Result----\n"
c.output
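
If you want to try the tokenizing and counting steps without hitting the site, something like the following should work. It is only a rough sketch of the same idea as get_wd and Word_count, applied to an in-memory string. The names tokenize, count_words, SAMPLE and RESULT are made up for this example and are not part of the script above, and it splits on any whitespace rather than on single spaces.

```ruby
# Offline sketch: lowercase a line, drop the same punctuation as get_wd,
# turn "--" into a space, and split into words.
# tokenize and count_words are hypothetical helper names.
def tokenize(line)
  line.downcase.gsub(/[,.:;!?]/, "").gsub(/--/, " ").split(" ")
end

# Count every word in text and keep those seen at least min_count times,
# sorted by descending count.
def count_words(text, min_count)
  counts = Hash.new(0)
  text.each_line do |line|
    tokenize(line.chomp).each { |w| counts[w] += 1 }
  end
  counts.select { |_, n| n >= min_count }.sort_by { |_, n| -n }
end

# First two lines of Sonnet 18, used as sample input.
SAMPLE = <<~TEXT
  Shall I compare thee to a summer's day?
  Thou art more lovely and more temperate:
TEXT

RESULT = count_words(SAMPLE, 2)
RESULT.each { |word, n| puts "#{word}: #{n}" }
# prints "more: 2"
```

Note that apostrophes are kept, which is why forms like love's and summer's show up as separate words in the results.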

For removing the HTML tags, I used this as a reference.
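
To see why the tag-removal pattern is a little more involved than it looks, here is a small comparison against a simpler pattern. This is just an illustration: the sample markup is invented (it is not taken from the sonnet pages), and NAIVE_RE is a pattern I added for contrast.

```ruby
# Same pattern as in remove_tag: a tag may contain quoted attribute values,
# and a ">" inside quotes does not end the tag.
TAG_RE = /<(".*?"|'.*?'|[^'"])*?>/
# Simpler pattern for comparison (assumption: not used in the script).
NAIVE_RE = /<[^>]*>/

# Hypothetical markup with a ">" inside an attribute value.
SAMPLE_HTML = %q{<a href="x>y">thou</a> art<br>}

CAREFUL = SAMPLE_HTML.gsub(TAG_RE, "")
NAIVE = SAMPLE_HTML.gsub(NAIVE_RE, "")

puts CAREFUL  # thou art
puts NAIVE    # y">thou art
```

The simpler pattern stops at the first ">" it sees, so part of the attribute leaks into the text, while the quoted-string alternatives skip over it.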