One of my ongoing projects is tired.com (as described in this Slate article by Paul Boutin). Since I have a fairly large corpus (to use the linguistic geek term) to play with, I occasionally do a little analysis on it. Here are the top 250 words used by the 6,000 tired.com authors in 2004, in frequency order. You can probably figure out the gist of many of the letters from this list:
1. tired 2. i 3. and 4. the 5. of 6. a 7. my 8. am 9. it 10. in 11. you 12. because 13. is 14. that 15. for 16. this 17. have 18. me 19. not 20. so 21. at 22. up 23. im 24. sleep 25. all 26. do 27. on 28. with 29. but 30. or 31. just 32. get 33. no 34. be 35. why 36. are 37. was 38. can 39. work 40. like 41. go 42. if 43. out 44. about 45. night 46. may 47. time 48. what 49. now 50. don
51. we 52. day 53. mail 54. really 55. know 56. your 57. too 58. they 59. people 60. any 61. as 62. had 63. by 64. then 65. much 66. life 67. want 68. when 69. been 70. who 71. being 72. he 73. e 74. an 75. one 76. she 77. school 78. more 79. there 80. her 81. will 82. part 83. its 84. hours 85. think 86. only 87. last 88. would 89. got 90. has 91. dont 92. well 93. some 94. going 95. new 96. email 97. how 98. back 99. even 100. please
101. us 102. good 103. ve 104. u 105. enough 106. other 107. very 108. love 109. free 110. bed 111. which 112. them 113. feel 114. home 115. late 116. need 117. our 118. every 119. way 120. job 121. never 122. here 123. things 124. make 125. bored 126. still 127. their 128. morning 129. always 130. could 131. also 132. than 133. today 134. d 135. right 136. information 137. over 138. help 139. old 140. off 141. intended 142. after 143. around 144. take 145. image 146. friends 147. gif 148. site 149. tell 150. having
151. went 152. getting 153. see 154. org 155. should 156. cause 157. two 158. ll 159. long 160. something 161. his 162. years 163. week 164. him 165. use 166. year 167. into 168. little 169. recipient 170. thanks 171. didn 172. unable 173. print 174. myself 175. doing 176. find 177. nothing 178. working 179. hard 180. read 181. maybe 182. where 183. again 184. makes 185. cuz 186. world 187. anything 188. early 189. until 190. house 191. money 192. cant 193. did 194. these 195. wake 196. down 197. best 198. days 199. everything 200. trying
201. friend 202. lot 203. many 204. computer 205. live 206. ever 207. thing 208. before 209. stupid 210. better 211. most 212. say 213. those 214. same 215. yes 216. confidential 217. family 218. keep 219. since 220. person 221. care 222. ru 223. thank 224. college 225. come 226. sick 227. stay 228. virus 229. next 230. shit 231. bad 232. does 233. fucking 234. website 235. were 236. while 237. months 238. hate 239. though 240. able 241. oh 242. hour 243. hi 244. thats 245. girl 246. reason 247. web 248. let 249. first 250. kids
Whoops, I thought I had taken out most of the non-content words, but it looks like "confidential", "recipient" and "information" slipped through. These usually come from disclaimers like this one, which makes them especially amusing:
CONFIDENTIALITY NOTICE: This electronic message transmission is intended only for the person or the entity to which it is addressed and may contain information that is privileged, confidential or otherwise protected from disclosure. If you have received this transmission, but are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or use of the contents of this information is strictly prohibited. If you have received this e-mail in error, please contact the sender of the e-mail and destroy the original message and all copies.
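For anyone who wants to try a count like the one above, here is a minimal Python sketch. It is not the script I actually used. The function names, the sample letters, and the idea of cutting each letter off at a "CONFIDENTIALITY NOTICE" marker are all assumptions made for illustration, and real disclaimers vary far too much for one marker to catch them all. A tokenizer of this kind splits "don't" into "don" and "t", which is one possible source of fragments such as "don", "ve" and "ll" in a list like this.

```python
import re
from collections import Counter

# Hypothetical marker: real disclaimers vary widely, so this is only illustrative.
DISCLAIMER_MARKER = "CONFIDENTIALITY NOTICE"


def strip_disclaimer(text):
    """Drop everything from the disclaimer marker onward, if it is present."""
    index = text.find(DISCLAIMER_MARKER)
    return text if index == -1 else text[:index]


def tokenize(text):
    """Lowercase the text and keep only runs of the letters a-z.

    Apostrophes act as separators, so "don't" becomes "don" and "t".
    """
    return re.findall(r"[a-z]+", text.lower())


def top_words(letters, n=250):
    """Return the n most frequent words across all letters, most common first."""
    counts = Counter()
    for letter in letters:
        counts.update(tokenize(strip_disclaimer(letter)))
    return [word for word, _ in counts.most_common(n)]


# Made-up sample letters, not taken from the real corpus.
sample_letters = [
    "I am tired because I don't sleep.",
    "Tired, tired, tired. CONFIDENTIALITY NOTICE: intended recipient only.",
]
print(top_words(sample_letters, n=2))
```

With the disclaimer cut off before counting, words like "recipient" never reach the tally in the first place.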
There are two dimensions along which the differences can be compared: word frequency and popularity. The latter is easier to compare, since it just means alphabetizing the top 250 words in the emails and comparing them to the top 250 words in English (according to a list) with the Unix 'comm' command (which is SO much better than 'diff' for this; I can't believe I never knew about it). They're pretty different, actually. The words that appear only in the tired email are as follows:
able
always
am
anything
around
bad
because
bed
being
best
better
bored
cant
care
college
computer
confidential
cuz
d
days
didn
doing
don
dont
e
early
email
enough
every
everything
family
feel
first
free
friend
friends
fucking
getting
gif
girl
going
got
hate
having
hi
hour
hours
i
im
image
information
intended
into
its
job
kids
ll
lot
love
mail
makes
maybe
money
months
morning
myself
next
not
nothing
oh
org
person
please
print
really
reason
recipient
ru
shit
sick
since
site
sleep
something
stay
stupid
thank
thanks
thats
thing
things
those
tired
today
trying
u
unable
until
ve
virus
wake
web
website
week
working
years
yes
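The same comparison can be sketched in Python for anyone without 'comm' handy. Given two sorted files, 'comm' prints three columns: lines only in the first file, lines only in the second, and lines in both ('comm -23' keeps just the first column). The function name and the tiny word lists below are made up for illustration; they are not the real top-250 lists.

```python
def compare_word_lists(list_a, list_b):
    """Return (only_in_a, only_in_b, in_both), each sorted alphabetically.

    This mirrors the three columns that `comm` prints for two sorted files.
    """
    set_a, set_b = set(list_a), set(list_b)
    only_a = sorted(set_a - set_b)
    only_b = sorted(set_b - set_a)
    both = sorted(set_a & set_b)
    return only_a, only_b, both


# Tiny made-up samples standing in for the two top-250 lists.
tired_words = ["tired", "i", "and", "the", "sleep", "bed"]
english_words = ["the", "of", "and", "to", "a", "i"]

only_tired, only_english, shared = compare_word_lists(tired_words, english_words)
print(only_tired)    # ['bed', 'sleep', 'tired']
print(only_english)  # ['a', 'of', 'to']
print(shared)        # ['and', 'i', 'the']
```

Unlike 'comm', this version does not need its inputs sorted beforehand, since sets ignore order.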
It'd be interesting to compare the order of this list with the order of the same words in the English language at large (from wordcount.org). Obviously "tired" would leap out ahead, but what else?
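One rough way to do that comparison, sketched in Python: for every word that appears in both rankings, subtract its rank in the corpus from its rank in general English, then sort by the size of the jump. The function name and both sample rankings here are invented for illustration; no real wordcount.org data is used.

```python
def rank_shifts(corpus_ranked, reference_ranked):
    """Compare the rank of each shared word in two frequency-ordered lists.

    Returns tuples of (word, corpus_rank, reference_rank, shift), where
    shift = reference_rank - corpus_rank. A positive shift means the word
    ranks higher in the corpus than in the reference list. Results are
    sorted with the biggest climbers first.
    """
    reference_rank = {
        word: rank for rank, word in enumerate(reference_ranked, start=1)
    }
    shifts = []
    for corpus_rank, word in enumerate(corpus_ranked, start=1):
        if word in reference_rank:
            ref = reference_rank[word]
            shifts.append((word, corpus_rank, ref, ref - corpus_rank))
    return sorted(shifts, key=lambda row: row[3], reverse=True)


# Made-up rankings, most frequent word first.
corpus = ["tired", "i", "and", "the"]
reference = ["the", "and", "i", "tired"]

for word, corpus_rank, ref_rank, shift in rank_shifts(corpus, reference):
    print(word, corpus_rank, ref_rank, shift)
```

Words missing from the reference list are simply skipped, so the words that only show up in the tired email would need separate handling.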