The Code Lair
Search the Ridiculously Huge English Wordlist
And a couple other really nice lists, too.
RIDYHEW!

RIDYHEW. The RIDiculouslY Huge English Wordlist. You think you know obscure English words? There are some in here that you've never heard of, for sure. Extracted from gigabytes of documents from around the World Wide Web. If you say RIDYHEW quickly, it sounds like "really damn huge". And that's what it is.

So, uh, why?
Why does the sun shine? Why do the birds sing? Why do the voices keep speaking in my head?

How many words are in RIDYHEW?
Right now, about 450,000. There are still a lot more words to find, though. Must collect more words! More words!

Seriously?
For curiosity's sake, for playing word games and working crosswords, for reveling in the beautiful miracle of language... if you have to ask. Another good reason for the search is that it helps to find errors (most of them introduced through automation). I have several queued up to be fixed for the next Web release.

What words are included in RIDYHEW? What words are excluded?
Words which are eligible for inclusion in RIDYHEW:
What are the other lists up there?
ENABLE and SOWPODS are two excellent general-purpose word lists especially designed for word game players. They are roughly equivalent to the official SCRABBLE™ tournament dictionaries. ENABLE is for US players, while SOWPODS is based on the UK/Commonwealth dictionary. They're public domain, free for download and use in any way you like. Really nice.

How about RIDYHEW?
Go ahead and use it in your commercial or non-commercial programming projects. All I ask is that you redistribute the original documentation with RIDYHEW.

How do you make the list?
First off, I used whatever public-domain word lists I could find on the Internet to form a base. Note that a lot of these lists didn't fit the criteria I lay out above, so they required extensive editing before I could add them to RIDYHEW. If you've never gone over 100,000+ words one by one looking for spelling errors and the like, well, you've simply never lived.

After that, I downloaded as many large text files, stories, Usenet postings, web pages, and Word documents as I could find (gigabytes of literature), extracted all of the words from them, compared them to RIDYHEW, and spat out whatever wasn't found. From there, I used automated tools to generate plurals for the nouns, inflections for the verbs, comparatives and superlatives for the adjectives, etc. The words that were nonsense, misspelt, or otherwise ineligible for the RIDYHEW list I saved in a filter list, so I would never see them again in the output lists I edited.

Words used in the works of Shakespeare, Bacon, Milton, Dickens, Twain, Middleton, Alger, and many, many others were reviewed in compiling the lists, thanks to Project Gutenberg (they rock!). Frequencygrams were created for unusual, uncommon, and obsolete words used by the authors; these were used to eliminate obvious nonce words. I'm actually still going through these documents and adding words to the list.
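The extract, compare, and filter steps described above can be sketched roughly as follows. This is a minimal Python illustration and not the actual tooling (which is distributed as C/C++ source). The names `extract_words` and `unknown_words`, the tokenizing rule, and the tiny sample lists are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed tokenizing rule: runs of letters, optionally joined by
# internal apostrophes or hyphens. The real tools may differ.
WORD_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def extract_words(text):
    """Return every lowercase word-like token found in the text."""
    return WORD_RE.findall(text.lower())


def unknown_words(text, wordlist, never):
    """Count tokens found in neither the word list nor the filter list."""
    counts = Counter(extract_words(text))
    return Counter({
        word: n for word, n in counts.items()
        if word not in wordlist and word not in never
    })


# Tiny stand-in data; real lists would be loaded from files.
wordlist = {"the", "quick", "brown", "fox", "and", "a"}
never = {"teh", "qwerty"}  # junk that should not show up again
sample = "Teh quick brown fox and the zyzzyva; a zyzzyva, qwerty, and a floof."

candidates = unknown_words(sample, wordlist, never)
for word, count in candidates.most_common():
    print(word, count)
```

Each surviving candidate would then be reviewed by hand and added either to the word list or to the filter list. The counts hint at one way a frequency table might help with that review: a word seen only once across a large body of text is a likelier nonce word or typo than one that keeps turning up. How the frequencygrams were actually built is not stated here.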
I'm also actively weeding out errors (there are some) from the RIDYHEW list.

So this list is being actively maintained?
Yes, it is, which makes it the longest such list available on the Internet. (Actually it's one of the longest word lists, period, on the Internet.) I'm adding new words from a large number of sources, and removing the errors that have crept in. At the moment less than one-half of one percent is bad, which is a lot better than any other list you'll find this size. A lot more than 0.5% of the words are obscure, though, so be warned. (What, you thought a list with half a million different words wouldn't have some oddballs? Hee-hee.)

If you find an error (either a bad word or a missing word), please report it so it can be fixed in the next Web release. There is a thread in our forums devoted to reporting RIDYHEW errors. But please be sure first that the bad word is actually bad: which of the criteria above does it fail to meet?

Any other lists available?
You betcha. Along with RIDYHEW, I have lists of given names (masculine and feminine), lists of surnames, and a long list of "never" words.

"Never" words?
Yep, "never" words. When I check a text document against the word dictionary, I also check it against a list full of junk that I "never" want to see again in an output list.

What you'd find in the "never" list:
OK, OK, so where do I get the list?
The package containing just the RIDYHEW list, the name lists, and this documentation is available here. The package containing the NEVER list is available here. The package containing the source code for the tools used (C/C++ source) is available here.