Codehappy.net

The Code Lair


Search the Ridiculously Huge English Wordlist

And a couple other really nice lists, too.


Match string: (searched against one of: Master RIDYHEW list, SOWPODS word list, ENABLE word list)
Match string format:

  • Letters a-z: Matches this character exactly, case insensitive
  • * (asterisk): Matches zero or more characters
  • ? (question mark): Matches exactly one character
  • ! (exclamation mark): Matches exactly one vowel (same as [aeiou])
  • # (pound sign): Matches exactly one consonant (same as [^aeiou])
  • [letters]: Matches any one of the letters inside the brackets
  • [^letters]: Matches any one character except the letters inside the brackets
  • & (ampersand): AND. In order to match, the word has to satisfy the query on the left and the query on the right.

Queries that match more than 1000 words return only the first 1000 matches.
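For the curious, here is a minimal sketch of how a matcher for the format above could work. It is written in Python for brevity and is not the code that runs this page; the function names compile_query and search are made up for the illustration, and the limit parameter just mirrors the 1000-result cap. It assumes each match string can be translated piece by piece into an ordinary regular expression.

```python
import re

def compile_query(query):
    """Turn a match string into a list of regexes, one per '&' part."""
    compiled = []
    for part in query.lower().split("&"):
        part = part.strip()
        pieces = []
        i = 0
        while i < len(part):
            ch = part[i]
            if ch == "*":
                pieces.append(".*")          # zero or more characters
            elif ch == "?":
                pieces.append(".")           # exactly one character
            elif ch == "!":
                pieces.append("[aeiou]")     # exactly one vowel
            elif ch == "#":
                pieces.append("[^aeiou]")    # exactly one consonant
            elif ch == "[":
                end = part.find("]", i)
                if end == -1:
                    raise ValueError("unclosed '[' in match string")
                body = part[i + 1:end]
                negate = body.startswith("^")
                letters = body[1:] if negate else body
                if not letters or not all("a" <= c <= "z" for c in letters):
                    raise ValueError("brackets must contain letters a-z")
                pieces.append("[" + ("^" if negate else "") + letters + "]")
                i = end                      # skip past the closing bracket
            elif "a" <= ch <= "z":
                pieces.append(ch)            # literal letter
            else:
                raise ValueError("unexpected character: " + repr(ch))
            i += 1
        compiled.append(re.compile("".join(pieces)))
    return compiled

def search(query, words, limit=1000):
    """Return the words matching every '&' part, stopping at `limit` matches."""
    regexes = compile_query(query)
    results = []
    for word in words:
        # fullmatch anchors the pattern to the whole word
        if all(r.fullmatch(word.lower()) for r in regexes):
            results.append(word)
            if len(results) >= limit:
                break
    return results
```

With this sketch, search("s*&*ing", ["singing", "ringing", "sang"]) returns ["singing"], since only that word both starts with s and ends with ing.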

(Current RIDYHEW version: Web Release 1)

View current statistics on the RIDYHEW wordlist.

RIDYHEW!

RIDYHEW. The RIDiculouslY Huge English Wordlist. You think you know obscure English words? There are some in here that you've never heard of, for sure. Extracted from gigabytes of documents from around the World Wide Web. If you say RIDYHEW quickly, it sounds like "really damn huge". And that's what it is.

So, uh, why?

Why does the sun shine? Why do the birds sing? Why do the voices keep speaking in my head?

How many words are in RIDYHEW?

Right now, about 450,000. There are still a lot more words to find, though. Must collect more words! More words!

Seriously? For curiosity's sake, for playing word games and working crosswords, for reveling in the beautiful miracle of language.... if you have to ask.

Another good reason for the query tool is that it helps to find errors (most of them introduced through automation). I have several queued up to be fixed in the next Web release.

What words are included in RIDYHEW? What words are excluded?

Words which are eligible for inclusion in RIDYHEW:
  • All common, uncapitalized nouns with plurals.
  • All verbs with all legal inflections (including archaic inflections, in the case of verbs current before the 18th century.)
  • All adjectives with comparative and superlative (except in a case where the comparative or superlative obviously does not apply.)
  • All other parts of speech, with a minimum length of two letters.
  • All spelling variants that can be well attested to in modern literature, including British and U.S. variants.
  • Foreign words, if they have found widespread use in English. These words are usually included with their original plurals/inflections along with the Anglicized ones.
  • Compound words, if they have found demonstrated use without a hyphen.
  • Metric units with all legal prefixes and plurals.
Words which are not eligible for inclusion in RIDYHEW:
  • Words that are generally found only capitalized: proper names, trademarks, etc.
  • Contracted words, which are generally only found with an apostrophe.
  • Compound words that are generally only found with a hyphen.
  • Words of an over-specialized nature, for example, jargon from a small profession or trade.
  • Acronyms, except in the case where they have lost capitalization and are used as ordinary words with parts of speech (snafu and awol, for example).
  • "Nonce" words, or words made up for an occasion which appear only a handful of times in literature.
  • Foreign words which are rarely found in English literature or conversation.
  • Medical terms found only in Latin (with some exceptions.)
Note that this list is not expurgated beyond the above, which means it contains some pretty filthy and offensive words. Keep that in mind when you form your queries.

What are the other lists up there?

ENABLE and SOWPODS are two excellent general purpose word lists especially designed for word game players. They are roughly equivalent to the official SCRABBLE(TM) tournament dictionaries. ENABLE is for US players, while SOWPODS is based on the UK/Commonwealth dictionary.

They're public domain, free for download and use in any way you like.

Really nice. How about RIDYHEW?

Go ahead and use it in your commercial or non-commercial programming projects. All I ask is that you redistribute the original documentation with RIDYHEW.

How do you make the list?

First off, I used whatever public-domain word lists I could find on the Internet to form a base. Notice that a lot of these lists didn't fit the criteria I lay out above, so they required extensive editing before I could add them to RIDYHEW. If you've never gone over 100,000+ words one by one looking for spelling errors and the like, well, you've simply never lived.

After that, I downloaded as many large text files, stories, Usenet postings, web pages, and Word documents as I could find (gigabytes of literature), extracted all of the words out of them, compared them to RIDYHEW, and spat out whatever wasn't found.

From there, I used automated tools to generate plurals for the nouns, inflections for the verbs, comparative/superlative for the adjectives, etc. The words that were nonsense, misspelt, or otherwise ineligible for the RIDYHEW list I saved in a filter list, so I would never see them again in the output lists I edited.
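The compare-and-filter step described above can be sketched roughly as follows. This is an illustration in Python, not the actual tools (those are C/C++); the name extract_candidates and the tokenizing regular expression are assumptions made for the example. It pulls tokens out of a text, skips contracted and hyphenated forms and one-letter words, and reports only the words found in neither the main list nor the filter list.

```python
import re

# A token is a run of letters, optionally joined by apostrophes or hyphens,
# so that "didn't" and "out-fox" stay whole and can be skipped as a unit.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def extract_candidates(text, known_words, never_words):
    """Return the sorted set of words in `text` that appear in neither list."""
    candidates = set()
    for token in TOKEN_RE.findall(text):
        if "'" in token or "-" in token:
            continue                      # contractions and hyphenated compounds
        word = token.lower()
        if len(word) < 2:
            continue                      # minimum length of two letters
        if word in known_words or word in never_words:
            continue                      # already accepted, or already rejected
        candidates.add(word)
    return sorted(candidates)
```

Every word that comes out of a pass like this still needs a human decision: it goes either into the main list or into the filter list, so it never shows up in the output again.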

Words used in the works of Shakespeare, Bacon, Milton, Dickens, Twain, Middleton, Alger, and many many others were reviewed in compiling the lists, thanks to Project Gutenberg (they rock!) Frequencygrams were created for unusual, uncommon and obsolete words used by the authors; these were used to eliminate obvious nonce words.
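As a rough idea of what such a frequency count looks like, here is a small Python sketch. It is only an illustration: the function name rare_words and the min_count threshold are hypothetical, and the real cutoff for calling something a nonce word was a matter of editorial judgment, not a fixed number.

```python
import re
from collections import Counter

def rare_words(texts, min_count=3):
    """Count word occurrences across all texts; return words seen fewer than min_count times."""
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in re.findall(r"[A-Za-z]+", text))
    # Words below the threshold are nonce-word suspects that need a closer look.
    return sorted(w for w, n in counts.items() if n < min_count)
```

A word that turns up only once or twice across a whole shelf of authors is a good candidate for a nonce word, while one that recurs in several works probably deserves a place in the list.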

I'm actually still going through these documents, and adding words to the list. I'm also actively weeding out errors (there are some) from the RIDYHEW list.

So this list is being actively maintained?

Yes, it is, which makes it the longest such list available on the Internet. (Actually it's one of the longest word lists, period, on the Internet.)

I'm adding new words from a large number of sources, and removing the errors that have crept in. At the moment less than one-half of one percent is bad, which is a lot better than any other list you'll find this size. A lot more than 0.5% of the words are obscure, though, so be warned. (What, you thought a list with half a million different words wouldn't have some odd-balls? Hee-hee.)

If you find an error (either a bad word or a missing word), please report it so it can be fixed in the next Web Release. There is a thread in our forums devoted to reporting RIDYHEW errors. But please be sure first that the bad word is actually bad: which of the criteria above does it fail to meet?

Any other lists available?

You betcha. Along with RIDYHEW, I have lists of given names (masculine and feminine), lists of surnames, and a long list of "never" words.

"Never" words?

Yep, "never" words. When I check a text document against the word dictionary, I also check it against a list full of junk that I "never" want to see again in an output list.

What you'd find in the "never" list:

  • A very large number of names (given names and surnames, from English and many other languages)
  • Trademarks, of every different kind of thing you can imagine
  • Names of cities, countries, rivers, counties, provinces, astronomical landmarks, etc., etc.
  • Proper nouns of miscellaneous classification
  • All common misspellings of English words
  • HTML tags, Javascript tokens, XML tags, etc.
  • Keywords and common function calls in various programming languages
  • Nonce and non-words
  • Words without parts of speech
  • Thousands upon thousands of acronyms
  • Excessively technical or jargony words
  • Foreign words of every sort and description
  • File extensions, user names, handles, words run together, etc.
  • Any manner of assorted weird junk!
Compiled from gigabytes upon gigabytes of text files, Usenet messages, web pages, etc.

OK, OK, so where do I get the list?

The package containing just the RIDYHEW list, the name lists, and this documentation is available here. The package containing the NEVER list is available here. The package containing the source code for the tools used (C/C++ source) is available here.


visits since 3:33 PM PST 11 Jan 2003