
Active Vocabulary

Mark Liberman discusses how hard it is to count a language's active vocabulary:

Consider the counting problem with respect to the text of your question. Your note uses the strings language, languages, language's. The word-count tool in MS Word will (sensibly enough) count each of these as one "word". But how many different vocabulary items -- word types -- are they? Are these three items, just as written? Or should we count the noun language plus the plural marker -s and the possessive 's? Or should we just count one item language, which happens to occur in three forms?

Your question also includes the strings am, are, be, is, was -- are these five distinct vocabulary items, or five forms of the one verb be? How about the strings weeks, weekly, day, daily? Is weekly the same vocabulary item as an adjective ("on a weekly basis") and an adverb ("published weekly")? If we analyze weekly as week + -ly and significantly as significant + -ly, are those (sometimes or always) the same -ly?

What about the noun use (in "daily use") and the participle used ("used on a daily basis")? Are those different words, or different forms of the same word? Is the participle used the same item, as a whole or in parts, as the preterite used?

Should we unpack 90% as "ninety percent" (two words) or "ninety per cent" (three words)? And is percentage a completely different vocabulary item, or is it percent (or per + cent) + -age?

Depending on the answers to these five easy questions about 17 character strings, we might count as many as 18 vocabulary items or as few as 10. And as we scan more text, this spread will grow, without any obvious bounds.
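To make that spread concrete, here is a rough Python sketch (mine, not from Liberman's article) that counts the same strings under two different policies. The sample uses 14 of the strings mentioned in the quote. The `LEMMAS` table is a hypothetical, hand-written mapping that stands in for one possible set of answers to his questions; it is not a real lemmatizer, and other answers would give other counts.

```python
import re

# Hypothetical, hand-written table mapping inflected forms to a headword.
# This is an illustrative assumption, not a real lemmatizer: it encodes
# one possible set of answers to the questions in the quote.
LEMMAS = {
    "languages": "language",
    "language's": "language",
    "am": "be", "are": "be", "is": "be", "was": "be",
    "weeks": "week",
    "used": "use",
}

def tokenize(text):
    """Lowercase the text and split it into word-like strings, keeping 's."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def count_types(tokens, lemmas=None):
    """Count distinct items; with a lemma table, collapse mapped forms first."""
    lemmas = lemmas or {}
    return len({lemmas.get(t, t) for t in tokens})

sample = ("language languages language's am are be is was "
          "weeks weekly day daily use used")
tokens = tokenize(sample)

print(len(tokens))                  # 14 running words (tokens)
print(count_types(tokens))          # 14: every distinct string is its own item
print(count_types(tokens, LEMMAS))  # 7: inflected forms share a headword
```

Even this toy table leaves the harder cases open: it treats weekly and daily as separate items from week and day, and it has nothing to say about whether the -ly in weekly is the same as the one in significantly.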

The whole article is worth a look.

About this Entry

This page contains a single entry by jt published on March 31, 2008 8:44 AM.
