WordNet is a semantic
lexicon for
the English language. It groups
English words into sets of
synonyms called
synsets, provides
short, general definitions, and
records the various semantic
relations between these synonym
sets. The purpose is twofold: to
produce a combination of
dictionary and thesaurus that is
more intuitively usable, and to
support automatic text analysis
and artificial intelligence
applications. The database and
software tools have been released
under a BSD-style license and can
be downloaded and used freely. The
database can also be browsed
online.
WordNet was created and is being
maintained at the Cognitive
Science Laboratory of Princeton
University under the direction of
psychology professor George A.
Miller. Development began in 1985.
Over the years, the project
received about $3 million of
funding, mainly from government
agencies interested in machine
translation.
Database contents
As of 2006, the database contains
about 150,000 words organized in
over 115,000 synsets for a total
of 207,000 word-sense pairs; in
compressed form, it is about 12
megabytes in size.
WordNet distinguishes between
nouns, verbs, adjectives and
adverbs because they follow
different grammatical rules. Every
synset contains a group of
synonymous words or collocations
(a collocation is a sequence of
words that go together to form a
specific meaning, such as "car
pool"); different senses of a word
are in different synsets. The
meaning of the synsets is further
clarified with short defining
glosses (definitions and/or
example sentences). A typical
example synset with gloss is:
good, right, ripe -- (most
suitable or right for a particular
purpose; "a good time to plant
tomatoes"; "the right time to
act"; "the time is ripe for great
sociological changes")
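As a rough illustration of the structure just described, a synset can be modelled as a small record holding its words, its gloss definition and its example sentences. This is only a sketch: the class name Synset, its field names, the part-of-speech code "a" and the render helper are assumptions made for this example, not WordNet's actual storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    """One sense: a group of synonymous words plus a defining gloss."""
    pos: str               # part of speech (hypothetical one-letter code)
    words: tuple           # synonymous words or collocations
    definition: str        # short defining gloss
    examples: tuple = ()   # optional example sentences


def render(synset):
    """Format a synset in the 'words -- (gloss)' style shown above."""
    parts = [synset.definition] + ['"%s"' % e for e in synset.examples]
    return "%s -- (%s)" % (", ".join(synset.words), "; ".join(parts))


good = Synset(
    pos="a",
    words=("good", "right", "ripe"),
    definition="most suitable or right for a particular purpose",
    examples=(
        "a good time to plant tomatoes",
        "the right time to act",
        "the time is ripe for great sociological changes",
    ),
)

print(render(good))
```

A word with several senses would simply appear in the words tuple of several such records.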
Most synsets are connected to
other synsets via a number of
semantic relations. These
relations vary based on the type
of word, and include:
- Nouns
  o hypernyms: Y is a hypernym of X if every X is a (kind of) Y
  o hyponyms: Y is a hyponym of X if every Y is a (kind of) X
  o coordinate terms: Y is a coordinate term of X if X and Y share a hypernym
  o holonym: Y is a holonym of X if X is a part of Y
  o meronym: Y is a meronym of X if Y is a part of X
- Verbs
  o hypernym: the verb Y is a hypernym of the verb X if the activity X is a (kind of) Y (travel to movement)
  o troponym: the verb Y is a troponym of the verb X if the activity Y is doing X in some manner (lisp to talk)
  o entailment: the verb Y is entailed by X if by doing X you must be doing Y (sleeping by snoring)
  o coordinate terms: those verbs sharing a common hypernym
- Adjectives
  o related nouns
  o participle of verb
- Adverbs
  o root adjectives
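The noun relations listed above come in inverse pairs, and coordinate terms can be derived from hypernyms. The sketch below shows this on a tiny hand-made graph. The entries (dog, wolf, paw, tail and so on) and all function names are illustrative assumptions, not data or code taken from WordNet.

```python
from collections import defaultdict

# Toy data, invented for this example.
# HYPERNYMS[x] holds every Y such that every x is a (kind of) Y.
HYPERNYMS = {"dog": {"canine"}, "wolf": {"canine"}, "canine": {"carnivore"}}
# HOLONYMS[x] holds every Y such that x is a part of Y.
HOLONYMS = {"paw": {"dog"}, "tail": {"dog"}}


def invert(relation):
    """Return the inverse relation: target -> set of sources."""
    inverse = defaultdict(set)
    for source, targets in relation.items():
        for target in targets:
            inverse[target].add(source)
    return dict(inverse)


HYPONYMS = invert(HYPERNYMS)   # hyponymy is the inverse of hypernymy
MERONYMS = invert(HOLONYMS)    # meronymy is the inverse of holonymy


def coordinate_terms(x):
    """Terms that share at least one hypernym with x (excluding x itself)."""
    siblings = set()
    for parent in HYPERNYMS.get(x, ()):
        siblings |= HYPONYMS[parent]
    return siblings - {x}


print(sorted(HYPONYMS["canine"]))
print(sorted(MERONYMS["dog"]))
print(sorted(coordinate_terms("dog")))
```

Storing only one direction of each pair and deriving the other keeps the two views consistent by construction.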
Semantic relations apply to all
members of a synset, because the
members share a meaning and are
mutually synonymous. Individual
words can also be connected to
other words through lexical
relations, including antonymy
(words that are opposites of each
other) and derivational
relatedness.
WordNet also provides the polysemy
count of a word: the number of
synsets that contain the word. If
a word participates in several
synsets (i.e. has several senses)
then typically some senses are
much more common than others.
WordNet quantifies this with a
frequency score: several sample
texts have had every word
semantically tagged with the
corresponding synset, and a count
is then provided indicating how
often a word appears in a specific
sense.
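Both figures can be sketched in a few lines: the polysemy count is the number of synsets containing a word, and the frequency score is a tally over sense-tagged text. The synset identifiers, the word lists and the tagged sample below are all hypothetical and chosen only to make the arithmetic visible.

```python
from collections import Counter

# Hypothetical sense inventory: synset id -> member words.
SYNSETS = {
    "bank.n.01": ["bank", "riverbank"],
    "bank.n.02": ["bank", "depository"],
    "bank.v.01": ["bank", "deposit"],
}

# Hypothetical sense-tagged sample text: (word, synset id) pairs.
TAGGED = [
    ("bank", "bank.n.02"),
    ("bank", "bank.n.02"),
    ("bank", "bank.n.01"),
    ("deposit", "bank.v.01"),
]


def polysemy_count(word):
    """Number of synsets that contain the word."""
    return sum(word in members for members in SYNSETS.values())


def sense_frequencies(word):
    """How often the word was tagged with each synset in the sample."""
    return Counter(sid for w, sid in TAGGED if w == word)


print(polysemy_count("bank"))
print(sense_frequencies("bank").most_common())
```

Here "bank" takes part in three synsets, while the tagged sample shows one of those senses occurring more often than the others.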
The morphology functions of the
software distributed with the
database try to deduce the lemma
or root form of a word from the
user's input; only the root form
is stored in the database unless
it has irregular inflected forms.
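One plausible way to deduce a root form, sketched below, is to consult a table of irregular forms first and then try stripping common suffixes until a stored root form is found. This is an assumption made for illustration: the rule list, the word lists and the function name lemma_of are invented here and do not reproduce the behaviour of the distributed software.

```python
# Hypothetical stored root forms and irregular inflected forms.
LEMMAS = {"dog", "church", "fly", "goose", "run"}
EXCEPTIONS = {"geese": "goose", "ran": "run"}

# (suffix, replacement) pairs, tried in order; illustrative only.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", "")]


def lemma_of(word):
    """Return the root form of word, or None if none can be deduced."""
    word = word.lower()
    if word in EXCEPTIONS:          # irregular forms are looked up directly
        return EXCEPTIONS[word]
    if word in LEMMAS:              # already a root form
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in LEMMAS:  # accept only known root forms
                return candidate
    return None


for form in ["dogs", "churches", "flies", "geese", "Ran"]:
    print(form, "->", lemma_of(form))
```

Checking each candidate against the stored root forms is what keeps a blind suffix rule from producing non-words.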
Knowledge structure
Both nouns and verbs are organized
into hierarchies, defined by
hypernym or IS A relationships.
For instance, the first sense of
the word dog has the following
hypernym hierarchy. The words at
the same level are synonyms of
each other: some sense of dog is
synonymous with some senses of
domestic dog and Canis familiaris,
and so on. Each set of synonyms
(synset) has a unique index, and
its members share its properties,
such as a gloss (dictionary
definition).
dog, domestic dog, Canis familiaris
  => canine, canid
    => carnivore
      => placental, placental mammal, eutherian, eutherian mammal
        => mammal
          => vertebrate, craniate
            => chordate
              => animal, animate being, beast, brute, creature, fauna
                => ...
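A hierarchy like this can be walked by following one hypernym link at a time. The sketch below stores the first word of each synset from the listing above in a plain dictionary; the listing is truncated after "animal", so the chain here stops there as well. The dictionary layout and the function name hypernym_chain are assumptions for this example.

```python
# One hypernym per entry, using the first word of each synset above.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
}


def hypernym_chain(synset):
    """List the synset followed by each successive hypernym."""
    chain = [synset]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


# Print the chain with one extra level of indentation per step.
for depth, name in enumerate(hypernym_chain("dog")):
    print("  " * depth + ("=> " if depth else "") + name)
```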
At the top level, these
hierarchies are organized into
base types: 25 primitive groups
for nouns and 15 for verbs. These
groups form lexicographic files at
a maintenance level. The
primitive groups are connected to
an abstract root node that has,
for some time, been assumed by
various applications that use
WordNet.
In the case of adjectives, the
organization is different. Two
opposite 'head' senses work as
binary poles, while 'satellite'
synonyms connect to each of the
heads via synonymy relations.
Thus, the hierarchies, and the
associated concept of
lexicographic files, do not apply
here in the same way as they do
for nouns and verbs.
The network of nouns is far deeper
than that of the other parts of
speech. Verbs have a far bushier
structure, and adjectives are
organized into many distinct
clusters. Adverbs are defined in
terms of the adjectives they are
derived from, and thus inherit
their structure from that of the
adjectives.
Psychological justification
The goal of WordNet was to develop
a system that would be consistent
with the knowledge acquired over
the years about how human beings
process language. Anomic aphasia,
for example, creates a condition
that seems to selectively encumber
individuals' ability to name
objects; this makes the decision
to partition the parts of speech
into distinct hierarchies more of
a principled decision than an
arbitrary one.
In the case of hyponymy,
psychological experiments revealed
that individuals can access
properties of nouns more quickly
depending on when a characteristic
becomes a defining property. That
is, individuals can quickly verify
that canaries can sing because a
canary is a songbird (only one
level of hyponymy), but require
slightly more time to verify that
canaries can fly (two levels of
hyponymy) and even more time to
verify that canaries have skin
(multiple levels of hyponymy).
This suggests that we too store
semantic information in a way that
is much like WordNet, because we
only retain the most specific
information needed to
differentiate one particular
concept from similar concepts.
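The canary example can be mimicked with a hierarchy in which each property is stored only at the most general concept it applies to, so that verifying a property means climbing until it is found. The sketch below is a toy model under that assumption. Placing "has skin" exactly three levels up is an arbitrary choice standing in for "multiple levels", and the names are invented for this example.

```python
# Toy hierarchy: each concept points to its hypernym.
IS_A = {"canary": "songbird", "songbird": "bird", "bird": "animal"}

# Each property is stored once, at the level where it is assumed to apply.
PROPERTIES = {
    "songbird": {"can sing"},
    "bird": {"can fly"},
    "animal": {"has skin"},
}


def levels_to_verify(concept, prop):
    """Count the hypernym steps needed to find prop, or None if absent."""
    levels = 0
    node = concept
    while node is not None:
        if prop in PROPERTIES.get(node, ()):
            return levels
        node = IS_A.get(node)   # climb one level of hyponymy
        levels += 1
    return None


for prop in ["can sing", "can fly", "has skin"]:
    print(prop, levels_to_verify("canary", prop))
```

The step count grows in the same order as the verification times reported above, which is the correspondence the paragraph describes.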
WordNet as an ontology
The hypernym / hyponym relationships
among the noun synsets can be
interpreted as specialization
relations between conceptual
categories. In other words,
WordNet can be interpreted and
used as a lexical ontology in the
computer science sense. However,
such an ontology should normally
be corrected before being used
since it contains hundreds of
basic semantic inconsistencies
such as (i) the existence of
common specializations for
exclusive categories and (ii)
redundancies in the specialization
hierarchy. Furthermore,
transforming WordNet into a
lexical ontology usable for
knowledge representation should
normally also involve (i)
distinguishing the specialization
relations into subtypeOf and
instanceOf relations, and (ii)
associating intuitive unique
identifiers with each category.
Although such corrections and
transformations have been
performed and documented as part
of the integration of WordNet 1.7
into the cooperatively updatable
knowledge base of WebKB-2, most
projects claiming to re-use
WordNet for knowledge-based
applications (typically,
knowledge-oriented information
retrieval) simply re-use it as
such.
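The two kinds of inconsistency named above can be checked mechanically once the specialization links are held as a graph. The sketch below flags (i) categories that specialize both members of a pair declared exclusive and (ii) direct links that are already implied by a longer path. The graph, the exclusive pair and the category name "hybrid" are all hypothetical, and this is not the procedure used for WebKB-2.

```python
# Hypothetical specialization graph: category -> direct generalizations.
PARENTS = {
    "dog": {"canine", "animal"},   # "animal" is also reachable via "canine"
    "canine": {"animal"},
    "hybrid": {"animal", "plant"}, # specializes two exclusive categories
    "animal": set(),
    "plant": set(),
}
EXCLUSIVE = [("animal", "plant")]  # pairs assumed to share no members


def ancestors(node):
    """All categories reachable by following generalization links."""
    seen, stack = set(), list(PARENTS.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen


def redundant_links():
    """Direct links (child, parent) already implied by another direct parent."""
    found = []
    for child, direct in PARENTS.items():
        for parent in direct:
            if any(parent in ancestors(other) for other in direct if other != parent):
                found.append((child, parent))
    return sorted(found)


def exclusive_violations():
    """Categories that specialize both members of an exclusive pair."""
    found = []
    for node in PARENTS:
        above = ancestors(node)
        for a, b in EXCLUSIVE:
            if a in above and b in above:
                found.append((node, a, b))
    return sorted(found)


print(redundant_links())
print(exclusive_violations())
```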
A prominent example of using
WordNet as it is, as an ontology,
is determining the similarity
between words. Various algorithms
have been proposed, and these
include considering the distance
between the conceptual categories
of these words, as well as
considering the hierarchical
structure of the WordNet ontology.
A number of these WordNet-based
word similarity algorithms are
implemented in a Perl package
called WordNet::Similarity.
See the related projects section
for more.
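A distance-based measure of the kind mentioned above can be sketched as follows: count the links on the shortest path between two categories through their nearest shared hypernym, then turn that distance into a score between 0 and 1. The formula 1 / (1 + distance) is one simple choice assumed here, not necessarily the one used by any particular package, and the toy hierarchy (including the cat and feline entries) is invented for the example.

```python
# Toy single-parent hierarchy, invented for this example.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
}


def path_to_root(node):
    """The node followed by its successive hypernyms."""
    path = [node]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def path_distance(a, b):
    """Links between a and b via their nearest shared hypernym, or None."""
    depth_from_a = {name: i for i, name in enumerate(path_to_root(a))}
    for j, name in enumerate(path_to_root(b)):
        if name in depth_from_a:
            return depth_from_a[name] + j
    return None


def path_similarity(a, b):
    """Map a path distance to a score in (0, 1]; identical terms score 1."""
    distance = path_distance(a, b)
    return None if distance is None else 1.0 / (1 + distance)


print(path_distance("dog", "wolf"), path_distance("dog", "cat"))
print(path_similarity("dog", "wolf") > path_similarity("dog", "cat"))
```

Measures that also consider the hierarchical structure typically weight this distance by how deep the shared hypernym lies, which the sketch leaves out.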
Limitations
Unlike other dictionaries, WordNet
does not include information about
etymology, pronunciation, or the
forms of irregular verbs, and it
contains only limited information
about usage.
The actual lexicographical and
semantic information is
maintained in lexicographer files,
which are then processed by a tool
called grind to produce the
distributed database. Both grind
and the lexicographer files are
freely available in a separate
distribution, but modifying and
maintaining the database requires
expertise.
Though WordNet contains a
sufficiently wide range of common
words, it does not cover special
domain vocabulary. Since it is
primarily designed to act as an
underlying database for different
applications, those applications
cannot be used in specific domains
that are not covered by WordNet.
Interfaces
The Jawbone project provides a
Java API to the WordNet 2.1 and
3.0 data. The source code is
released under the MIT license.
The Natural Language Toolkit
provides a Python API to
WordNet 3.0.
Related projects
A project at Brown University
started by Jeff Stibel, James A.
Anderson, Steve Reiss and others
called Applied Cognition Lab
created a disambiguator using
WordNet in 1998.[2] The project
later morphed into a company
called Simpli, which is now owned
by ValueClick. George Miller
joined the company as a member of
the Advisory Board. Simpli built
an Internet search engine that
utilized a knowledgebase
principally based on WordNet to
disambiguate and expand keywords
and synsets to help retrieve
information online. WordNet was
expanded upon to add increased
dimensionality, such as
intentionality (used for x),
people (Britney Spears) and
colloquial terminology more
relevant to Internet search (e.g.,
blogging, ecommerce). Neural
network algorithms searched the
expanded WordNet for related terms
to disambiguate search keywords
(Java, in the sense of coffee) and
expand the search synset (Coffee,
Drink, Joe) to improve search
engine results.[3] Before the
company was acquired, it performed
searches across search engines
such as Google, Yahoo!, Ask.com
and others.[4]
The project EuroWordNet has
produced WordNets for several
European languages and linked them
together; these are not freely
available, however. The Global
Wordnet project attempts to
coordinate the production and
linking of wordnets for all
languages. Oxford University
Press, the publisher of the
Oxford English Dictionary, has
voiced plans to produce its own
online WordNet.
The eXtended WordNet is a project
at the University of Texas at
Dallas which aims to improve
WordNet by semantically parsing
the glosses, thus making the
information contained in these
definitions available for
automatic knowledge processing
systems. It is also freely
available under a license similar
to WordNet's.
The GCIDE project produces a
dictionary by combining a public
domain Webster's Dictionary from
1913 with some WordNet definitions
and material provided by
volunteers. It is released under
the copyleft license GPL.
WordNet is also commonly re-used
via mappings between the WordNet
categories and the categories from
other ontologies. Most often, only
the top-level categories of
WordNet are mapped. However, the
authors of the SUMO ontology have
produced a mapping between all of
the WordNet synsets (including
nouns, verbs, adjectives and
adverbs) and SUMO classes. The
most recent addition to the
mappings provides links to all of
the more specific terms in the
Mid-Level Ontology (MILO), which extends
SUMO. The OpenCyc upper ontology
is also linked to some of WordNet.
In most works that claim to have
integrated WordNet into other
ontologies, the content of WordNet
has not simply been corrected when
semantic problems have been
encountered; instead, WordNet has
been used as an inspiration source
but heavily re-interpreted and
updated whenever suitable. This
was the case when, for example,
the top-level ontology of WordNet
was re-structured according to the
OntoClean-based approach or when
WordNet was used as a primary
source for constructing the lower
classes of the SENSUS ontology.
FrameNet is a project similar to
WordNet. It consists of a lexicon
which is based on annotating over
100,000 sentences with their
semantic properties. The unit in
focus is the lexical frame, a type
of state or event together with
the properties associated with it.
An independent project titled
wordNet, with an initial lowercase
w, is an ongoing project to link
words and phrases via a custom Web
crawler.