<HTML> |
|
<HEAD> |
|
<TITLE>WNDB(5WN) manual page</TITLE> |
|
</HEAD> |
|
<BODY> |
|
<A HREF="#toc">Table of Contents</A><P> |
|
|
|
<H2><A NAME="sect0" HREF="#toc0">NAME </A></H2> |
|
index.noun, data.noun, index.verb, data.verb, index.adj, data.adj, index.adv, |
|
data.adv - WordNet database files <P> |
|
noun.exc, verb.exc, adj.exc, adv.exc - morphology
exception lists <P>
|
sentidx.vrb, sents.vrb - files used by search code to display |
|
sentences illustrating the use of some specific verbs |
|
<H2><A NAME="sect1" HREF="#toc1">DESCRIPTION </A></H2> |
|
For |
|
each syntactic category, two files are needed to represent the contents |
|
of the WordNet database - <B>index.</B><I>pos</I> and <B>data.</B><I>pos</I>, where <I>pos</I> is one of <B>noun</B>,
<B>verb</B>, <B>adj</B> or <B>adv</B>. The other auxiliary files are used by the WordNet
|
library's searching functions and are needed to run the various WordNet |
|
browsers. <P> |
|
Each index file is an alphabetized list of all the words found |
|
in WordNet in the corresponding part of speech. On each line, following |
|
the word, is a list of byte offsets (<I>synset_offset </I>s) in the corresponding |
|
data file, one for each synset containing the word. Words in the index |
|
file are in lower case only, regardless of how they were entered in the |
|
lexicographer files. This folds various orthographic representations of |
|
the word into one line enabling database searches to be case insensitive. |
|
See <A HREF="wninput.5WN.html"><B>wninput</B>(5WN)</A>
for a detailed description of the lexicographer files.
<P>
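<P>
As a rough illustration of the case folding described above, the sketch below folds a query to the index file's form and binary-searches a sorted list of index lines held in memory. The function names <B>normalize_lemma</B> and <B>find_index_line</B> and the sample lines are hypothetical and are not part of the WordNet library. The sketch assumes the lines are sorted in plain byte order.

```python
import bisect

def normalize_lemma(word):
    """Fold a query to the index form: lower case, underscores for spaces."""
    return word.strip().lower().replace(" ", "_")

def find_index_line(sorted_lines, word):
    """Return the index line whose first field is the folded word, or None."""
    prefix = normalize_lemma(word) + " "
    i = bisect.bisect_left(sorted_lines, prefix)
    if i != len(sorted_lines) and sorted_lines[i].startswith(prefix):
        return sorted_lines[i]
    return None

# Illustrative lines only (not copied from a real index file).
lines = [
    "  1 header line that sorts first because of its leading spaces",
    "cat n 1 0 1 0 00000100",
    "hot_dog n 1 0 1 0 00000200",
]
print(find_index_line(lines, "Hot Dog"))
```

Because every stored lemma is lower case, folding the query once is enough to make the search case insensitive.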
|
A data file for a syntactic category contains information corresponding |
|
to the synsets that were specified in the lexicographer files, with relational |
|
pointers resolved to <I>synset_offset </I>s. Each line corresponds to a synset. |
|
Pointers are followed and hierarchies traversed by moving from one synset |
|
to another via the <I>synset_offset </I>s. <P> |
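<P>
The following sketch shows one way a <I>synset_offset</I> could be used to fetch a synset line: seek to the byte offset and read to the end of the line. The helper name <B>read_synset_line</B> is hypothetical, and the two-line in-memory file is invented for the example. A real data file starts with license header lines and would be opened in binary mode.

```python
import io

def read_synset_line(data_file, synset_offset):
    """Seek to a byte offset in a binary file object and return that line."""
    data_file.seek(synset_offset)
    return data_file.readline().decode("ascii").rstrip("\n")

# Build a tiny in-memory stand-in for a data file (invented content).
first = b"00000000 03 n 01 entity 0 000 | illustrative gloss one\n"
second_offset = len(first)
second = ("%08d 03 n 01 thing 0 000 | illustrative gloss two\n"
          % second_offset).encode("ascii")
fake_data = io.BytesIO(first + second)

line = read_synset_line(fake_data, second_offset)
# The first field of the line read back equals the offset used to reach it.
print(line.split()[0] == "%08d" % second_offset)
```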
|
The exception list files, <I>pos </I><B>.exc |
|
</B>, are used to help the morphological processor find base forms from irregular |
|
inflections. <P> |
|
The files <B>sentidx.vrb </B> and <B>sents.vrb </B> contain sentences illustrating |
|
the use of specific senses of some verbs. These files are used by the |
|
searching software in response to a request for verb sentence frames. |
|
Generic sentence frames are displayed when an illustrative sentence is |
|
not present. <P> |
|
The various database files are in ASCII formats that are |
|
easily read by both humans and machines. All fields, unless otherwise |
|
noted, are separated by one space character, and all lines are terminated |
|
by a newline character. Fields enclosed in italicized square brackets |
|
may not be present. <P> |
|
See <B><A HREF="wngloss.7WN.html">wngloss</B>(7WN)</A> |
|
for a glossary of WordNet terminology |
|
and a discussion of the database's content and logical organization. |
|
<H3><A NAME="sect2" HREF="#toc2">Index |
|
File Format </A></H3> |
|
Each index file begins with several lines containing a copyright |
|
notice, version number and license agreement. These lines all begin with |
|
two spaces and the line number so they do not interfere with the binary |
|
search algorithm that is used to look up entries in the index files. All |
|
other lines are in the following format. In the field descriptions, <B>number |
|
</B> always refers to a decimal integer unless otherwise defined. <P> |
|
<I>lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt |
|
synset_offset [synset_offset...] </I> <BR> |
|
<P> |
|
|
|
<DL> |
|
|
|
<DT><I>lemma</I> </DT> |
|
<DD>lower case ASCII text of word |
|
or collocation. Collocations are formed by joining individual words with |
|
an underscore (<B>_ </B>) character. </DD> |
|
|
|
<DT><I>pos</I> </DT> |
|
<DD>Syntactic category: <B>n </B> for noun files, |
|
<B>v </B> for verb files, <B>a </B> for adjective files, <B>r </B> for adverb files. </DD> |
|
</DL> |
|
<P> |
|
<P> |
|
All remaining |
|
fields are with respect to senses of <I>lemma </I> in <I>pos </I>. <P> |
|
|
|
<DL> |
|
|
|
<DT><I>synset_cnt</I> </DT> |
|
<DD>Number |
|
of synsets that <I>lemma </I> is in. This is the number of senses of the word |
|
in WordNet. See <FONT SIZE=-1><B>Sense Numbers </B></FONT> |
|
below for a discussion of how sense numbers |
|
are assigned and the order of <I>synset_offset </I>s in the index files. </DD> |
|
|
|
<DT><I>p_cnt</I> |
|
</DT> |
|
<DD>Number of different pointers that <I>lemma </I> has in all synsets containing |
|
it. </DD> |
|
|
|
<DT><I>ptr_symbol</I> </DT> |
|
<DD>A space separated list of <I>p_cnt </I> different types of pointers |
|
that <I>lemma </I> has in all synsets containing it. See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A> |
|
for a list |
|
of <I>pointer_symbol </I>s. If all senses of <I>lemma </I> have no pointers, this field |
|
is omitted and <I>p_cnt </I> is <B>0 </B>. </DD> |
|
|
|
<DT><I>sense_cnt</I> </DT> |
|
<DD>Same as <I>synset_cnt</I> above. This
is redundant, but the field was preserved for compatibility reasons. </DD>
|
|
|
<DT><I>tagsense_cnt</I> |
|
</DT> |
|
<DD>Number of senses of <I>lemma </I> that are ranked according to their frequency |
|
of occurrence in semantic concordance texts. </DD> |
|
|
|
<DT><I>synset_offset</I> </DT> |
|
<DD>Byte offset |
|
in <B>data.<I>pos </I></B> file of a synset containing <I>lemma </I>. Each <I>synset_offset </I> in |
|
the list corresponds to a different sense of <I>lemma </I> in WordNet. <I>synset_offset |
|
</I> is an 8 digit, zero-filled decimal integer that can be used with <B><A HREF="fseek.3.html">fseek</B>(3)</A> |
|
|
|
to read a synset from the data file. When passed to <B><A HREF="read_synset.3WN.html">read_synset</B>(3WN)</A> |
|
along |
|
with the syntactic category, a data structure containing the parsed synset |
|
is returned. </DD> |
|
</DL> |
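<P>
The field layout above can be sketched as a small parser. This is a minimal illustration, not the WordNet library's code: the names <B>IndexEntry</B> and <B>parse_index_line</B> are hypothetical, and the sample line is invented. The pointer symbols in it are placeholders for whatever wninput(5WN) defines.

```python
from typing import List, NamedTuple

class IndexEntry(NamedTuple):
    lemma: str
    pos: str
    synset_cnt: int
    ptr_symbols: List[str]
    sense_cnt: int
    tagsense_cnt: int
    synset_offsets: List[int]

def parse_index_line(line):
    """Parse one non-header index line into an IndexEntry."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    # p_cnt tells us how many pointer symbols follow (possibly none).
    ptr_symbols = fields[4:4 + p_cnt]
    rest = fields[4 + p_cnt:]
    sense_cnt, tagsense_cnt = int(rest[0]), int(rest[1])
    synset_offsets = [int(f) for f in rest[2:]]
    if len(synset_offsets) != synset_cnt:
        raise ValueError("expected %d offsets, found %d"
                         % (synset_cnt, len(synset_offsets)))
    return IndexEntry(lemma, pos, synset_cnt, ptr_symbols,
                      sense_cnt, tagsense_cnt, synset_offsets)

# Invented line: two senses, three pointer symbol types, one tagged sense.
entry = parse_index_line("example n 2 3 @ ~ + 2 1 00001234 00005678")
print(entry.lemma, entry.ptr_symbols, entry.synset_offsets)
```

Reading <I>p_cnt</I> before slicing is what lets the parser handle entries with no pointer symbols at all.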
|
|
|
<H3><A NAME="sect3" HREF="#toc3">Data File Format </A></H3> |
|
Each data file begins with several lines |
|
containing a copyright notice, version number and license agreement. These |
|
lines all begin with two spaces and the line number. All other lines are |
|
in the following format. Integer fields are of fixed length, and are zero-filled. |
|
<P> |
|
<I>synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] <B>| |
|
</B></I><I> gloss </I> <BR> |
|
<P> |
|
|
|
<DL> |
|
|
|
<DT><I>synset_offset</I> </DT> |
|
<DD>Current byte offset in the file represented |
|
as an 8 digit decimal integer. </DD> |
|
|
|
<DT><I>lex_filenum</I> </DT> |
|
<DD>Two digit decimal integer |
|
corresponding to the lexicographer file name containing the synset. See |
|
<B><A HREF="lexnames.5WN.html">lexnames</B>(5WN)</A> |
|
for the list of filenames and their corresponding numbers. |
|
</DD> |
|
|
|
<DT><I>ss_type</I> </DT> |
|
<DD>One character code indicating the synset type: </DD> |
|
</DL> |
|
<P> |
|
<blockquote><B>n </B><tt> </tt> <tt> </tt> NOUN <BR> |
|
<B>v </B><tt> </tt> <tt> </tt> VERB |
|
<BR> |
|
<B>a </B><tt> </tt> <tt> </tt> ADJECTIVE <BR> |
|
<B>s </B><tt> </tt> <tt> </tt> ADJECTIVE SATELLITE <BR> |
|
<B>r </B><tt> </tt> <tt> </tt> ADVERB <BR> |
|
</blockquote> |
|
|
|
<DL> |
|
|
|
<DT><I>w_cnt</I> </DT> |
|
<DD>Two digit hexadecimal |
|
integer indicating the number of words in the synset. </DD> |
|
|
|
<DT><I>word</I> </DT> |
|
<DD>ASCII form |
|
of a word as entered in the synset by the lexicographer, with spaces replaced |
|
by underscore characters (<B>_ </B>). The text of the word is case sensitive, |
|
in contrast to its form in the corresponding <B>index. </B><I>pos </I> file, that contains |
|
only lower-case forms. In <B>data.adj </B>, a <I>word </I> is followed by a syntactic |
|
marker if one was specified in the lexicographer file. A syntactic marker |
|
is appended, in parentheses, onto <I>word </I> without any intervening spaces. |
|
See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A> |
|
for a list of the syntactic markers for adjectives. </DD> |
|
|
|
<DT><I>lex_id</I> |
|
</DT> |
|
<DD>One digit hexadecimal integer that, when appended onto <I>lemma </I>, uniquely |
|
identifies a sense within a lexicographer file. <I>lex_id </I> numbers usually |
|
start with <B>0 </B>, and are incremented as additional senses of the word are |
|
added to the same file, although there is no requirement that the numbers |
|
be consecutive or begin with <B>0 </B>. Note that a value of <B>0 </B> is the default, |
|
and therefore is not present in lexicographer files. </DD> |
|
|
|
<DT><I>p_cnt</I> </DT> |
|
<DD>Three digit |
|
decimal integer indicating the number of pointers from this synset to |
|
other synsets. If <I>p_cnt </I> is <B>000 </B> the synset has no pointers. </DD> |
|
|
|
<DT><I>ptr</I> </DT> |
|
<DD>A pointer |
|
from this synset to another. <I>ptr </I> is of the form: </DD> |
|
</DL> |
|
<P> |
|
<I>pointer_symbol synset_offset pos source/target |
|
</I> <BR> |
|
<P> |
|
where <I>synset_offset </I> is the byte offset of the target synset in the |
|
data file corresponding to <I>pos </I>. <P> |
|
The <I>source/target </I> field distinguishes |
|
lexical and semantic pointers. It is a four byte field, containing two |
|
two-digit hexadecimal integers. The first two digits indicate the word
number in the current (source) synset; the last two digits indicate the
|
word number in the target synset. A value of <B>0000 </B> means that <I>pointer_symbol |
|
</I> represents a semantic relation between the current (source) synset and |
|
the target synset indicated by <I>synset_offset </I>. <P> |
|
A lexical relation between |
|
two words in different synsets is represented by non-zero values in the |
|
source and target word numbers. The first and last two bytes of this field |
|
indicate the word numbers in the source and target synsets, respectively, |
|
between which the relation holds. Word numbers are assigned to the <I>word |
|
</I> fields in a synset, from left to right, beginning with <B>1 </B>. <P> |
|
See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A> |
|
|
|
for a list of <I>pointer_symbol </I>s, and semantic and lexical pointer classifications. |
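<P>
To make the <I>source/target</I> encoding concrete, here is a short sketch that splits the four-character field into its two hexadecimal word numbers and reports whether the pointer is semantic or lexical. The function name <B>decode_source_target</B> is hypothetical.

```python
def decode_source_target(field):
    """Split the four-character source/target field into two word numbers."""
    if len(field) != 4:
        raise ValueError("source/target must be four hexadecimal digits")
    source = int(field[:2], 16)   # word number in the source synset
    target = int(field[2:], 16)   # word number in the target synset
    kind = "semantic" if field == "0000" else "lexical"
    return kind, source, target

print(decode_source_target("0000"))
print(decode_source_target("0a01"))
```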
|
|
|
<DL> |
|
|
|
<DT><I>frames</I> </DT> |
|
<DD>In <B>data.verb </B> only, a list of numbers corresponding to the generic |
|
verb sentence frames for <I>word </I>s in the synset. <I>frames </I> is of the form: |
|
</DD> |
|
</DL> |
|
<P> |
|
<I>f_cnt </I> <B>+ </B> <I> f_num w_num [ </I> <B>+ </B> <I> f_num w_num...] </I> <BR> |
|
<P> |
|
where <I>f_cnt </I> is a two
|
digit decimal integer indicating the number of generic frames listed, |
|
<I>f_num </I> is a two digit decimal integer frame number, and <I>w_num </I> is a two |
|
digit hexadecimal integer indicating the word in the synset that the frame |
|
applies to. As with pointers, if this number is <B>00 </B>, <I>f_num </I> applies to |
|
all <I>word </I>s in the synset. If non-zero, it is applicable only to the word |
|
indicated. Word numbers are assigned as described for pointers. Each <I>f_num w_num |
|
</I> pair is preceded by a <B>+ </B>. See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A> |
|
for the text of the generic |
|
sentence frames. |
|
<DL> |
|
|
|
<DT><I>gloss</I> </DT> |
|
<DD>Each synset contains a gloss. A <I>gloss </I> is represented |
|
as a vertical bar (<B>| </B>), followed by a text string that continues until |
|
the end of the line. The gloss may contain a definition, one or more example |
|
sentences, or both. </DD> |
|
</DL> |
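<P>
The data line format can be read with a counter-driven parser along the following lines. This is a sketch under stated assumptions, not the library's <B>read_synset</B>: the function name <B>parse_data_line</B> is hypothetical, the sample line is invented, and the <I>source/target</I> field is kept as its raw four-character string.

```python
def parse_data_line(line):
    """Parse one synset line from a data file into a dictionary."""
    head, _, gloss = line.partition("|")
    tok = head.split()
    synset = {
        "synset_offset": int(tok[0]),
        "lex_filenum": int(tok[1]),
        "ss_type": tok[2],
        "gloss": gloss.strip(),
    }
    w_cnt = int(tok[3], 16)               # two digit hexadecimal
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append((tok[i], int(tok[i + 1], 16)))   # (word, lex_id)
        i += 2
    p_cnt = int(tok[i])                   # three digit decimal
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, offset, pos, source_target = tok[i:i + 4]
        pointers.append((symbol, int(offset), pos, source_target))
        i += 4
    frames = []
    if i != len(tok):                     # leftover tokens: verb frames
        f_cnt = int(tok[i])
        i += 1
        for _ in range(f_cnt):
            if tok[i] != "+":
                raise ValueError("expected '+' before a frame pair")
            frames.append((int(tok[i + 1]), int(tok[i + 2], 16)))  # (f_num, w_num)
            i += 3
    synset.update(words=words, pointers=pointers, frames=frames)
    return synset

# Invented verb line with two words, two pointers and two frames.
sample = ('00002000 29 v 02 inhale 0 breathe_in 0 002 @ 00001000 v 0000 '
          '+ 00003000 n 0101 02 + 02 00 + 08 01 | draw air in; "inhale deeply"')
parsed = parse_data_line(sample)
print(parsed["words"], parsed["frames"])
```

Driving the loop from <I>w_cnt</I>, <I>p_cnt</I> and <I>f_cnt</I> avoids confusing a <B>+</B> used as a pointer symbol with the <B>+</B> that precedes each frame pair.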
|
|
|
<H3><A NAME="sect4" HREF="#toc4">Sense Numbers </A></H3> |
|
Senses in WordNet are generally ordered |
|
from most to least frequently used, with the most common sense numbered |
|
<B>1 </B>. Frequency of use is determined by the number of times a sense is tagged |
|
in the various semantic concordance texts. Senses that are not semantically |
|
tagged follow the ordered senses. The <I>tagsense_cnt </I> field for each entry |
|
in the <B>index.<I>pos </I></B> files indicates how many of the senses in the list have |
|
been tagged. <P> |
|
The <B><A HREF="cntlist.5WN.html">cntlist</B>(5WN)</A> |
|
file provided with the database lists the |
|
number of times each sense is tagged in the semantic concordances. The |
|
data from <B>cntlist </B> is used by <B><A HREF="grind.1WN.html">grind</B>(1WN)</A> |
|
to order the senses of each word. |
|
When the <B>index </B>.<I>pos </I> files are generated, the <I>synset_offset </I>s are output |
|
in sense number order, with sense 1 first in the list. Senses with the |
|
same number of semantic tags are assigned unique but consecutive sense |
|
numbers. The WordNet <FONT SIZE=-1><B>OVERVIEW </B></FONT> |
|
search displays all senses of the specified |
|
word, in all syntactic categories, and indicates which of the senses are |
|
represented in the semantically tagged texts. |
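<P>
As a small illustration of the ordering described above, the sketch below numbers the <I>synset_offset</I>s of one index entry and marks the first <I>tagsense_cnt</I> of them as tagged. It assumes, following the text, that tagged senses always precede untagged ones. The function name <B>number_senses</B> is hypothetical.

```python
def number_senses(synset_offsets, tagsense_cnt):
    """Pair each offset with its sense number and a tagged flag."""
    return [
        (sense_number, offset, tagsense_cnt >= sense_number)
        for sense_number, offset in enumerate(synset_offsets, start=1)
    ]

# Invented offsets: three senses, of which the first two are tagged.
for row in number_senses([1234, 5678, 9012], 2):
    print(row)
```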
|
<H3><A NAME="sect5" HREF="#toc5">Exception List File Format |
|
</A></H3> |
|
Exception lists are alphabetized lists of inflected forms of words and |
|
their base forms. The first field of each line is an inflected form, followed |
|
by a space separated list of one or more base forms of the word. There |
|
is one exception list file for each syntactic category. <P> |
|
Note that the |
|
noun and verb exception lists were automatically generated from a machine-readable |
|
dictionary, and contain many words that are not in WordNet. Also, for |
|
many of the inflected forms, base forms could be easily derived using |
|
the standard rules of detachment programmed into Morphy (see <A HREF="morphy.7WN.html"><B>morphy</B>(7WN)</A>).
|
These anomalies are allowed to remain in the exception list files, as |
|
they do no harm. <P> |
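<P>
An exception list can be loaded into a lookup table with a few lines of code. The sketch below is illustrative only: the function name <B>load_exceptions</B> is hypothetical and the two sample lines are given as examples of the format, not as quotations from a specific release.

```python
def load_exceptions(lines):
    """Map each inflected form to its list of base forms."""
    table = {}
    for line in lines:
        fields = line.split()
        if fields:                      # skip blank lines
            table[fields[0]] = fields[1:]
    return table

exc_lines = ["geese goose", "axes ax axis"]
exceptions = load_exceptions(exc_lines)
print(exceptions.get("axes", []))
```

A form that is missing from the table would then be handed to the regular rules of detachment.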
|
|
|
<H3><A NAME="sect6" HREF="#toc6">Verb Example Sentences </A></H3> |
|
For some verb senses, example |
|
sentences illustrating the use of the verb sense can be displayed. Each |
|
line of the file <B>sentidx.vrb </B> contains a <I>sense_key </I> followed by a space |
|
and a comma separated list of example sentence template numbers, in decimal. |
|
The file <B>sents.vrb </B> lists all of the example sentence templates. Each |
|
line begins with the template number followed by a space. The rest of |
|
the line is the text of a template example sentence, with <B>%s </B> used as |
|
a placeholder in the text for the verb. Both files are sorted alphabetically
so that the <I>sense_key </I> and template sentence number can be used as indices,
via <A HREF="binsrch.3WN.html"><B>binsrch</B>(3WN)</A>, into the appropriate file. <P>
|
When a request for <FONT SIZE=-1><B>FRAMES |
|
</B></FONT> |
|
is made, the WordNet search code looks for the sense in <B>sentidx.vrb </B>. |
|
If found, the listed sentence templates are retrieved from <B>sents.vrb</B>,
and the <B>%s</B> is replaced with the verb. If the sense is not found, the
applicable generic sentence frames listed in <I>frames</I> are displayed.
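<P>
The lookup just described can be sketched as follows. This is a simplified illustration that loads both files into dictionaries instead of using a binary search. The function names, the sample <I>sense_key</I> and the template sentences are all invented for the example.

```python
def parse_sentidx_line(line):
    """Split a sentidx.vrb line into (sense_key, [template numbers])."""
    sense_key, _, numbers = line.strip().partition(" ")
    return sense_key, [int(n) for n in numbers.split(",") if n]

def parse_sents_line(line):
    """Split a sents.vrb line into (template number, template text)."""
    number, _, template = line.rstrip("\n").partition(" ")
    return int(number), template

def example_sentences(sense_key, sentidx_lines, sents_lines, verb):
    """Return the filled-in example sentences for a sense, or an empty list."""
    index = dict(parse_sentidx_line(l) for l in sentidx_lines)
    templates = dict(parse_sents_line(l) for l in sents_lines)
    return [templates[n].replace("%s", verb) for n in index.get(sense_key, [])]

# Invented file contents.
sentidx = ["wander%2:38:00:: 1,2"]
sents = ["1 The children %s in the park", "2 They %s all day"]
print(example_sentences("wander%2:38:00::", sentidx, sents, "wander"))
```

An empty result corresponds to the case where the sense is not found and the generic frames would be shown.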
|
<H2><A NAME="sect7" HREF="#toc7">NOTES |
|
</A></H2> |
|
Information in the <B>data.<I>pos </I></B> and <B>index.<I>pos </I></B> files represents all of the |
|
word senses and synsets in the WordNet database. The <I>word </I>, <I>lex_id </I>, and |
|
<I>lex_filenum </I> fields together uniquely identify each word sense in WordNet. |
|
These can be encoded in a <I>sense_key </I> as described in <B><A HREF="senseidx.5WN.html">senseidx</B>(5WN)</A> |
|
. Each |
|
synset in the database can be uniquely identified by combining the <I>synset_offset |
|
</I> for the synset with a code for the syntactic category (since it is possible |
|
for synsets in different <B>data.<I>pos </I></B> files to have the same <I>synset_offset |
|
</I>). <P> |
|
The WordNet system provides both command line and window-based browser
|
interfaces to the database. Both interfaces utilize a common library of |
|
search and morphology code. The source code for the library and interfaces |
|
is included in the WordNet package. See <B><A HREF="wnintro.3WN.html">wnintro</B>(3WN)</A> |
|
for an overview of |
|
the WordNet source code. |
|
<H2><A NAME="sect8" HREF="#toc8">ENVIRONMENT VARIABLES (UNIX) </A></H2> |
|
|
|
<DL> |
|
|
|
<DT><B>WNHOME</B> </DT> |
|
<DD>Base directory |
|
for WordNet. Default is <B>/usr/local/WordNet-3.0 </B>. </DD> |
|
|
|
<DT><B>WNSEARCHDIR</B> </DT> |
|
<DD>Directory in |
|
which the WordNet database has been installed. Default is <B>WNHOME/dict |
|
</B>. </DD> |
|
</DL> |
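<P>
A program that wants to locate the database files could resolve the directory from these variables roughly as shown below. This is a sketch that mirrors the documented defaults; the function name <B>search_dir</B> is hypothetical and the WordNet library's own resolution logic may differ in detail.

```python
import os

DEFAULT_WNHOME = "/usr/local/WordNet-3.0"

def search_dir(environ=None):
    """Return WNSEARCHDIR if set, otherwise WNHOME/dict with the default WNHOME."""
    env = os.environ if environ is None else environ
    home = env.get("WNHOME", DEFAULT_WNHOME)
    return env.get("WNSEARCHDIR", home + "/dict")

print(search_dir({}))
print(search_dir({"WNHOME": "/opt/wn"}))
```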
|
|
|
<H2><A NAME="sect9" HREF="#toc9">REGISTRY (WINDOWS) </A></H2> |
|
|
|
<DL> |
|
|
|
<DT><B>HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome</B> </DT> |
|
<DD>Base directory |
|
for WordNet. Default is <B>C:\Program Files\WordNet\3.0 </B>. </DD> |
|
</DL> |
|
|
|
<H2><A NAME="sect10" HREF="#toc10">FILES </A></H2> |
|
|
|
<DL> |
|
|
|
<DT><B>index.<I>pos </I></B> </DT> |
|
<DD>database |
|
index files </DD> |
|
|
|
<DT><B>data.<I>pos </I></B> </DT> |
|
<DD>database data files </DD> |
|
|
|
<DT><B>*.vrb</B> </DT> |
|
<DD>files of sentences illustrating |
|
the use of verbs </DD> |
|
|
|
<DT><B><I>pos </I>.exc</B> </DT> |
|
<DD>morphology exception lists </DD> |
|
</DL> |
|
|
|
<H2><A NAME="sect11" HREF="#toc11">SEE ALSO </A></H2> |
|
<B><A HREF="grind.1WN.html">grind</B>(1WN)</A> |
|
, |
|
<B><A HREF="wn.1WN.html">wn</B>(1WN)</A> |
|
, <B><A HREF="wnb.1WN.html">wnb</B>(1WN)</A> |
|
, <B><A HREF="wnintro.3WN.html">wnintro</B>(3WN)</A> |
|
, <B><A HREF="binsrch.3WN.html">binsrch</B>(3WN)</A> |
|
, <B><A HREF="wnintro.5WN.html">wnintro</B>(5WN)</A> |
|
, <B><A HREF="cntlist.5WN.html">cntlist</B>(5WN)</A> |
|
, |
|
<B><A HREF="lexnames.5WN.html">lexnames</B>(5WN)</A> |
|
, <B><A HREF="senseidx.5WN.html">senseidx</B>(5WN)</A> |
|
, <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A> |
|
, <B><A HREF="morphy.7WN.html">morphy</B>(7WN)</A> |
|
, <B><A HREF="wngloss.7WN.html">wngloss</B>(7WN)</A> |
|
, |
|
<B><A HREF="wngroups.7WN.html">wngroups</B>(7WN)</A> |
|
, <B><A HREF="wnstats.7WN.html">wnstats</B>(7WN)</A> |
|
. <P> |
|
|
|
<HR><P> |
|
<A NAME="toc"><B>Table of Contents</B></A><P> |
|
<UL> |
|
<LI><A NAME="toc0" HREF="#sect0">NAME</A></LI> |
|
<LI><A NAME="toc1" HREF="#sect1">DESCRIPTION</A></LI> |
|
<UL> |
|
<LI><A NAME="toc2" HREF="#sect2">Index File Format</A></LI> |
|
<LI><A NAME="toc3" HREF="#sect3">Data File Format</A></LI> |
|
<LI><A NAME="toc4" HREF="#sect4">Sense Numbers</A></LI> |
|
<LI><A NAME="toc5" HREF="#sect5">Exception List File Format</A></LI> |
|
<LI><A NAME="toc6" HREF="#sect6">Verb Example Sentences</A></LI> |
|
</UL> |
|
<LI><A NAME="toc7" HREF="#sect7">NOTES</A></LI> |
|
<LI><A NAME="toc8" HREF="#sect8">ENVIRONMENT VARIABLES (UNIX)</A></LI> |
|
<LI><A NAME="toc9" HREF="#sect9">REGISTRY (WINDOWS)</A></LI> |
|
<LI><A NAME="toc10" HREF="#sect10">FILES</A></LI> |
|
<LI><A NAME="toc11" HREF="#sect11">SEE ALSO</A></LI> |
|
</UL> |
|
</BODY></HTML> |
|
|