<HTML> |
|
<HEAD> |
|
<TITLE>SENSEIDX(5WN) manual page</TITLE> |
|
</HEAD> |
|
<BODY> |
|
<A HREF="#toc">Table of Contents</A><P> |
|
|
|
<H2><A NAME="sect0" HREF="#toc0">NAME </A></H2> |
|
index.sense, sense.idx - WordNet's sense index |
|
<H2><A NAME="sect1" HREF="#toc1">DESCRIPTION </A></H2> |
|
The WordNet |
|
sense index provides an alternate method for accessing synsets and word |
|
senses in the WordNet database. It is useful to applications that retrieve |
|
synsets or other information related to a specific sense in WordNet, rather |
|
than all the senses of a word or collocation. It can also be used with |
|
tools like <B>grep </B> and Perl to find all senses of a word in one or more |
|
parts of speech. A specific WordNet sense, encoded as a <I>sense_key </I>, can |
|
be used as an index into this file to obtain its WordNet sense number, |
|
the database byte offset of the synset containing the sense, and the number |
|
of times it has been tagged in the semantic concordance texts. <P> |
|
Concatenating the <I>lemma</I> and <I>lex_sense</I> fields of a semantically tagged word (represented
in a <B>&lt;wf</B> ... <B>&gt;</B> attribute/value pair) in a semantic concordance file, using
<B>%</B> as the concatenation character, creates the <I>sense_key</I> for that sense,
which can in turn be used to search the sense index file. <P>
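The concatenation described above can be sketched in Python. This is a minimal illustration, not part of the WordNet distribution: the function name <I>make_sense_key</I> and the sample field values are hypothetical, and normalizing the lemma to lower case with underscores is an assumption based on the <I>lemma</I> format described under Sense Key Encoding.

```python
def make_sense_key(lemma: str, lex_sense: str) -> str:
    """Join lemma and lex_sense with '%' to form a sense_key."""
    # Assumption: normalize the lemma to the form used in the sense index
    # (lower case, words of a collocation joined by '_').
    normalized = lemma.lower().replace(" ", "_")
    return normalized + "%" + lex_sense


# Hypothetical field values, as they might be read from a tagged word.
print(make_sense_key("example", "1:03:00::"))
```

The resulting string can then be used as the search key into the sense index file. <P>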
|
A <I>sense_key |
|
</I> is the best way to represent a sense in semantic tagging or other systems |
|
that refer to WordNet senses. <I>sense_key </I>s are independent of WordNet sense |
|
numbers and <I>synset_offset </I>s, which vary between versions of the database. |
|
Using the sense index and a <I>sense_key </I>, the corresponding synset (via |
|
the <I>synset_offset </I>) and WordNet sense number can easily be obtained. A |
|
mapping from noun <I>sense_key</I>s in WordNet 1.6 to corresponding 2.0 <I>sense_key</I>s
is provided with version 2.0, and is described in <A HREF="sensemap.5WN.html"><B>sensemap</B>(5WN)</A>. <P>
|
See <A HREF="wndb.5WN.html"><B>wndb</B>(5WN)</A>
for a thorough discussion of the WordNet database files.
|
<H3><A NAME="sect2" HREF="#toc2">File |
|
Format </A></H3> |
|
The sense index file lists all of the senses in the WordNet database |
|
with each line representing one sense. The file is in alphabetical order, |
|
fields are separated by one space, and each line is terminated with a |
|
newline character. <P> |
|
Each line is of the form: <P> |
|
<blockquote><I>sense_key synset_offset sense_number tag_cnt |
|
</I> </blockquote> |
|
<P> |
|
<I>sense_key </I> is an encoding of the word sense. Programs can construct |
|
a sense key in this format and use it as a binary search key into the |
|
sense index file. The format of a <I>sense_key </I> is described below. <P> |
|
<I>synset_offset</I> is the byte offset at which the synset containing the sense is found in
the database "data" file corresponding to the part of speech encoded in
the <I>sense_key</I>. <I>synset_offset</I> is an 8 digit, zero-filled decimal integer,
and can be used with <A HREF="fseek.3.html"><B>fseek</B>(3)</A>
to read a synset from the data file. When <I>synset_offset</I> is
passed to the WordNet library function <B>read_synset()</B> along with the syntactic
category, the function returns a data structure containing the parsed synset. <P>
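The seek-and-read step can be sketched in Python as follows. This is a hedged illustration that stands in for the C <B>fseek</B>(3) call; the function name <I>read_synset_line</I> is hypothetical, and the sketch assumes each synset occupies a single newline-terminated line of the data file.

```python
def read_synset_line(data_path: str, synset_offset: str) -> str:
    """Return the raw line that starts at synset_offset in a data file."""
    # Open in binary mode so that the offset is counted in bytes.
    with open(data_path, "rb") as data_file:
        # The offset is zero-filled decimal, so parse it explicitly in base 10.
        data_file.seek(int(synset_offset, 10))
        raw = data_file.readline()
    return raw.decode("ascii", errors="replace").rstrip("\n")
```

The returned string is the unparsed synset line; interpreting its fields is left to the caller. <P>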
|
<I>sense_number |
|
</I> is a decimal integer indicating the sense number of the word, within |
|
the part of speech encoded in <I>sense_key </I>, in the WordNet database. See |
|
<A HREF="wndb.5WN.html"><B>wndb</B>(5WN)</A>
for information about how sense numbers are assigned. <P>
|
<I>tag_cnt</I> is a decimal integer giving the number of times the sense is tagged in the
semantic concordance texts. A <I>tag_cnt</I> of <B>0</B> indicates that the sense
has not been semantically tagged.
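<P>
The line format and the binary search by <I>sense_key</I> can be sketched in Python. This is a minimal sketch under stated assumptions: the names <I>SenseEntry</I>, <I>parse_line</I>, <I>lookup</I> and <I>SAMPLE</I> are hypothetical, the sample lines are invented for illustration, and the lines are assumed to be held in memory in an order consistent with Python string comparison.

```python
from bisect import bisect_left
from typing import NamedTuple, Optional, Sequence


class SenseEntry(NamedTuple):
    sense_key: str
    synset_offset: int
    sense_number: int
    tag_cnt: int


def parse_line(line: str) -> SenseEntry:
    """Split one sense index line into its four space-separated fields."""
    key, offset, number, count = line.rstrip("\n").split(" ")
    return SenseEntry(key, int(offset), int(number), int(count))


def lookup(sorted_lines: Sequence[str], sense_key: str) -> Optional[SenseEntry]:
    """Binary search for sense_key in a sorted sequence of index lines."""
    probe = sense_key + " "  # the key is followed by exactly one space
    i = bisect_left(sorted_lines, probe)
    if i < len(sorted_lines) and sorted_lines[i].startswith(probe):
        return parse_line(sorted_lines[i])
    return None


# Invented sample lines; real values come from the installed sense index.
SAMPLE = [
    "alpha%1:03:00:: 00000100 1 3",
    "alpha%1:03:01:: 00000200 2 0",
    "beta%2:30:00:: 00000300 1 7",
]
print(lookup(SAMPLE, "alpha%1:03:01::"))
```

For a large index, a search over the file itself avoids loading every line into memory.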
|
<H3><A NAME="sect3" HREF="#toc3">Sense Key Encoding </A></H3> |
|
A <I>sense_key </I> is represented |
|
as: <P> |
|
<blockquote><I>lemma </I><B>% </B><I>lex_sense </I> </blockquote> |
|
<P> |
|
where <I>lex_sense </I> is encoded as: <P> |
|
<blockquote><I>ss_type</I><B>:</B><I>lex_filenum</I><B>:</B><I>lex_id</I><B>:</B><I>head_word</I><B>:</B><I>head_id</I> </blockquote>
|
<P> |
|
<I>lemma</I> is the ASCII text of the word or collocation as found in the
WordNet database index file for the sense's part of speech (<I>pos</I>). <I>lemma</I> is in lower case,
and collocations are formed by joining individual words with an underscore
(<B>_</B>) character. <P>
|
<I>ss_type </I> is a one digit decimal integer representing the |
|
synset type for the sense. See <FONT SIZE=-1><B>Synset Type </B></FONT> |
|
below for a listing of the |
|
numbers corresponding to each synset type. <P> |
|
<I>lex_filenum </I> is a two digit |
|
decimal integer representing the name of the lexicographer file containing |
|
the synset for the sense. See <A HREF="lexnames.5WN.html"><B>lexnames</B>(5WN)</A>
for the list of lexicographer
file names and their corresponding numbers. <P>
|
<I>lex_id </I> is a two digit decimal |
|
integer that, when appended onto <I>lemma </I>, uniquely identifies a sense within |
|
a lexicographer file. <I>lex_id </I> numbers usually start with <B>00 </B>, and are incremented |
|
as additional senses of the word are added to the same file, although |
|
there is no requirement that the numbers be consecutive or begin with |
|
<B>00 </B>. Note that a value of <B>00 </B> is the default, and therefore is not present |
|
in lexicographer files. Only non-default <I>lex_id </I> values must be explicitly |
|
assigned in lexicographer files. See <A HREF="wninput.5WN.html"><B>wninput</B>(5WN)</A>
for information on the
format of lexicographer files. <P>
|
<I>head_word </I> is only present if the sense |
|
is in an adjective satellite synset. It is the lemma of the first word |
|
of the satellite's head synset. <P> |
|
<I>head_id </I> is a two digit decimal integer |
|
that, when appended onto <I>head_word </I>, uniquely identifies the sense of |
|
<I>head_word </I> within a lexicographer file, as described for <I>lex_id </I>. There |
|
is a value in this field only if <I>head_word </I> is present. |
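<P>
The encoding above can be taken apart with a short Python sketch. The names <I>SenseKey</I> and <I>parse_sense_key</I> are hypothetical, the keys in the example are invented, and the sketch assumes that <I>lex_sense</I> contains no <B>%</B> character and that <I>head_word</I> contains no <B>:</B> character.

```python
from typing import NamedTuple, Optional


class SenseKey(NamedTuple):
    lemma: str
    ss_type: int
    lex_filenum: int
    lex_id: int
    head_word: Optional[str]  # None unless the sense is an adjective satellite
    head_id: Optional[int]    # None unless head_word is present


def parse_sense_key(sense_key: str) -> SenseKey:
    """Split a sense_key into lemma and the five lex_sense fields."""
    # Assumption: lex_sense contains no '%', so split at the last one.
    lemma, lex_sense = sense_key.rsplit("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return SenseKey(
        lemma,
        int(ss_type),
        int(lex_filenum),
        int(lex_id),
        head_word or None,
        int(head_id) if head_id else None,
    )


# Invented keys: one ordinary sense and one adjective satellite sense.
print(parse_sense_key("example%1:03:00::"))
print(parse_sense_key("tiny%5:00:00:small:00"))
```

Because the field separators are always present, splitting <I>lex_sense</I> on <B>:</B> yields five fields even when the last two are empty.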
|
<H3><A NAME="sect4" HREF="#toc4">Synset Type </A></H3> |
|
The |
|
synset type is encoded as follows: <P> |
|
<blockquote><B>1 </B><tt> </tt> <tt> </tt> NOUN <BR> |
|
<B>2 </B><tt> </tt> <tt> </tt> VERB <BR> |
|
<B>3 </B><tt> </tt> <tt> </tt> ADJECTIVE <BR> |
|
<B>4 </B><tt> </tt> <tt> </tt> ADVERB |
|
<BR> |
|
<B>5 </B><tt> </tt> <tt> </tt> ADJECTIVE SATELLITE <BR> |
|
</blockquote> |
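The table above can be expressed as a Python lookup. This is a sketch only: the names <I>SS_TYPE_NAMES</I>, <I>DATA_FILE_SUFFIX</I> and <I>data_file_for</I> are hypothetical, and the mapping to data file names assumes the <B>data.</B><I>pos</I> naming described in wndb(5WN), with adjective satellites stored in the adjective data file.

```python
# Synset type numbers as listed in the table above.
SS_TYPE_NAMES = {
    1: "NOUN",
    2: "VERB",
    3: "ADJECTIVE",
    4: "ADVERB",
    5: "ADJECTIVE SATELLITE",
}

# Assumption: satellites share the adjective data file.
DATA_FILE_SUFFIX = {1: "noun", 2: "verb", 3: "adj", 4: "adv", 5: "adj"}


def data_file_for(ss_type: int) -> str:
    """Return the name of the data file holding synsets of this type."""
    return "data." + DATA_FILE_SUFFIX[ss_type]


print(SS_TYPE_NAMES[5], data_file_for(5))
```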
|
|
|
<H2><A NAME="sect5" HREF="#toc5">NOTES </A></H2> |
|
For non-satellite senses the <I>head_word</I> and <I>head_id</I> fields have no values;
however, the field separator character (<B>:</B>) is still present.
|
<H2><A NAME="sect6" HREF="#toc6">ENVIRONMENT VARIABLES (UNIX) </A></H2> |
|
|
|
<DL> |
|
|
|
<DT><B>WNHOME</B> </DT> |
|
<DD>Base directory |
|
for WordNet. Default is <B>/usr/local/WordNet-3.0 </B>. </DD> |
|
|
|
<DT><B>WNSEARCHDIR</B> </DT> |
|
<DD>Directory in |
|
which the WordNet database has been installed. Default is <B>WNHOME/dict |
|
</B>. </DD> |
|
</DL> |
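How the two variables and their defaults combine can be sketched in Python. The function name <I>wordnet_search_dir</I> is hypothetical, and giving <B>WNSEARCHDIR</B> precedence over <B>WNHOME</B> is an assumption drawn from the defaults listed above, not a description of the WordNet library's own code.

```python
import os
from typing import Mapping, Optional


def wordnet_search_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the database directory from WNSEARCHDIR and WNHOME."""
    env = os.environ if env is None else env
    # An explicit WNSEARCHDIR is used as given.
    if "WNSEARCHDIR" in env:
        return env["WNSEARCHDIR"]
    # Otherwise fall back to WNHOME/dict, using the documented default base.
    home = env.get("WNHOME", "/usr/local/WordNet-3.0")
    return home + "/dict"


print(wordnet_search_dir({}))
```

Passing a mapping instead of reading <I>os.environ</I> directly keeps the function easy to exercise with different settings.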
|
|
|
<H2><A NAME="sect7" HREF="#toc7">REGISTRY (WINDOWS) </A></H2> |
|
|
|
<DL> |
|
|
|
<DT><B>HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome</B> </DT> |
|
<DD>Base directory |
|
for WordNet. Default is <B>C:\Program Files\WordNet\3.0 </B>. </DD> |
|
</DL> |
|
|
|
<H2><A NAME="sect8" HREF="#toc8">FILES </A></H2> |
|
|
|
<DL> |
|
|
|
<DT><B>index.sense</B> </DT> |
|
<DD>sense |
|
index </DD> |
|
</DL> |
|
|
|
<H2><A NAME="sect9" HREF="#toc9">SEE ALSO </A></H2> |
|
<A HREF="binsrch.3WN.html"><B>binsrch</B>(3WN)</A>,
<A HREF="wnsearch.3WN.html"><B>wnsearch</B>(3WN)</A>,
<A HREF="lexnames.5WN.html"><B>lexnames</B>(5WN)</A>,
<A HREF="wnintro.5WN.html"><B>wnintro</B>(5WN)</A>,
<A HREF="sensemap.5WN.html"><B>sensemap</B>(5WN)</A>,
<A HREF="wndb.5WN.html"><B>wndb</B>(5WN)</A>,
<A HREF="wninput.5WN.html"><B>wninput</B>(5WN)</A>. <P>
|
|
|
<HR><P> |
|
<A NAME="toc"><B>Table of Contents</B></A><P> |
|
<UL> |
|
<LI><A NAME="toc0" HREF="#sect0">NAME</A></LI> |
|
<LI><A NAME="toc1" HREF="#sect1">DESCRIPTION</A></LI> |
|
<UL> |
|
<LI><A NAME="toc2" HREF="#sect2">File Format</A></LI> |
|
<LI><A NAME="toc3" HREF="#sect3">Sense Key Encoding</A></LI> |
|
<LI><A NAME="toc4" HREF="#sect4">Synset Type</A></LI> |
|
</UL> |
|
<LI><A NAME="toc5" HREF="#sect5">NOTES</A></LI> |
|
<LI><A NAME="toc6" HREF="#sect6">ENVIRONMENT VARIABLES (UNIX)</A></LI> |
|
<LI><A NAME="toc7" HREF="#sect7">REGISTRY (WINDOWS)</A></LI> |
|
<LI><A NAME="toc8" HREF="#sect8">FILES</A></LI> |
|
<LI><A NAME="toc9" HREF="#sect9">SEE ALSO</A></LI> |
|
</UL> |
|
</BODY></HTML> |
|
|