<!-- manual page source format generated by PolyglotMan v3.0.3a12, -->
<!-- available via anonymous ftp from ftp.cs.berkeley.edu:/ucb/people/phelps/tcltk/rman.tar.Z -->
<HTML>
<HEAD>
<TITLE>SENSEIDX(5WN) manual page</TITLE>
</HEAD>
<BODY>
<A HREF="#toc">Table of Contents</A><P>
<H2><A NAME="sect0" HREF="#toc0">NAME </A></H2>
index.sense, sense.idx - WordNet's sense index
<H2><A NAME="sect1" HREF="#toc1">DESCRIPTION </A></H2>
The WordNet
sense index provides an alternate method for accessing synsets and word
senses in the WordNet database. It is useful to applications that retrieve
synsets or other information related to a specific sense in WordNet, rather
than all the senses of a word or collocation. It can also be used with
tools like <B>grep </B> and Perl to find all senses of a word in one or more
parts of speech. A specific WordNet sense, encoded as a <I>sense_key </I>, can
be used as an index into this file to obtain its WordNet sense number,
the database byte offset of the synset containing the sense, and the number
of times it has been tagged in the semantic concordance texts. <P>
Concatenating
the <I>lemma </I> and <I>lex_sense </I> fields of a semantically tagged word (represented
in a <B>&lt;wf&nbsp; </B>...&nbsp;<B>&gt; </B> attribute/value pair) in a semantic concordance file, using
<B>% </B> as the concatenation character, creates the <I>sense_key </I> for that sense,
which can in turn be used to search the sense index file. <P>
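The concatenation described above can be sketched in Python as follows. The function name <I>make_sense_key</I> and the sample values are illustrative assumptions, not part of WordNet; in practice the two fields would be read from a semantic concordance file. <P>

```python
def make_sense_key(lemma, lex_sense):
    """Join the lemma and lex_sense fields with '%' to form a sense_key."""
    return lemma + "%" + lex_sense


# Illustrative values only; not read from a real concordance file.
key = make_sense_key("dog", "1:05:00::")
print(key)  # dog%1:05:00::
```

The resulting string is the first field to look for on a line of the sense index file. <P>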
A <I>sense_key
</I> is the best way to represent a sense in semantic tagging or other systems
that refer to WordNet senses. <I>sense_key </I>s are independent of WordNet sense
numbers and <I>synset_offset </I>s, which vary between versions of the database.
Using the sense index and a <I>sense_key </I>, the corresponding synset (via
the <I>synset_offset </I>) and WordNet sense number can easily be obtained. A
mapping from noun <I>sense_key </I>s in WordNet 1.6 to corresponding 2.0 <I>sense_key
</I>s is provided with version 2.0, and is described in <B><A HREF="sensemap.5WN.html">sensemap</B>(5WN)</A>
. <P>
See
<B><A HREF="wndb.5WN.html">wndb</B>(5WN)</A>
for a thorough discussion of the WordNet database files.
<H3><A NAME="sect2" HREF="#toc2">File
Format </A></H3>
The sense index file lists all of the senses in the WordNet database
with each line representing one sense. The file is in alphabetical order,
fields are separated by one space, and each line is terminated with a
newline character. <P>
Each line is of the form: <P>
<blockquote><I>sense_key&nbsp;&nbsp;synset_offset&nbsp;&nbsp;sense_number&nbsp;&nbsp;tag_cnt
</I> </blockquote>
<P>
<I>sense_key </I> is an encoding of the word sense. Programs can construct
a sense key in this format and use it as a binary search key into the
sense index file. The format of a <I>sense_key </I> is described below. <P>
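A minimal sketch of such a binary search, using Python's standard <I>bisect</I> module on lines already held in memory. It assumes the file's alphabetical order matches plain byte-wise string comparison, which this page does not spell out; the helper name <I>find_sense</I> and the sample lines (including their offsets and counts) are made up for illustration. <P>

```python
from bisect import bisect_left


def find_sense(lines, sense_key):
    """Return the index line whose first field is sense_key, or None.

    `lines` must be sorted in the same order the comparison below uses.
    """
    # Every line is the sense_key followed by a space, so this probe sorts
    # immediately at or before the one line that can match.
    probe = sense_key + " "
    i = bisect_left(lines, probe)
    if i != len(lines) and lines[i].startswith(probe):
        return lines[i]
    return None


# Made-up sample lines; real offsets and tag counts will differ.
sample = sorted([
    "cat%1:05:00:: 00000200 1 5",
    "dog%1:05:00:: 00000100 1 9",
    "dog%2:38:00:: 00000300 1 0",
])
print(find_sense(sample, "dog%1:05:00::"))
print(find_sense(sample, "dog%1:05:01::"))
```

For the full file, the same idea can be applied by seeking within the file rather than loading every line, which is closer to what a library binary search routine would do. <P>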
<I>synset_offset
</I> is the byte offset at which the synset containing the sense is found in
the database "data" file corresponding to the part of speech encoded in
the <I>sense_key </I>. <I>synset_offset </I> is an 8 digit, zero-filled decimal integer,
and can be used with <B><A HREF="fseek.3.html">fseek</B>(3)</A>
to read a synset from the data file. When <I>synset_offset </I> is
passed to the WordNet library function <B>read_synset() </B> along with the syntactic
category, the function returns a data structure containing the parsed synset. <P>
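The seek-and-read step can be sketched in Python, whose file objects offer a <I>seek()</I> method comparable to fseek(3). This is only an illustration: an in-memory buffer stands in for a data file, the records are placeholders, and each record starts with its own offset purely so the result is easy to check. The parsing done by <B>read_synset() </B> is not reproduced here. <P>

```python
import io


def read_synset_line(data_file, synset_offset):
    """Seek to synset_offset in a binary file object and read one line."""
    # The offset arrives as an 8 digit, zero-filled decimal string.
    data_file.seek(int(synset_offset))
    return data_file.readline().decode("ascii").rstrip("\n")


# Stand-in for a data file: two placeholder records, one per line.
first = b"00000000 placeholder record one\n"
second = b"%08d placeholder record two\n" % len(first)
fake_data = io.BytesIO(first + second)

offset = "%08d" % len(first)
print(read_synset_line(fake_data, offset))
```

With a real installation, the file object would instead come from opening the data file for the part of speech encoded in the <I>sense_key </I> in binary mode. <P>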
<I>sense_number
</I> is a decimal integer indicating the sense number of the word, within
the part of speech encoded in <I>sense_key </I>, in the WordNet database. See
<B><A HREF="wndb.5WN.html">wndb</B>(5WN)</A>
for information about how sense numbers are assigned. <P>
<I>tag_cnt
</I> is a decimal integer giving the number of times the sense is tagged in various
semantic concordance texts. A <I>tag_cnt </I> of <B>0 </B> indicates that the sense
has not been semantically tagged.
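<P>
The four fields of a line can be pulled apart as sketched below. The names <I>SenseIndexEntry</I> and <I>parse_index_line</I> are hypothetical, and the sample line uses made-up numbers rather than values copied from the database. <P>

```python
from typing import NamedTuple


class SenseIndexEntry(NamedTuple):
    sense_key: str
    synset_offset: int
    sense_number: int
    tag_cnt: int


def parse_index_line(line):
    """Split one sense index line into its four space-separated fields."""
    sense_key, offset, number, count = line.rstrip("\n").split(" ")
    # int() accepts the zero-filled offset; use "%08d" to print it back.
    return SenseIndexEntry(sense_key, int(offset), int(number), int(count))


# Made-up sample line for illustration.
entry = parse_index_line("dog%1:05:00:: 00000100 1 9\n")
print(entry.sense_key, entry.synset_offset, entry.sense_number, entry.tag_cnt)
```

Unpacking into exactly four names makes a malformed line fail with an error instead of being silently misread.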
<H3><A NAME="sect3" HREF="#toc3">Sense Key Encoding </A></H3>
A <I>sense_key </I> is represented
as: <P>
<blockquote><I>lemma </I><B>% </B><I>lex_sense </I> </blockquote>
<P>
where <I>lex_sense </I> is encoded as: <P>
<blockquote><I>ss_type</I><B>:</B><I>lex_filenum</I><B>:</B><I>lex_id</I><B>:</B><I>head_word</I><B>:</B><I>head_id
</I> </blockquote>
<P>
<I>lemma </I> is the ASCII text of the word or collocation as found in the
WordNet database index file corresponding to the part of speech (<I>pos </I>) of the sense. <I>lemma </I> is in lower case,
and collocations are formed by joining individual words with an underscore
(<B>_ </B>) character. <P>
<I>ss_type </I> is a one digit decimal integer representing the
synset type for the sense. See <FONT SIZE=-1><B>Synset Type </B></FONT>
below for a listing of the
numbers corresponding to each synset type. <P>
<I>lex_filenum </I> is a two digit
decimal integer representing the name of the lexicographer file containing
the synset for the sense. See <B><A HREF="lexnames.5WN.html">lexnames</B>(5WN)</A>
for the list of lexicographer
file names and their corresponding numbers. <P>
<I>lex_id </I> is a two digit decimal
integer that, when appended onto <I>lemma </I>, uniquely identifies a sense within
a lexicographer file. <I>lex_id </I> numbers usually start with <B>00 </B>, and are incremented
as additional senses of the word are added to the same file, although
there is no requirement that the numbers be consecutive or begin with
<B>00 </B>. Note that a value of <B>00 </B> is the default, and therefore is not present
in lexicographer files. Only non-default <I>lex_id </I> values must be explicitly
assigned in lexicographer files. See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A>
for information on the
format of lexicographer files. <P>
<I>head_word </I> is only present if the sense
is in an adjective satellite synset. It is the lemma of the first word
of the satellite's head synset. <P>
<I>head_id </I> is a two digit decimal integer
that, when appended onto <I>head_word </I>, uniquely identifies the sense of
<I>head_word </I> within a lexicographer file, as described for <I>lex_id </I>. There
is a value in this field only if <I>head_word </I> is present.
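<P>
The encoding above can be decoded with a few string splits, as in this sketch. The names <I>SenseKey</I> and <I>parse_sense_key</I> are hypothetical, the sample keys are illustrative rather than looked up in the database, and the code assumes that no field contains a colon and that the lemma is everything before the last percent sign. <P>

```python
from typing import NamedTuple, Optional


class SenseKey(NamedTuple):
    lemma: str
    ss_type: int
    lex_filenum: int
    lex_id: int
    head_word: Optional[str]
    head_id: Optional[int]


def parse_sense_key(sense_key):
    """Decode lemma%ss_type:lex_filenum:lex_id:head_word:head_id."""
    # Assumption: the last '%' separates lemma from lex_sense.
    lemma, lex_sense = sense_key.rsplit("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return SenseKey(
        lemma,
        int(ss_type),
        int(lex_filenum),
        int(lex_id),
        head_word or None,                 # empty unless a satellite sense
        int(head_id) if head_id else None,
    )


# Illustrative keys only.
print(parse_sense_key("dog%1:05:00::"))
print(parse_sense_key("tiny%5:00:00:small:00"))
```

The two trailing fields come back as None when they are empty, which keeps an absent <I>head_id </I> distinct from a <I>head_id </I> of <B>00 </B>.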
<H3><A NAME="sect4" HREF="#toc4">Synset Type </A></H3>
The
synset type is encoded as follows: <P>
<blockquote><B>1 </B><tt> </tt>&nbsp;<tt> </tt>&nbsp;NOUN <BR>
<B>2 </B><tt> </tt>&nbsp;<tt> </tt>&nbsp;VERB <BR>
<B>3 </B><tt> </tt>&nbsp;<tt> </tt>&nbsp;ADJECTIVE <BR>
<B>4 </B><tt> </tt>&nbsp;<tt> </tt>&nbsp;ADVERB
<BR>
<B>5 </B><tt> </tt>&nbsp;<tt> </tt>&nbsp;ADJECTIVE SATELLITE <BR>
</blockquote>
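As a small illustration, the table above can be held in a lookup dictionary and applied to the first character of <I>lex_sense </I>. The names <I>SS_TYPE_NAMES</I> and <I>synset_type_of</I> are assumptions made for this sketch, and the sample keys are illustrative. <P>

```python
# Synset type numbers as listed in the table above.
SS_TYPE_NAMES = {
    1: "NOUN",
    2: "VERB",
    3: "ADJECTIVE",
    4: "ADVERB",
    5: "ADJECTIVE SATELLITE",
}


def synset_type_of(sense_key):
    """Return the synset type name encoded in a sense_key."""
    lex_sense = sense_key.rsplit("%", 1)[1]
    # ss_type is a single digit, so it is the first character of lex_sense.
    return SS_TYPE_NAMES[int(lex_sense[0])]


print(synset_type_of("dog%1:05:00::"))          # NOUN
print(synset_type_of("tiny%5:00:00:small:00"))  # ADJECTIVE SATELLITE
```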
<H2><A NAME="sect5" HREF="#toc5">NOTES </A></H2>
For non-satellite senses the <I>head_word
</I> and <I>head_id </I> fields have no values; however, the field separator character
(<B>: </B>) is still present.
<H2><A NAME="sect6" HREF="#toc6">ENVIRONMENT VARIABLES (UNIX) </A></H2>
<DL>
<DT><B>WNHOME</B> </DT>
<DD>Base directory
for WordNet. Default is <B>/usr/local/WordNet-3.0 </B>. </DD>
<DT><B>WNSEARCHDIR</B> </DT>
<DD>Directory in
which the WordNet database has been installed. Default is <B>WNHOME/dict
</B>. </DD>
</DL>
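A program that wants to locate the sense index itself could apply the documented defaults as sketched below. This mirrors only the defaults listed above; it is not a description of how the WordNet library resolves paths, and the function name <I>sense_index_path</I> is hypothetical. <P>

```python
import os


def sense_index_path(environ=os.environ):
    """Build the path to index.sense from WNHOME and WNSEARCHDIR."""
    wnhome = environ.get("WNHOME", "/usr/local/WordNet-3.0")
    # WNSEARCHDIR, when set, takes the place of WNHOME/dict.
    search_dir = environ.get("WNSEARCHDIR", wnhome + "/dict")
    return search_dir + "/index.sense"


print(sense_index_path({}))  # /usr/local/WordNet-3.0/dict/index.sense
```

Passing a plain dictionary in place of the real environment makes the fallback order easy to check without changing any variables. <P>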
<H2><A NAME="sect7" HREF="#toc7">REGISTRY (WINDOWS) </A></H2>
<DL>
<DT><B>HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome</B> </DT>
<DD>Base directory
for WordNet. Default is <B>C:\Program&nbsp;Files\WordNet\3.0 </B>. </DD>
</DL>
<H2><A NAME="sect8" HREF="#toc8">FILES </A></H2>
<DL>
<DT><B>index.sense</B> </DT>
<DD>sense
index </DD>
</DL>
<H2><A NAME="sect9" HREF="#toc9">SEE ALSO </A></H2>
<B><A HREF="binsrch.3WN.html">binsrch</B>(3WN)</A>
, <B><A HREF="wnsearch.3WN.html">wnsearch</B>(3WN)</A>
, <B><A HREF="lexnames.5WN.html">lexnames</B>(5WN)</A>
, <B><A HREF="wnintro.5WN.html">wnintro</B>(5WN)</A>
,
<B><A HREF="sensemap.5WN.html">sensemap</B>(5WN)</A>
, <B><A HREF="wndb.5WN.html">wndb</B>(5WN)</A>
, <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A>
. <P>
<HR><P>
<A NAME="toc"><B>Table of Contents</B></A><P>
<UL>
<LI><A NAME="toc0" HREF="#sect0">NAME</A></LI>
<LI><A NAME="toc1" HREF="#sect1">DESCRIPTION</A></LI>
<UL>
<LI><A NAME="toc2" HREF="#sect2">File Format</A></LI>
<LI><A NAME="toc3" HREF="#sect3">Sense Key Encoding</A></LI>
<LI><A NAME="toc4" HREF="#sect4">Synset Type</A></LI>
</UL>
<LI><A NAME="toc5" HREF="#sect5">NOTES</A></LI>
<LI><A NAME="toc6" HREF="#sect6">ENVIRONMENT VARIABLES (UNIX)</A></LI>
<LI><A NAME="toc7" HREF="#sect7">REGISTRY (WINDOWS)</A></LI>
<LI><A NAME="toc8" HREF="#sect8">FILES</A></LI>
<LI><A NAME="toc9" HREF="#sect9">SEE ALSO</A></LI>
</UL>
</BODY></HTML>