Synonym Search in Lucene Using the WordNet Synonym Dictionary (C# Version) | 隔叶黄莺 Yanbin Blog

2010-07-14

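Synonym search comes in handy quite often. A simple example: when we search for the keyword good, entries containing well or fine may also be what we want. Rather than building our own synonym database, we use WordNet's directly. This post walks through the C# implementation; a sequel will cover the Java version.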

I. Get and build Lucene.Net and WordNet.Net

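Because Lucene originated in Java, C# users are not as well served as Java users. The Java version already has 3.0.2 available for download, while for C# the latest 2.9.2 source has to be checked out from the SVN repository at https://svn.apache.org/repos/asf/lucene/lucene.net/tags/Lucene.Net_2_9_2/, and the binary package is still at 2.0.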

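Next, build it with Visual Studio; I won't go into detail. Just note that the contrib directory contains a WordNet.Net solution, which is what we need. Building WordNet.Net produces three executables: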

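1. Syns2Index.exe builds a synonym index from the WordNet synonym database; the synonyms themselves are then looked up through Lucene.
2. SynLookup.exe looks up the synonyms of a word in the synonym index.
3. SynExpand.exe is much like SynLookup, except that it adds a weight to each synonym, presumably the degree of synonymy.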

With Lucene.Net.dll and the three executables above in hand, here are the remaining steps:

II. Download the WordNet synonym database

Download WNprolog-3.0.tar.gz from http://wordnetcode.princeton.edu/3.0/ and extract it to a directory such as D:\WNprolog-3.0. Its prolog subdirectory contains many .pl files; the one used below is wn_s.pl.
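
For orientation, wn_s.pl is a list of Prolog facts of the form s(synset_id, w_num, 'word', ss_type, sense_number, tag_count), and words that share a synset_id are synonyms. The sketch below is not the Syns2Index source. It is only a minimal illustration of how such lines could be grouped, and the class name WnProlog is made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Lucene.Net or WordNet.Net.
public static class WnProlog
{
    // Matches facts such as: s(100001740,1,'entity',n,1,11).
    private static readonly Regex SenseLine =
        new Regex(@"^s\((\d+),\d+,'(.*)',[a-z],\d+,\d+\)\.$");

    // Groups words by synset id; words under the same id are synonyms.
    public static Dictionary<string, List<string>> GroupBySynset(IEnumerable<string> lines)
    {
        var groups = new Dictionary<string, List<string>>();
        foreach (string line in lines)
        {
            Match m = SenseLine.Match(line.Trim());
            if (!m.Success)
                continue; // skip anything that is not an s(...) fact

            string id = m.Groups[1].Value;
            // Prolog escapes a single quote inside a quoted atom by doubling it.
            string word = m.Groups[2].Value.Replace("''", "'").ToLowerInvariant();

            List<string> words;
            if (!groups.TryGetValue(id, out words))
            {
                words = new List<string>();
                groups[id] = words;
            }
            words.Add(word);
        }
        return groups;
    }
}
```

Feeding it the lines of wn_s.pl (for example via File.ReadAllLines) would yield one word list per synset; a real indexer then still has to decide which words to keep, which this sketch does not attempt.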

III. Build the Lucene synonym index

Run the command

Syns2Index.exe d:\WNprolog-3.0\prolog\wn_s.pl syn_index

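The second argument is the directory for the generated index; the tool creates it for you. The run takes about 40 seconds when all goes well. It may also fail outright, with Syns2Index.exe reporting the following error: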

Unhandled Exception: System.ArgumentException: maxBufferedDocs must at least be 2 when enabled
at Lucene.Net.Index.IndexWriter.SetMaxBufferedDocs(Int32 maxBufferedDocs)
at WorldNet.Net.Syns2Index.Index(String indexDir, IDictionary word2Nums, IDictionary num2Words)
at WorldNet.Net.Syns2Index.Main(String[] args)

Don't worry: with the source code at hand there is no need to panic. Open the Syns2Index project and, in Syns2Index.cs, change

writer.SetMaxBufferedDocs(writer.GetMaxBufferedDocs() * 2); // GetMaxBufferedDocs() is 0 to begin with, so multiplying it gets nowhere

to

writer.SetMaxBufferedDocs(100); // so simply set it to 100, or to any value of at least 2

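Rebuild Syns2Index.exe and run the previous command again. Once it succeeds, you will see a new index directory, syn_index, of about 3 MB.

The other two commands can now be used to test the index: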

D:\wordnet>SynLookup.exe syn_index hi
Synonyms found for "hi":
hawaii
hello
howdy
hullo
D:\wordnet>SynExpand.exe syn_index hi
Query: hi hawaii^0.9 hello^0.9 howdy^0.9 hullo^0.9

You can also inspect the index with Luke - Lucene Index ToolBox. It has two fields, syn and word; searching for word:hi finds syn:hawaii hello howdy hullo.
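
To make the SynExpand output above more concrete, here is a small sketch of how a word and its synonyms could be turned into a boosted query string of the same shape. It is not the SynExpand source (SynExpand builds the query through Lucene); the class name QueryExpansion and its parameters are assumptions for illustration only.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of Lucene.Net or WordNet.Net.
public static class QueryExpansion
{
    // Appends every synonym to the original word with a "^boost" suffix,
    // using Lucene query syntax for term boosts.
    public static string Expand(string word, IEnumerable<string> synonyms, float boost)
    {
        var sb = new StringBuilder(word);
        // Invariant culture so the decimal separator is always a dot.
        string suffix = "^" + boost.ToString("0.0#", CultureInfo.InvariantCulture);
        foreach (string syn in synonyms)
        {
            if (syn == word)
                continue; // the original word is already in the query
            sb.Append(' ').Append(syn).Append(suffix);
        }
        return sb.ToString();
    }
}
```

Calling QueryExpansion.Expand("hi", new[] { "hawaii", "hello", "howdy", "hullo" }, 0.9f) returns "hi hawaii^0.9 hello^0.9 howdy^0.9 hullo^0.9", which can then be handed to a QueryParser.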

IV. Search with a synonym analyzer and filter

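By comparison, Java programmers have it much easier: there is a ready-made lucene-wordnet-3.0.2.jar with ready-to-use code in it. In C#, the analyzer and the filter have to be written by hand. Perhaps I have wandered onto a side road, but it is not a rough one.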

I won't describe each small step; here is the code, which should speak for itself:

Synonym engine interface

using System.Collections.Generic;

namespace Com.Unmi.Searching
{
    /// <summary>
    /// Looks up the synonyms of a single word.
    /// </summary>
    public interface ISynonymEngine
    {
        IEnumerable<string> GetSynonyms(string word);
    }
}

Synonym engine implementation

using System.IO;
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

using LuceneDirectory = Lucene.Net.Store.Directory;
using Version = Lucene.Net.Util.Version;

namespace Com.Unmi.Searching
{
    /// <summary>
    /// ISynonymEngine backed by the synonym index that Syns2Index builds from WordNet.
    /// </summary>
    public class WordNetSynonymEngine : ISynonymEngine
    {

        private IndexSearcher searcher;
        private Analyzer analyzer = new StandardAnalyzer();

        // syn_index_directory is the synonym index directory generated earlier with Syns2Index
        public WordNetSynonymEngine(string syn_index_directory)
        {

            LuceneDirectory indexDir = FSDirectory.Open(new DirectoryInfo(syn_index_directory));
            searcher = new IndexSearcher(indexDir, true);
        }

        public IEnumerable<string> GetSynonyms(string word)
        {
            QueryParser parser = new QueryParser(Version.LUCENE_29, "word", analyzer);
            Query query = parser.Parse(word);
            Hits hits = searcher.Search(query);

            // collects the value of every "syn" field from the matching documents
            List<string> Synonyms = new List<string>();

            for (int i = 0; i < hits.Length(); i++)
            {
                Field[] fields = hits.Doc(i).GetFields("syn");
                foreach (Field field in fields)
                {
                    Synonyms.Add(field.StringValue());
                }
            }

            return Synonyms;
        }
    }
}

The filter, which the analyzer below relies on

using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;

namespace Com.Unmi.Searching
{
    /// <summary>
    /// Token filter that emits each token's synonyms at the same position as the token itself.
    /// </summary>
    public class SynonymFilter : TokenFilter
    {
        private Queue<Token> synonymTokenQueue = new Queue<Token>();

        public ISynonymEngine SynonymEngine { get; private set; }

        public SynonymFilter(TokenStream input, ISynonymEngine synonymEngine)
            : base(input)
        {
            if (synonymEngine == null)
                throw new ArgumentNullException("synonymEngine");

            SynonymEngine = synonymEngine;
        }

        public override Token Next()
        {
            // if our synonymTokens queue contains any tokens, return the next one.
            if (synonymTokenQueue.Count > 0)
            {
                return synonymTokenQueue.Dequeue();
            }

            //get the next token from the input stream
            Token token = input.Next();

            //if the token is null, then it is the end of stream, so return null
            if (token == null)
                return null;

            //retrieve the synonyms
            IEnumerable<string> synonyms = SynonymEngine.GetSynonyms(token.TermText());

            //if we don't have any synonyms just return the token
            if (synonyms == null)
            {
                return token;
            }

            //if we do have synonyms, add them to the synonymQueue,
            // and then return the original token
            foreach (string syn in synonyms)
            {
                //make sure we don't add the same word
                if (!token.TermText().Equals(syn))
                {
                    //create the synonymToken
                    Token synToken = new Token(syn, token.StartOffset(),
                              token.EndOffset(), "<SYNONYM>");

                    // set the position increment to zero
                    // this tells lucene the synonym is
                    // in the exact same location as the originating word
                    synToken.SetPositionIncrement(0);

                    //add the synToken to the synonyms queue
                    synonymTokenQueue.Enqueue(synToken);
                }
            }

            //after adding the syn to the queue, return the original token
            return token;
        }
    }
}
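
The core idea of the filter, emitting the original token with a position increment of 1 and each synonym with an increment of 0, can be tried out without Lucene. The sketch below is a simplified stand-in that builds a list instead of draining a queue one token at a time; SimpleToken and SynonymInjection are made-up names, not Lucene.Net types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for Lucene's Token: just a term and its position increment.
public struct SimpleToken
{
    public readonly string Term;
    public readonly int PositionIncrement;

    public SimpleToken(string term, int positionIncrement)
    {
        Term = term;
        PositionIncrement = positionIncrement;
    }
}

public static class SynonymInjection
{
    // Emits every input term with increment 1, followed by its synonyms with
    // increment 0, so the synonyms occupy the same position as the original term.
    public static List<SimpleToken> Inject(
        IEnumerable<string> terms,
        Func<string, IEnumerable<string>> lookup)
    {
        var output = new List<SimpleToken>();
        foreach (string term in terms)
        {
            output.Add(new SimpleToken(term, 1));

            IEnumerable<string> synonyms = lookup(term);
            if (synonyms == null)
                continue; // no synonyms known for this term

            foreach (string syn in synonyms)
            {
                // never add the original word again as its own synonym
                if (!term.Equals(syn))
                    output.Add(new SimpleToken(syn, 0));
            }
        }
        return output;
    }
}
```

With a lookup that maps "ok" to "fine" and "ok", the input "ok", "then" produces ok (1), fine (0), then (1): a phrase or term query for fine at that position would therefore also match text that contained ok.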

The analyzer, which chains several filters; the most important one is the synonym filter defined above

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

namespace Com.Unmi.Searching
{
    public class SynonymAnalyzer : Analyzer
    {
        public ISynonymEngine SynonymEngine { get; private set; }

        public SynonymAnalyzer(ISynonymEngine engine)
        {
            SynonymEngine = engine;
        }

        public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
        {
            //create the tokenizer
            TokenStream result = new StandardTokenizer(reader);

            //add in filters
            // first normalize the StandardTokenizer
            result = new StandardFilter(result);

            // makes sure everything is lower case
            result = new LowerCaseFilter(result);

            // use the default list of Stop Words, provided by the StopAnalyzer class.
            result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);

            // injects the synonyms.
            result = new SynonymFilter(result, SynonymEngine);

            //return the built token stream.
            return result;
        }
    }
}

Finally, put the synonym engine, filter, and analyzer above to use

using System.IO;
using System.Web;
using Lucene.Net.Index;
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Search;
using Lucene.Net.QueryParsers;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;
using System.Collections;
using Lucene.Net.Highlight;

using LuceneDirectory = Lucene.Net.Store.Directory;

namespace Com.Unmi.Searching
{
    public class Searcher
    {
        /// <summary>
        /// Assumes the synonym index created earlier is in d:\indexes\syn_index and the
        /// content index to search is in d:\indexes\file_index, with two fields, file and content.
        /// IndexEntry is a search result class you write yourself, with two properties, file and fragment.
        /// </summary>
        /// <param name="queryString">the query text to search for</param>
        public static List<IndexEntry> Search(string queryString)
        {
            //Now SynonymAnalyzer
            ISynonymEngine synonymEngine = new WordNetSynonymEngine(@"d:\indexes\syn_index");
            Analyzer analyzer = new SynonymAnalyzer(synonymEngine);

            LuceneDirectory indexDir = FSDirectory.Open(new DirectoryInfo(@"d:\indexes\file_index"));
            IndexSearcher searcher = new IndexSearcher(indexDir, true);

            QueryParser parser = new QueryParser(Version.LUCENE_29,"content", analyzer);

            Query query = parser.Parse(queryString);

            Hits hits = searcher.Search(query);

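            // the return type is a list of IndexEntry, which has two properties, file and fragment
            List<IndexEntry> entries = new List<IndexEntry>();

            // this also uses another Lucene helper from contrib, the highlighter, to highlight the search keywords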
            SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<span style='background-color:#23dc23;color:white'>", "</span>");
            Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query)); 

            highlighter.SetTextFragmenter(new SimpleFragmenter(256));
            highlighter.SetMaxDocBytesToAnalyze(int.MaxValue);

            Analyzer standAnalyzer = new StandardAnalyzer();

            for (int i = 0; i < hits.Length(); i++)
            {
                Document doc = hits.Doc(i);

                // Note: never use the SynonymAnalyzer instance from above here,
                // otherwise it ends up in a dreadful series of loops
                string fragment = highlighter.GetBestFragment(standAnalyzer/*analyzer*/, "content", doc.Get("content"));

                IndexEntry entry = new IndexEntry(doc.Get("file"), fragment);
                entries.Add(entry);
            }

            return entries;
        }
    }
}
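
The listing above leaves IndexEntry to the reader. A minimal sketch that satisfies the two-argument constructor used there could look like this; the property names File and Fragment are assumptions based on the "file and fragment" description, not code from the original post.

```csharp
namespace Com.Unmi.Searching
{
    // Hypothetical result class: one hit, with the file it came from
    // and the highlighted text fragment to display.
    public class IndexEntry
    {
        public string File { get; private set; }
        public string Fragment { get; private set; }

        public IndexEntry(string file, string fragment)
        {
            File = file;
            Fragment = fragment;
        }
    }
}
```

A caller can then loop over the list returned by Searcher.Search and render entry.File next to entry.Fragment, keeping in mind that the fragment already contains the highlighter's HTML span tags.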

V. See synonym search in action

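After that long stretch of code I don't know how many readers have made it this far, so it is time for something more tangible. Here is the result: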

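Searching for ok also retrieves entries containing fine, because fine is a synonym of ok; results matching any other synonyms would be shown as well.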

References:

e-Using the sandbox wordnet to build a synonym index
http://www.chencer.com/techno/java/lucene/wordnet.html
lucene connector » org.apache.lucene.wordnet
Lucene.Net – Custom Synonym Analyzer (this post draws heavily on it)
Lucene in Action notes: the analysis chapter

Permalink: https://yanbin.blog/lucene-wordnet-synonym-search-csharp/, from 隔叶黄莺 Yanbin Blog
