
Key Information Extraction of Forestry Text Based on TextRank and Clusters Filtering

Fund Project: National Natural Science Foundation of China (61772078) and Beijing Forestry University Hot Topic Tracking Project (2018BLRD18)

    Abstract:

    Two problems currently hinder the extraction of key information from forestry text. First, key information is considered mainly from the perspective of keywords, and the information types of words are neglected. Second, forestry text on the Internet has no unified descriptive structure, which makes the information types of words difficult to extract. To address these problems, a method for extracting key information from forestry text was proposed based on improved TextRank and cluster filtering, in which key information is represented as two parts, "keyword + information type". The method proceeds in six steps. First, text keywords are extracted according to a keyword extraction formula. Second, the keywords are vectorized with Word2Vec. Third, the TextRank algorithm is improved by constructing a graph model of the text that fuses word feature values and edge weights. Fourth, the stable graph structures obtained through iterative convergence are merged to form clusters, and a cluster quality formula covering three aspects (the uniformity of the element distribution, the size of the cluster, and the universality of the cluster) is used to filter the clusters. Fifth, TextRank is applied again to form the final cluster set. Finally, the final clusters are labeled with information types. For a test text, the information type of each keyword is obtained by comparing the keyword vector with the cluster center vectors using cosine similarity, and the keywords combined with their information types constitute the key information of the text. Experiments were conducted on 2000 forestry texts related to forestry policies and news. The final cluster set had a compactness of 0.9680, a separation of 0.0572, and an F1-measure of 0.8871, which indicates that the information types of the clusters can be clearly labeled. In addition, keywords were manually labeled for 400 of these texts, and the proposed keyword extraction method was compared with six algorithms, including TextRank and TF-IDF. The proposed method achieved better results in MRR, Bpref, accuracy, and F1-measure, which shows that it has advantages in extracting forestry text keywords.
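
The final step described in the abstract, giving each keyword the information type of its most similar cluster center, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the type labels ("policy", "location"), and the 3-dimensional toy vectors are all hypothetical stand-ins, and real Word2Vec vectors would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_information_types(keyword_vectors, cluster_centers):
    """Pair each keyword with the label of its most similar cluster center.

    keyword_vectors: dict mapping keyword -> vector (hypothetical input format)
    cluster_centers: dict mapping information-type label -> center vector
    """
    key_information = []
    for keyword, vector in keyword_vectors.items():
        best_type = max(
            cluster_centers,
            key=lambda label: cosine_similarity(vector, cluster_centers[label]),
        )
        key_information.append((keyword, best_type))
    return key_information


# Toy data, assumed purely for illustration
cluster_centers = {
    "policy": [0.9, 0.1, 0.0],
    "location": [0.0, 0.2, 0.9],
}
keyword_vectors = {
    "afforestation": [0.8, 0.3, 0.1],
    "Beijing": [0.1, 0.1, 0.7],
}

result = assign_information_types(keyword_vectors, cluster_centers)
print(result)  # [('afforestation', 'policy'), ('Beijing', 'location')]
```

Each output pair corresponds to the "keyword + information type" representation of key information. Picking the single most similar center is the simplest possible rule; the abstract does not say whether a similarity threshold is also applied.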

Cite this article

CHEN Zhibo, LI Yuman, XU Fu, FENG Guoming, SHI Dongyu, CUI Xiaohui. Key Information Extraction of Forestry Text Based on TextRank and Clusters Filtering[J]. Transactions of the Chinese Society for Agricultural Machinery, 2020, 51(5): 207-214, 172.

History
  • Received: 2019-12-31
  • Published online: 2020-05-10