  • Journal of Yunnan Minzu University (Natural Sciences Edition)

    2021, v.30;No.127(03) 264-271


    An Extraction Method for Chinese-Vietnamese Parallel Phrase Pairs with Contextual Semantic Information

    YANG Jian (楊艦); GAO Sheng-xiang (高盛祥); YU Zheng-tao (余正濤); ZHU Hao-dong (朱浩東); WEN Yong-hua (文永華)
    Faculty of Information Engineering and Automation, Kunming University of Science and Technology; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology

    Abstract:

    Vietnamese is a typical low-resource language, and Chinese-Vietnamese parallel corpora are scarce. However, websites such as Wikipedia and bilingual news sites contain large amounts of Chinese-Vietnamese comparable text. Extracting parallel phrase pairs from these comparable corpora can effectively alleviate the data sparsity problem in low-resource machine translation. Because contextual semantic information is important for extracting high-quality bilingual phrase pairs, this paper proposes a method for extracting Chinese-Vietnamese parallel phrase pairs that integrates contextual semantic information. First, Chinese and Vietnamese monolingual corpora are used to train Chinese and Vietnamese vector matrices. Next, an encoder is pre-trained: an attention mechanism combines sentence encoding information with phrase encoding information to generate monolingual phrase vectors that carry contextual semantic information. At the same time, parallel phrase pairs serve as constraints, so that the distance between Chinese and Vietnamese phrase vectors in the semantic space is minimized for parallel pairs and maximized for non-parallel pairs, which yields a bilingual Chinese-Vietnamese phrase vector representation. Finally, the pre-trained encoder is used to train a parallel phrase pair classifier. Experimental results show that the trained classifier reaches an accuracy of 75.62%. To assess the quality of the extracted parallel phrase pairs, they were added to the training corpus of a statistical machine translation (SMT) system, which improved by 0.93 BLEU over the baseline system.
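The two core ideas in the abstract, attention that mixes sentence context into a phrase vector and a distance constraint that pulls parallel pairs together while pushing non-parallel pairs apart, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the encoder architecture or the loss, so mean pooling with dot-product attention stands in for the paper's semi-supervised autoencoder, a standard margin-based contrastive loss stands in for its constraint, and all function names and toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def contextual_phrase_vector(sentence_vecs, start, end):
    """Combine a phrase encoding with attention-weighted sentence context.

    Assumption: the phrase encoding is the mean of its word vectors and the
    attention scores are dot products between the phrase and each word.
    """
    phrase = mean_vector(sentence_vecs[start:end])
    weights = softmax([dot(phrase, w) for w in sentence_vecs])
    context = [
        sum(a * w[i] for a, w in zip(weights, sentence_vecs))
        for i in range(len(phrase))
    ]
    return [p + c for p, c in zip(phrase, context)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(zh_vec, vi_vec, is_parallel, margin=1.0):
    """Margin-based contrastive loss (an assumed stand-in for the paper's constraint).

    Parallel pairs are penalized by their squared distance; non-parallel pairs
    are penalized only while they are closer than the margin.
    """
    d = euclidean(zh_vec, vi_vec)
    if is_parallel:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Toy 2-dimensional word vectors for one Chinese and one Vietnamese sentence.
zh_sentence = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.3]]
vi_sentence = [[0.2, 0.8], [0.9, 0.1], [0.6, 0.4]]

zh_phrase = contextual_phrase_vector(zh_sentence, 1, 3)
vi_phrase = contextual_phrase_vector(vi_sentence, 1, 3)

print(contrastive_loss(zh_phrase, vi_phrase, is_parallel=True))
print(contrastive_loss(zh_phrase, vi_phrase, is_parallel=False))
```

In a trained system the encoder parameters would be updated to reduce this loss over many labeled pairs, and the resulting phrase vectors would then feed the parallel phrase pair classifier; the sketch only shows how the quantities are computed for a single pair.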

    Keywords: contextual semantic information; semi-supervised autoencoder; parallel phrase pair extraction; Chinese-Vietnamese; comparable corpus


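    Foundation: Yunnan Provincial Major Science and Technology Special Plan (202002AD080001); Yunnan Provincial Basic Research Program (202001AS070014, 2018FB104); National Natural Science Foundation of China (61761026, 61972186, 61732005, 61762056); National Key Research and Development Program of China (2019QY1802, 2019QY1801, 2019QY1800); Yunnan High-Tech Talent Project (201606)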

