A Few Notes on Simplified/Traditional Chinese Conversion with OpenCC

- November 30, 2021

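I used to do Simplified/Traditional Chinese conversion with ConvertZ. But for one thing, I no longer know where I put that application, and for another, the examples on websites these days all use OpenCC. I did not want any hassle, so I followed the crowd, and ended up with some hassle anyway.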

Working environment:
    Windows 10
    Anaconda
    Python 3 (3.6, 3.8)
        opencc, hanlp
    MS Visual Studio Build Tools + VC
        opencc v1.1.3

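To sum up, using OpenCC on Windows roughly means: install VC, download and compile OpenCC, edit the config files, copy the config and dictionary files into the project folder, and then start using OpenCC.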

OpenCC is currently hosted at https://github.com/BYVoid/OpenCC. There is no precompiled executable there, so if you want one, you have to compile it yourself.

If you only want to use it from Python, installation is simple:

pip install opencc

Then:

import opencc
converter = opencc.OpenCC('s2t.json')
converter.convert('汉字')  # 漢字

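Fortunately, some time ago I had to run the following to suppress a warning message produced by Python's gensim:

pip install python-Levenshtein

Installing that package had already put the required build environment, VC (v142), on the machine.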

Because I wanted to process data with the command-line form used in the examples, I compiled OpenCC myself, following the OpenCC README:

git clone https://github.com/BYVoid/OpenCC.git
cd OpenCC
build.cmd
test.cmd

After that, just look for the executables in build/bin.
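If the executables are not exactly where expected, a small search can help. The following is a minimal sketch; the helper name `find_executables` is hypothetical, and it simply assumes the build produced `.exe` files somewhere under the build folder.

```python
from pathlib import Path


def find_executables(build_dir):
    """Return the sorted paths of all .exe files found under build_dir."""
    root = Path(build_dir)
    # rglob walks the folder recursively, so nested output folders are covered.
    return sorted(p for p in root.rglob("*.exe") if p.is_file())


# Example call, assuming the current folder is the OpenCC checkout:
for exe in find_executables("build"):
    print(exe)
```

If the folder does not exist, the loop prints nothing rather than raising an error.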

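OpenCC's data folder contains config and dictionary, both of which are used during conversion. To use them, copy them into the project folder, otherwise OpenCC will not work. The config file also needs a small change: I could not find the ocd files it asks for, but fortunately the txt files are there.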

In s2tw.json: change type: ocd to type: text, and change file: *****.ocd to file: *****.txt

Then copy the corresponding txt files (that is, the *****.txt files named in the config below) into the project folder.

{
  "name": "Simplified Chinese to Traditional Chinese (Taiwan standard)",
  "segmentation": {
    "type": "mmseg",
    "dict": {
      "type": "text",
      "file": "STPhrases.txt"
    }
  },
  "conversion_chain": [{
    "dict": {
      "type": "group",
      "dicts": [{
        "type": "text",
        "file": "STPhrases.txt"
      }, {
        "type": "text",
        "file": "STCharacters.txt"
      }]
    }
  }, {
    "dict": {
      "type": "text",
      "file": "TWVariants.txt"
    }
  }]
}
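The manual edit described above can also be scripted. This is only a sketch under stated assumptions: the function name `use_text_dicts` is hypothetical, and it assumes binary dictionaries are marked with type "ocd" (as in this post) or "ocd2". It walks the config, switches those entries to text, and collects the txt file names that need to be copied.

```python
import json
from pathlib import PurePosixPath

# Assumption: binary dictionary entries use one of these type names.
BINARY_TYPES = {"ocd", "ocd2"}


def use_text_dicts(node, needed):
    """Return a copy of node with binary dict entries switched to text.

    Every text dictionary file name seen is appended to needed (no duplicates).
    """
    if isinstance(node, list):
        return [use_text_dicts(item, needed) for item in node]
    if not isinstance(node, dict):
        return node
    out = {key: use_text_dicts(value, needed) for key, value in node.items()}
    if out.get("type") in BINARY_TYPES and "file" in out:
        out["type"] = "text"
        out["file"] = str(PurePosixPath(out["file"]).with_suffix(".txt"))
    if out.get("type") == "text" and "file" in out and out["file"] not in needed:
        needed.append(out["file"])
    return out


# Small demonstration on a cut-down config.
config = {
    "segmentation": {"type": "mmseg", "dict": {"type": "ocd", "file": "STPhrases.ocd"}},
    "conversion_chain": [{"dict": {"type": "ocd", "file": "TWVariants.ocd"}}],
}
needed = []
print(json.dumps(use_text_dicts(config, needed), indent=2))
print(needed)  # the txt files to copy into the project folder
```

To apply it to a real file, load s2tw.json with json.load, pass the result through the function, and write it back with json.dump. The collected names can then be copied from the dictionary folder with shutil.copy.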

It took nearly 111 minutes to fetch the Wikipedia data and turn it into a text file. The laptop got so hot that I had to point a fan at it to cool it down XD. INFO : finished iterating over Wikipedia corpus of 411713 documents with 95495916 positions (total 3881077 articles, 112215074 positions before pruning articles shorter than 50 words)

Converting the processed data to Traditional Chinese with opencc then took another 27 minutes for this 144873 KB file.

opencc -i wiki_texts.txt -o wiki_zh_tw.txt -c s2tw.json
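For anyone who only has the Python package installed, the same job could probably be done without the compiled executable. Here is a rough sketch: `convert_file` is a hypothetical helper that takes any string-to-string function, so nothing in it depends on OpenCC itself. Using it with `opencc.OpenCC('s2tw.json').convert` assumes the pip package accepts that config name the same way it accepts 's2t.json' above.

```python
from typing import Callable


def convert_file(src: str, dst: str, convert: Callable[[str], str]) -> int:
    """Convert src into dst one line at a time; return the number of lines."""
    count = 0
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            text = line.rstrip("\n")
            fout.write(convert(text))
            fout.write(line[len(text):])  # restore the line ending, if any
            count += 1
    return count


# Usage sketch (needs `pip install opencc`; file names taken from the command above):
#   import opencc
#   converter = opencc.OpenCC("s2tw.json")
#   convert_file("wiki_texts.txt", "wiki_zh_tw.txt", converter.convert)
```

Reading line by line keeps memory use flat for a large dump, and creating the converter once avoids reloading the dictionaries for every line.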

Only after compiling opencc did I discover that HanLP has Simplified/Traditional conversion built in, something like HanLP.s2t(text). So much hassle for nothing~~

That's a wrap!
