Unicode normalization with ICU (C++)

Python is very convenient for string processing, but when speed matters I would still rather use C/C++. So I tried ICU (International Components for Unicode) for Unicode normalization (http://homepage1.nifty.com/nomenclator/unicode/normalization.htm).

Installation

I was able to install libicu-dev from the Synaptic Package Manager.

Testing Unicode normalization

// Pass `icu-config --cppflags --ldflags` to g++.
#include <unicode/normlzr.h>

#include <iostream>

int main()
{
  // Convert from UTF-8 to ICU's internal encoding.
  icu::UnicodeString src("全角　ＡＢＣ", "utf-8");

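  // Normalize to NFKC.
  icu::UnicodeString dest;
  // status must start as U_ZERO_ERROR: ICU functions return immediately
  // if they are handed a status that already indicates failure.
  UErrorCode status = U_ZERO_ERROR;
  icu::Normalizer::normalize(src, UNORM_NFKC, 0, dest, status);

  // Check status to see whether normalization failed.
  if (U_FAILURE(status))
  {
    std::cerr << u_errorName(status) << std::endl;
    return 1;
  }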

  // Convert from the internal encoding back to UTF-8.
  // extract() returns the length of the output string.
  char buf[16];
  int32_t length = dest.extract(0, dest.length(), buf, sizeof(buf), "utf-8");
  if (static_cast<std::size_t>(length) < sizeof(buf))
  {
    // The converted result is "全角 ABC".
    std::cerr << buf << std::endl;
  }
  else
  {
    // The conversion succeeds if extract() is called again with a larger buffer.
    std::cerr << "error: requires at least " << (length + 1) << " bytes" << std::endl;
  }

  return 0;
}

ICU4C documentation

icu::UnicodeString::UnicodeString()
- http://icu-project.org/apiref/icu4c/classUnicodeString.html#3ab203e2943154d735bb7f8050958401
icu::Normalizer::normalize()
- http://icu-project.org/apiref/icu4c/classNormalizer.html#974f2663bb227b116d47a1e43bc84454
UErrorCode
- http://icu-project.org/apiref/icu4c/utypes_8h.html#863c11989634c998849cc946d04dfabe
icu::UnicodeString::extract()
- http://icu-project.org/apiref/icu4c/classUnicodeString.html#7490d79ee65e65a495269dc044bc2f80