トライの実験に使えるちょっとしたツール

トライを構築したときのノード数が分からない，TAIL を導入したときにサイズがどのくらい小さくなるのか分からない，そんな悩みに答えるちょっとしたツールのソースコードです．

各ノードのサイズとノード数が分かればトライのサイズは簡単に求まるので，トライについて少し試してみようという段階で便利かもしれません．

例えば，ダブル配列であればノード数 n に対して 4n bytes か 8n bytes が基本です．ただし，以下のツールは終端文字による遷移をカウントしないことに注意してください．後，TAIL の長さが m であれば，m bytes を追加することになります．

また，LOUDS Trie であれば，木の表現に 2n bits，ラベルの保存に n bytes，終端フラグの保存に n bits，後は rank/select の索引に…という具合で 1.5n bytes くらいになると思います．ux-trie のようにトライを二つ使う場合でも，同じようにして全体のサイズを推定できます．

構築されるトライのノード数を求める（trie）

整列済みのキー集合から構築されるトライのノード数を求めます．キー集合が整列されていないときは，LC_ALL=C sort -u keyset | ./trie などとしてやりましょう．

// trie.cc
#include <iostream>
#include <string>

int main() {
  std::ios_base::sync_with_stdio(false);
  std::size_t num_nodes = 1;
  std::string last_line;
  std::string line;
  while (std::getline(std::cin, line)) {
    if (line.empty()) {
      continue;
    }
    std::size_t match = 0;
    while ((match < line.length()) && (line[match] == last_line[match])) {
      ++match;
    }
    num_nodes += line.length() - match;
    last_line = line;
  }
  std::cerr << "nodes: " << num_nodes << std::endl;
  return 0;
}

TAIL 導入時のトライ部を取り出す（prefix）

TAIL を導入するとノード数はどのくらいになるのかを知りたいときに使えるツールです．整列済みのキー集合を入力すると，キーのトライ部に対応する部分が出力されます．./prefix < keyset | ./trie としてやれば，トライ部のノード数が分かります．

// prefix.cc
#include <iostream>
#include <string>

int main() {
  std::ios_base::sync_with_stdio(false);
  std::string last_line;
  std::size_t last_match = 0;
  std::getline(std::cin, last_line);
  std::string line;
  while (std::getline(std::cin, line)) {
    if (line.empty()) {
      continue;
    }
    std::size_t match = 0;
    while ((match < line.length()) && (line[match] == last_line[match])) {
      ++match;
    }
    if (match != last_line.length()) {
      last_line.resize(std::max(last_match, match) + 1);
    }
    std::cout << last_line << '\n';
    last_line = line;
    last_match = match;
  }
  last_line.resize(last_match + 1);
  std::cout << last_line << '\n';
  return 0;
}

TAIL 導入時の TAIL 部を取り出す（tail）

./prefix の逆で，キーの後半部分のみを出力します．空文字列は出力しないようになっています．./tail < keyset | wc としてやれば，TAIL 部の長さが分かります．

// tail.cc
#include <iostream>
#include <string>

int main() {
  std::ios_base::sync_with_stdio(false);
  std::string last_line;
  std::size_t last_match = 0;
  std::getline(std::cin, last_line);
  std::string line;
  while (std::getline(std::cin, line)) {
    if (line.empty()) {
      continue;
    }
    std::size_t match = 0;
    while ((match < line.length()) && (line[match] == last_line[match])) {
      ++match;
    }
    if (match < last_line.length()) {
      std::size_t max_match = std::max(last_match, match);
      if (max_match + 1 < last_line.length()) {
        std::cout << last_line.substr(max_match + 1) << '\n';
      }
    }
    last_line = line;
    last_match = match;
  }
  std::cout << last_line.substr(last_match + 1) << '\n';
  return 0;
}

TAIL 部をトライにしたときのノード数を求める（reverse）

TAIL を逆順にしてトライを構築するとノード数がどのくらいになるのかを調べるには，./tail の出力を逆順にしてから ./trie に渡します．コマンドにすると，./tail < keyset | ./reverse | LC_ALL=C sort -u | ./trie になります．

// reverse.cc
#include <algorithm>
#include <iostream>
#include <string>

int main() {
  std::ios_base::sync_with_stdio(false);
  std::string line;
  while (std::getline(std::cin, line)) {
    if (line.empty()) {
      continue;
    }
    std::reverse(line.begin(), line.end());
    std::cout << line << '\n';
  }
  return 0;
}

# トライを左側と右側に分けるのは，たしか Two-Trie だか Double Trie という名前が付いていたと思います．ux-trie（http://code.google.com/p/ux-trie/）はコレの実装に簡潔データ構造を導入することにより，とてもコンパクトな辞書を実現しています．

"A Trie Compaction Algorithm for a Large Set of Keys", Software - Practice and Experience, Vol. 24, No. 3, pp. 265-288. March 1994.
"A Method of Compressing Trie Structures", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 3, pp. 476-491, June 1996.