Thai Tokenization

  • This is a GUI for Thai tokenization using the TLTK package.
  • Python 3 and TLTK are required.
  • The latest version of TLTK can do word segmentation, POS tagging, utterance segmentation, and named entity tagging. To install the Thai tokenization GUI on your computer, follow these steps:
  • 1) Install Python 3. We recommend installing “Anaconda3”, which installs both Python and other useful packages.
  • 2) At the Anaconda prompt, type “pip install tltk” to install TLTK.
  • 3) Download the GUI script for tokenization <> into your working directory.
  • 4) Go to your working directory and run this script from the command line using python.

        If you prefer to do tokenization online, click <here>

  • An earlier version of the Thai word segmentation program, which includes POS tagging, was released as version 3.1 <Windows (64-bit)>
  • The input file must be plain text in UTF-8 encoding.
  • POS tagging is done using nltk.tag.perceptron, trained on 150,000 POS-tagged words.
  • The word segmentation program is now a function in the TLTK module for Python (PyPI).
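
The UTF-8 requirement and the TLTK function mentioned above can be combined in a short script. The following is a minimal sketch, not part of the GUI: the helper names read_utf8_text and segment_file are hypothetical, and the call tltk.nlp.word_segment is assumed from the TLTK documentation on PyPI, so check it against your installed version.

```python
from pathlib import Path


def read_utf8_text(path):
    """Return the file's text, or raise ValueError if it is not valid UTF-8."""
    raw = Path(path).read_bytes()
    try:
        # "utf-8-sig" also accepts files that start with a byte order mark
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError as err:
        raise ValueError(f"{path} is not UTF-8 encoded: {err}") from err


def segment_file(path):
    """Segment the words of a UTF-8 text file with TLTK (must be installed)."""
    import tltk  # imported lazily so read_utf8_text works without TLTK

    text = read_utf8_text(path)
    # Assumed call: word_segment as described in the TLTK documentation
    return tltk.nlp.word_segment(text)
```

Checking the encoding first gives a clear error for files saved in a legacy Thai encoding such as TIS-620, instead of producing garbled output.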

  • For online service, click <here>

  • Syllable segmentation is done by applying Thai syllable rules. Segmentation ambiguities are resolved by using a trigram model of syllables, trained with a corpus of 630,000 syllables from a newspaper.
  • Word segmentation is performed using a maximum collocation approach (see the paper "Collocation and Thai Word Segmentation", submitted to the SNLP-COCOSDA2002 conference).
  • The dictionary used in the program is adapted from the Royal Institute Dictionary, made available by LINKS, with some obsolete words removed. There is no routine yet to handle proper names or abbreviations directly, so segmentation of sentences containing a proper name may be incorrect.
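
To illustrate how a syllable trigram model can choose between competing segmentations, here is a toy sketch. It is not the program's actual model: the three-sentence corpus, the add-one smoothing, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def _pad(syllables):
    # Two start markers so the first syllable also has a two-syllable context
    return ["<s>", "<s>"] + list(syllables) + ["</s>"]


def train_trigrams(corpus):
    """Count syllable trigrams and their contexts in pre-segmented sentences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for syllables in corpus:
        padded = _pad(syllables)
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            ctx[tuple(padded[i - 2:i])] += 1
    return tri, ctx, len(vocab)


def log_prob(syllables, model):
    """Log-probability of a syllable sequence with add-one smoothing."""
    tri, ctx, vocab_size = model
    padded = _pad(syllables)
    total = 0.0
    for i in range(2, len(padded)):
        context = tuple(padded[i - 2:i])
        count = tri[context + (padded[i],)]
        total += math.log((count + 1) / (ctx[context] + vocab_size))
    return total


def best_segmentation(candidates, model):
    """Return the candidate segmentation the trigram model scores highest."""
    return max(candidates, key=lambda c: log_prob(c, model))


# Toy training data (hypothetical): "ตา|กลม" is seen more often than "ตาก|ลม"
model = train_trigrams([["ตา", "กลม"], ["ตา", "กลม"], ["ตาก", "ลม"]])
print(best_segmentation([["ตา", "กลม"], ["ตาก", "ลม"]], model))
# prints ['ตา', 'กลม']
```

The rules propose every syllable sequence that is valid for the input string, and the model ranks them. A real model would be trained on a large corpus, such as the 630,000 syllables mentioned above, rather than on three toy sentences.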

  • Earlier versions can be found below:
  • A standalone version running on Windows XP can be downloaded <here> (version 2.1)
  • A DOS version can be downloaded <here>. You will need to unrar all files into a directory of your choice. To run the program, type "thaiseg INPUTFILE OUTPUTFILE /w or /s (/vb)". The last option (verbose) is optional.
  • Version 2.2 for Windows 7 is available <here>
  • This program can be used for non-commercial purposes. 

This program was a part of a project supported by the Research Division of the Faculty of Arts, 2000-2.
Written by Wirote Aroonmanakun. Copyright 2002.

© Wirote 2012