CrossAsia N-gram Service

The full texts stored in the CrossAsia ITR are an excellent source for anyone interested in digital humanities and in statistical, linguistic, and other computational analysis. Access to the licensed full texts is restricted to registered users. To make the material more widely usable, the texts are prepared as N-gram datasets: each text is split into fragments (for Chinese texts, combinations of one, two, or three characters), and only the frequency of each fragment in the corpus is reported. In this form, the data can be downloaded without restriction.
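To illustrate what such fragments look like, here is a minimal Python sketch that counts character uni-, bi-, and trigrams in a short string. It is not the pipeline used to produce the datasets. The function names, the use of overlapping windows, and the lack of any punctuation or whitespace handling are assumptions made for this example only.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n (assumed windowing)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, max_n=3):
    """Map each n in 1..max_n to a Counter of n-gram frequencies."""
    return {n: Counter(char_ngrams(text, n)) for n in range(1, max_n + 1)}


# Illustrative sample string, not taken from the datasets.
sample = "道可道非常道"
counts = ngram_counts(sample)

for n in (1, 2, 3):
    print(n, dict(counts[n]))
```

Only the frequency tables are kept in this approach, so the running text cannot be read directly from the output.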

You can download the various datasets and explore them on your own computer and with your own tools. Currently, three datasets of Chinese text collections with N-grams per book are available, each with uni-, bi-, and trigrams.
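Because the N-grams are provided per book, one typical first step is to combine the per-book counts into collection-wide counts. The Python sketch below shows one way this might be done. The book identifiers and counts are invented for illustration, and the in-memory dictionary stands in for whatever file format the downloaded datasets actually use, which is not described here.

```python
from collections import Counter

# Hypothetical per-book bigram counts; real values would be read
# from the downloaded dataset files.
per_book = {
    "book_a": Counter({"天下": 12, "道德": 3}),
    "book_b": Counter({"天下": 5, "聖人": 7}),
}


def merge_counts(books):
    """Sum per-book N-gram counts into one collection-wide Counter."""
    total = Counter()
    for counts in books.values():
        total.update(counts)
    return total


def relative_frequency(counts, ngram):
    """Share of all N-gram occurrences in `counts` accounted for by `ngram`."""
    total = sum(counts.values())
    return counts[ngram] / total if total else 0.0


corpus = merge_counts(per_book)
print(corpus.most_common(3))
print(relative_frequency(per_book["book_a"], "天下"))
```

Keeping the per-book counters alongside the merged one makes it possible to compare a single title against the collection as a whole.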

If you are interested in N-gram sets of any other resources from the CrossAsia context, please contact us! These sets will then also be made available via this page.

N-gram datasets

Collection: Xuxiu Siku Quanshu 續修四庫全書
Description: 'Sequel to the Siku quanshu' of the late 18th century with more than 5,000 titles.

Collection: Local Gazetteers
Description: Chinese geographical works from Tang dynasty to the Republican Era with about 8,000 titles.

Collection: Daozang Jiyao 道藏辑要
Description: 'Essentials of the Daoist Canon', collection of Daoist texts with about 300 titles.

The datasets are released under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

Recommended citation: Please follow the suggestions provided in the repository.