CrossAsia N-gram Service

The full texts stored in the CrossAsia ITR are an excellent source for anyone interested in digital humanities and in statistical, linguistic, or other computational analysis. Access to the licensed full texts is restricted to registered users. The N-gram datasets, however, can be downloaded without restriction: the texts are split into fragments (for Chinese texts, sequences of one, two, or three characters), and only the frequency of each fragment in the corpus is published.
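As an illustration of the splitting described above, the following minimal Python sketch counts overlapping character uni-, bi-, and trigrams in a short string. The function names and the sample string are chosen for this example only; how the service itself handles punctuation, whitespace, or line breaks is not stated here, so this sketch simply treats every character alike.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_frequencies(text, max_n=3):
    """Count the n-grams of one text for n = 1 .. max_n."""
    return {n: Counter(char_ngrams(text, n)) for n in range(1, max_n + 1)}


# Illustrative sample, not taken from any of the datasets.
sample = "道可道非常道"
freqs = ngram_frequencies(sample)

for n, counter in freqs.items():
    print(n, counter.most_common(3))
```

Only the resulting frequencies, not the running text, would be kept in a dataset of this kind.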

Here you can download various datasets and explore them on your own computer with your own tools. Currently, three datasets of Chinese text collections are available, each providing uni-, bi-, and trigrams per book. Further datasets and some online services for analysing the N-grams will be published soon.
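Because the N-grams are provided per book, a typical first step is to merge the per-book counts into totals for a whole collection. The sketch below shows one way to do this in Python. It assumes a simple tab-separated layout with one N-gram and its count per line; this layout, the function name, and the sample values are assumptions made for illustration, and the actual file format should be checked in the downloaded dataset.

```python
import csv
import io
from collections import Counter


def read_ngram_counts(handle):
    """Parse 'ngram<TAB>count' lines into a Counter (assumed layout)."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        ngram, count = row
        counts[ngram] += int(count)
    return counts


# In-memory stand-ins for two per-book files (hypothetical values).
book_a = io.StringIO("天下\t12\n道德\t7\n")
book_b = io.StringIO("天下\t3\n無為\t5\n")

# Adding Counters sums the frequencies of N-grams shared between books.
corpus = read_ngram_counts(book_a) + read_ngram_counts(book_b)
top = corpus.most_common(2)
print(top)
```

With real files, the `io.StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8", newline="")`.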

N-gram datasets

Collection | Description | N-gram
Xuxiu Siku Quanshu 續修四庫全書 | 'Sequel to the Siku quanshu', continuing the late 18th-century Siku quanshu, with more than 5,000 titles. | uni-, bi-, trigrams
Local Gazetteers | Chinese geographical works from the Tang dynasty to the Republican era, with about 8,000 titles. | uni-, bi-, trigrams
Daozang Jiyao 道藏辑要 | 'Essentials of the Daoist Canon', a collection of Daoist texts with about 300 titles. | uni-, bi-, trigrams

The datasets are released under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

Recommended citation: Please follow the suggestions provided in the repository.