API Reference

A character frequency mapping can be computed from text, or loaded from a predefined corpus, via the charfreq function.

CJKFrequencies.charfreq - Function
charfreq(text)
charfreq(charfreq_type)

Create a character frequency mapping from text, or load one from its default location for a predefined character frequency dataset (e.g. SimplifiedLCMC, SimplifiedJunDa).

Examples

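When creating a character frequency from text, this method behaves almost exactly like DataStructures.counter, except that the return value always has type CharacterFrequency (Accumulator{String, Int}). Note that the output below displays the type as CJKFrequency, so the exact type name may differ between package versions.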

julia> text = split("王老师性格内向,沉默寡言,我除在课外活动小组“文学研究会”听过他一次报告,并听-邓知识渊博,是“老师的老师”外,对他一无所知。所以,研读他的作", "");

julia> charfreq(text)
CJKFrequency{SubString{String}, Int64}(Accumulator(除 => 1, 报 => 1, 是 => 1, 知 => 2, 并 => 1, 性 => 1, , => 6, 言 => 1, 邓 => 1, 外 => 2, 所 => 2, 对 => 1, 动 => 1, 寡 => 1, 。 => 1, 渊 => 1, 学 => 1, - => 1, 听 => 2, 我 => 1, 次 => 1, 一 => 2, 读 => 1, 作 => 1, 格 => 1, “ => 2, 博 => 1, 课 => 1, 老 => 3, 会 => 1, 告 => 1, 无 => 1, 活 => 1, 组 => 1, 内 => 1, 师 => 3, 的 => 2, 小 => 1, 文 => 1, 默 => 1, 究 => 1, 过 => 1, 在 => 1, 以 => 1, ” => 2, 研 => 2, 他 => 3, 向 => 1, 沉 => 1, 王 => 1), Base.RefValue{Int64}(71))
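To make the counting behaviour concrete, here is a minimal sketch in plain Julia of what a text-based frequency count computes. It deliberately uses neither CJKFrequencies nor DataStructures; `count_chars` is a hypothetical helper written for illustration only, and a plain `Dict` stands in for the Accumulator that the real function returns.

```julia
# Hypothetical helper, not part of CJKFrequencies: count how often each
# character occurs in a string, keyed by one-character strings.
function count_chars(text::AbstractString)
    counts = Dict{String,Int}()
    for ch in text                      # iterating a String yields Chars
        key = string(ch)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

counts = count_chars("老师的老师")
total = sum(values(counts))             # total number of characters counted
```

The running total plays the same role as the final integer shown in the charfreq output above, which records how many items were counted overall.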

See the documentation for individual character frequency dataset structs for examples of the second case.

Supported Predefined Character Frequency Datasets

The struct name of a Chinese character frequency dataset is prefixed with either Traditional or Simplified, depending on whether the dataset is based on a traditional or simplified text corpus.

CJKFrequencies.SimplifiedLCMC - Type
SimplifiedLCMC([categories])

A character frequency dataset based on the Lancaster Corpus of Mandarin Chinese (LCMC), a simplified text corpus; only simplified terms are included. See their website for more details about the corpus.

The character frequency can be based only on selected categories (see CJKFrequencies.LCMC_CATEGORIES for valid category keys and corresponding category names). Any invalid categories will be ignored.
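The "invalid categories are ignored" rule can be pictured with a small sketch. Everything here is an assumption made for illustration: `EXAMPLE_CATEGORY_KEYS` is a hypothetical stand-in for the keys of CJKFrequencies.LCMC_CATEGORIES (the real constant may hold different keys or key types), and `valid_categories` is not a function of the package.

```julia
# Hypothetical stand-in for the keys of CJKFrequencies.LCMC_CATEGORIES;
# the real constant may use different keys or key types.
const EXAMPLE_CATEGORY_KEYS = Set(['A', 'B', 'E', 'G'])

# Keep only the requested categories that are known keys; anything else
# is silently dropped, mirroring the documented behaviour.
function valid_categories(requested)
    return [c for c in requested if c in EXAMPLE_CATEGORY_KEYS]
end

from_string = valid_categories("AXB")       # iterating a String yields Chars
from_vector = valid_categories(['G', 'Z'])  # any iterable of keys works
```

Because the filter only iterates over its argument, a string of category letters and a vector of characters are handled the same way.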

Examples

Loading all the categories:

julia> charfreq(SimplifiedLCMC())
DataStructures.Accumulator{String,Int64} with 45411 entries:
  "一路…   => 1
  "舍得"   => 9
  "58"   => 1
  "神农…   => 1
  "十点"   => 8
  "随从"   => 9
  "荡心…   => 1
  "尺码"   => 1
  ⋮      => ⋮

Or loading just a subset (argument can be any iterable):

julia> charfreq(SimplifiedLCMC("ABEGKLMNR"))
DataStructures.Accumulator{String,Int64} with 35488 entries:
  "废…  => 1
  "蜷"  => 1
  "哇"  => 13
  "丰…  => 1
  "弊…  => 3
  "议…  => 10
  "滴"  => 28
  "美…  => 1
  ⋮    => ⋮

Licensing/Copyright

Note: This corpus has some conflicting licensing information, depending on who is supplying the data.

The original corpus is provided primarily for non-profit-making research. Be sure to see the full end user license agreement.

Via the Oxford Text Archive, this corpus is distributed under the CC BY-NC-SA 3.0 license.

CJKFrequencies.SimplifiedJunDa - Type
SimplifiedJunDa()

A character frequency dataset of modern Chinese compiled by Jun Da, simplified single-character words only.

Currently, only the modern Chinese dataset is fetched; however, in the future, the other lists may also be provided as an option.

Examples

julia> charfreq(SimplifiedJunDa())
DataStructures.Accumulator{String,Int64} with 9932 entries:
  "蜷… => 837
  "哇… => 4055
  "湓… => 62
  "滴… => 8104
  "堞… => 74
  "狭… => 6901
  "尚… => 38376
  "懈… => 2893
  ⋮   => ⋮
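A common next step after loading a frequency mapping is to rank entries by count. The sketch below shows one way to do that in plain Julia. It assumes only that the mapping can be collected into key => count pairs; the `sample` dictionary is made-up illustration data (counts borrowed from the text example earlier on this page), not values from the Jun Da dataset.

```julia
# Illustration data only: a plain Dict standing in for a loaded frequency mapping.
sample = Dict("老" => 3, "师" => 3, "知" => 2, "王" => 1)

# Sort by descending count, breaking ties by key so the order is deterministic.
ranked = sort(collect(sample); by = p -> (-p.second, p.first))

# The two most frequent entries.
top2 = ranked[1:2]
```

Sorting on a `(-count, key)` tuple avoids depending on the dictionary's iteration order when several characters share the same count.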

Licensing/Copyright

The original author maintains full copyright to the character frequency lists and provides them for research and teaching/learning purposes only; commercial use requires permission from the author. See their full disclaimer and copyright notice here.

More datasets are planned. To add a dataset to this API, see the Developer Docs page.