離籬原上草 is curious
1 years ago
for personal record
詩經 and 辭賦 (the Classic of Poetry and classical fu) have me doubting my Chinese ability as always, and that got me curious about how many Chinese characters I know or write on a daily basis.

To answer this (and to thoroughly knock my confidence) I put together something to count the number of unique Chinese characters in a passage.
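A counter like that can be very small. The following is only a minimal sketch, not the actual script used for these numbers; it assumes "Chinese character" means the basic CJK Unified Ideographs block plus Extension A, and the names `HAN` and `unique_chars` are made up for illustration.

```python
import re

# Assumption: a "Chinese character" is anything in CJK Unified Ideographs
# (U+4E00-U+9FFF) or Extension A (U+3400-U+4DBF). Punctuation, Latin
# letters, and digits are ignored.
HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def unique_chars(text: str) -> set[str]:
    """Return the set of distinct Chinese characters found in text."""
    return set(HAN.findall(text))


sample = "離離原上草,一歲一枯榮。"
print(len(unique_chars(sample)))  # 8: the repeated 離 and 一 count once
```

For a whole book, the same function can be run on the contents of a txt file read with the right encoding.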

That had me wondering about authors I read.
離籬原上草
1 years ago
A quick search online suggests that a Chinese speaker knows and regularly uses 1000-3000 characters, and that by the end of high school you have supposedly taken in over 6000 characters -- something I'm pretty sure I didn't do.
離籬原上草
1 years ago
That said, I got curious about the number of unique characters these authors I read usually use. That doesn't at all indicate writing competency, but well, it felt interesting regardless so I decided to go play with that a bit.

For reference 金庸's 笑傲江湖/倚天屠龍記 had around 3.5-4k unique characters.
離籬原上草
1 years ago
So I've recorded some numbers and figures. Not all of them are necessarily useful (and they made reaching my conclusions very confusing), but at least it's something to work with.

And an important non-disclaimer: alright, to do this I downloaded a number of works I wanted to test. They could be found online as txt files because they were popular enough.
離籬原上草
1 years ago
As for the fanfic ones, I've had them on my computer since a long time ago. // Plurk, please, your character limits are a tad limiting
離籬原上草
1 years ago
I made attempts to analyse the data in a fairer way, looking at the total word count and at how often "的", the highest-frequency character, is used in the passage. Those comparisons sort of helped, but at the same time there are always outliers that manage to muddle up your understanding every single time.
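Those extra figures can be gathered in one pass. This is a hedged sketch rather than the method actually used here: `passage_stats` and its field names are hypothetical, and it assumes the same character range as a basic CJK filter (Unified Ideographs plus Extension A).

```python
import re
from collections import Counter

# Assumption: only CJK Unified Ideographs and Extension A count as characters.
HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def passage_stats(text: str) -> dict:
    """Collect total length, unique count, and how often 的 appears."""
    chars = HAN.findall(text)
    counts = Counter(chars)
    total = len(chars)
    return {
        "total": total,                # total Chinese characters in the passage
        "unique": len(counts),         # distinct Chinese characters
        "de_count": counts["的"],      # raw count of 的
        "de_share": counts["的"] / total if total else 0.0,  # 的 as a share of all characters
    }


print(passage_stats("我的書,你的書"))
```

Comparing `unique` against `total` (or against `de_count`) gives a rough way to put a short work and a long one on the same footing.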
離籬原上草
1 years ago
So now, observations
離籬原上草
1 years ago
1. There are authors with rather good writing, and that does not always equate to a large character set. Conversely, there are authors who basically cannot write well and still end up with a surprisingly decent character set.
https://images.plurk.com/3QoqwQS4l1ArlPjBqMBuRz.png
離籬原上草
1 years ago
Apparently works like 靜影沉璧、瀟湘水冷、風骨同守, which have a really 古風 (classical-flavoured) writing style, had a lower unique character count than the others. This was kind of not what I expected, considering these works are where I find the words I'm most unfamiliar with and end up questioning my Chinese reading ability (sorry lol).
離籬原上草
1 years ago
I am also surprised by 潭石、水千丞、覆水難收、木蘇里, because their writing is honestly in the "not bad" to "can be considered good" range, while using a much smaller lexicon than I expected.

(不問三九 was also low and in this range, but I forgot to mark hers down so, shrugs)
離籬原上草
1 years ago
2. That said, some authors do typically reach a range of ~3.6-3.7k unique characters, and that is a pretty wide character set. Points 2 and 3 go together, so see the table in 3.
離籬原上草
1 years ago
3. The authors in 2. tend to round out at ~4k+ unique characters as more books are added.

With each book there would be an increase of a few hundred characters, but they tend to level off at around ~4.3k, and that is likely the ceiling for the good authors in 2.

Some do level off at ~3.7k though, which indicates a narrower range of lexicon employed.
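The levelling off can be seen by keeping a running set across an author's books. This is a sketch under assumptions: `cumulative_unique` and `new_per_book` are hypothetical helper names, each book is assumed to be already loaded as one string, and the tiny strings in the usage lines stand in for whole books.

```python
import re

# Assumption: only CJK Unified Ideographs and Extension A count as characters.
HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def cumulative_unique(books: list[str]) -> list[int]:
    """Running total of distinct characters after each book is added."""
    seen: set[str] = set()
    totals = []
    for text in books:
        seen.update(HAN.findall(text))
        totals.append(len(seen))
    return totals


def new_per_book(totals: list[int]) -> list[int]:
    """How many previously unseen characters each book contributed."""
    return [later - earlier for earlier, later in zip([0] + totals, totals)]


totals = cumulative_unique(["天地玄黃", "天地人", "黃人"])
print(totals)                # [4, 5, 5]
print(new_per_book(totals))  # [4, 1, 0]
```

When the per-book increments shrink towards zero, the author's character set has more or less levelled off; increments that stay in the hundreds mean it is still growing.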
離籬原上草
1 years ago
https://images.plurk.com/60cE5l0237SbnulJMf9Kxo.png
離籬原上草
1 years ago
4. At ~1M to 1.5M words from the author's books, I daresay the unique character count is pretty steady already. Nonetheless there are still some authors whose number of unique characters keeps on increasing even at this level, which is lowkey irritating, perplexing,
離籬原上草
1 years ago
and at the same time I just don't understand how their word choice varies that greatly from story to story (given the settings are not necessarily that different).

https://images.plurk.com/4uk2ltGz6ys4PYQeGPS14b.png

This category aggravates me so much; I honestly couldn't make head or tail of it lol.
離籬原上草
1 years ago
5. At the same time there are authors who naturally have a great range, considering they max out at a high unique character count, or already have a high unique character count at a low total word count.

https://images.plurk.com/1iSG6OwyXUg95DXgiZ6qdR.png
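One way to compare a short work against a long one is to count unique characters over the same number of characters from each. This is only a sketch of that idea, not something the figures above were computed with; `unique_in_prefix` is a hypothetical name and the character range is an assumption.

```python
import re

# Assumption: only CJK Unified Ideographs and Extension A count as characters.
HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def unique_in_prefix(text: str, n: int) -> int:
    """Distinct characters among the first n Chinese characters of text."""
    chars = HAN.findall(text)
    return len(set(chars[:n]))


text = "天地玄黃天地人"
print(unique_in_prefix(text, 4))    # 4
print(unique_in_prefix(text, 100))  # 5: n larger than the text just uses all of it
```

An author who already scores high at a small `n` would fall into this "naturally wide range" group.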