Document Analysis

The Chinese Toolbox Document Analysis feature compares the current document (or all loaded documents) against the currently selected character frequency list. Settings in the “Document Analysis Settings” window allow some customization of the analysis.

[Screenshots: DocumentAnalysis01, DocumentAnalysis03, DocumentAnalysis02]


The screenshot on the left shows the text of the first loaded document. In the middle screenshot, the “Analyze Document” button/tab has already been clicked. You can see from the menu and in the text of the analysis which frequency list was used. You can also see the “Document Analysis Settings” window, where the number of characters to use for the analysis was specified. In this analysis, all documents were merged and analyzed as a single document, as determined by the first checkbox in the “Document Analysis Settings” window. The screenshot on the right shows the Document menu with the names of all the analyzed documents. These documents were imported into Chinese Toolbox 2012 from http://www.chineselearner.com/reading/chinese-translation/sherlock01-1.html.

The analysis produces some interesting data, and you can use it however you like. The original intention was to determine how a set of Chinese documents could contribute to Chinese literacy.

This analysis feature provides three degrees of granularity:

  1. The analysis shown in the Reader. This is just a summary.
  2. If the second (bottom) checkbox in the “Document Analysis Settings” window is checked, some information, mostly lists of characters with their statistical data, is omitted from the DocumentAnalysis.u8 file. In this example, the resulting file is 103 KB.
  3. If that second checkbox is unchecked, all analysis data is written to DocumentAnalysis.u8. The analysis takes several seconds longer, and the file is about twice the size: 204 KB in this example.

The following is the analysis summary that appears in the screenshots above.

Analysis Summary. The detailed document analysis exists in DocumentAnalysis.u8 as a UTF-8 text file in your Chinese Toolbox document folder.

* Document: All loaded documents
* Frequency list: Modern Chinese Character Frequency List
* Total characters: 12954
* Total analyzed Chinese characters in document: 11181
* Total unique Chinese characters in document: 1263
* Characters in document exist within top 7928 of the frequency list.
* 9893 of 11181 (88.48%) of the characters in the document exist in the top 1000 of the current frequency list.
* 765 of 1263 (60.57%) of the unique characters in the document exist in the top 1000 of the current frequency list.
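
The figures in this summary can be approximated outside the program. The following Python sketch is not how Chinese Toolbox itself computes them; it is a minimal illustration under stated assumptions. The function names (`is_chinese`, `summarize`), the rule that only code points U+4E00 to U+9FFF count as Chinese characters, and the representation of the frequency list as a string ordered from most to least frequent are all assumptions made for this example.

```python
from collections import Counter


def is_chinese(ch):
    # Assumption: only the basic CJK Unified Ideographs block (U+4E00-U+9FFF)
    # counts as a "Chinese character"; the program's own rule may differ.
    return "\u4e00" <= ch <= "\u9fff"


def summarize(text, frequency_list, top_n):
    """Return counts resembling those in the analysis summary."""
    analyzed = [ch for ch in text if is_chinese(ch)]
    counts = Counter(analyzed)            # occurrences of each unique character
    top = set(frequency_list[:top_n])     # analyzed portion of the frequency list

    # Occurrences (and unique characters) that fall inside the top_n portion.
    covered_total = sum(n for ch, n in counts.items() if ch in top)
    covered_unique = sum(1 for ch in counts if ch in top)

    def percent(part, whole):
        return round(100 * part / whole, 2) if whole else 0.0

    return {
        "total_characters": len(text),
        "analyzed_characters": len(analyzed),
        "unique_characters": len(counts),
        "covered_total": covered_total,
        "covered_unique": covered_unique,
        "covered_total_pct": percent(covered_total, len(analyzed)),
        "covered_unique_pct": percent(covered_unique, len(counts)),
    }


# Tiny made-up sample, not the documents analyzed on this page.
sample_text = "我们一起，我的马"
sample_frequency_list = "的一是不了在人有我他"
result = summarize(sample_text, sample_frequency_list, top_n=10)
print(result)
```

The same percentage formula reproduces the figures in the summary: 9893 of 11181 rounds to 88.48, and 765 of 1263 rounds to 60.57.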

The following are the contents of DocumentAnalysis.u8, with some of the information excluded. For display on this page, most of the characters in the lists have been removed. Only the first few characters are shown so you can see the data pattern.

* Document: All loaded documents
* Frequency list: Modern Chinese Character Frequency List
* Total characters: 12954
* Total analyzed Chinese characters in document: 11181
* Analyzed Chinese characters in document: 银色马一天早晨我 (MANY REMOVED)
* Total unique Chinese characters in document: 1263
* Unique Chinese characters in document: 银色马一天早晨我们起 (MANY REMOVED)
* Characters in document exist within top 7928 of the frequency list.
* Unique character counts: 1:银:11; 2:色:21; 3:马:138; 4:一:239; 5:天:32; 6:早:12; 7:晨:6; 8:我:252; 9:们:105; 10:起:22;  (MANY REMOVED)
* Counts of the number of times frequency list characters occur in the analyzed document: 1:的:420; 2:一:239; 3:是:199; 4:不:133; 5:了:142; 6:在:166; 7:人:90; 8:有:114; 9:我:252; 10:他:172; (MANY REMOVED)
* 9893 of 11181 (88.48%) of the characters in the document exist in the top 1000 of the current frequency list.
* 765 of 1263 (60.57%) of the unique characters in the document exist in the top 1000 of the current frequency list.
* Document characters that exist in the analyzed portion of the frequency list: 银色马一天早我们一起 (MANY REMOVED)
* Document characters that do NOT exist in the analyzed portion of the frequency list: 晨餐摩穆摩皱眉刊 (MANY REMOVED)
* Unique document characters that exist in the analyzed portion of the frequency list: 银色马一天早我们 (MANY REMOVED)
* Unique document characters that do NOT exist in the analyzed portion of the frequency list: 晨餐摩穆皱眉刊售 (MANY REMOVED)
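
If you want to process the count lists yourself, the entries appear to follow an index:character:count pattern separated by semicolons. The Python sketch below parses one such line. The pattern is inferred only from the excerpt on this page, so the real file may differ in detail; the names `ENTRY` and `parse_counts` are made up for this example.

```python
import re

# Assumed entry shape, inferred from the excerpt: index:character:count
ENTRY = re.compile(r"(\d+):(.):(\d+)")


def parse_counts(line):
    """Parse '1:银:11; 2:色:21; ...' into (index, character, count) tuples."""
    return [(int(i), ch, int(n)) for i, ch, n in ENTRY.findall(line)]


# First few entries of the "Unique character counts" line shown above.
line = "* Unique character counts: 1:银:11; 2:色:21; 3:马:138; 4:一:239;"
entries = parse_counts(line)

# Entry with the highest count among the parsed ones.
most_common = max(entries, key=lambda entry: entry[2])
print(entries)
print(most_common)
```

In the “Unique character counts” line the index appears to follow the order in which each character first occurs in the document, while in the frequency list line it appears to be the character's rank in the frequency list.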

 

 

 
