Program WordTabulator. User's Guide

Version 3.x

1. Introduction

The wordTabulator program is intended for text analysis on MS Windows systems. With wordTabulator you can generate an index of word elements extracted from a defined set of texts. Word elements may be words, N-grams (of a defined size) or phrases (syntagmas). The program can process texts both in ordinary single-byte (ANSI) encodings and in multibyte UTF-8 encoding. Originally the program was created for processing Russian texts exclusively, but it can be used successfully for other languages too, for example Ukrainian, Icelandic or Swedish. The definition of the source texts' language is quite formal: in practice it is either Cyrillic or non-Cyrillic (God forgive me!).

WordTabulator correctly processes any Cyrillic text and takes into account the abolished Russian letters І, Ї (yi), Ѣ (yat), Ѳ (fita), Ѵ (izhitsa), which appear in the second edition of Vladimir Dahl's dictionary (published in 1880-1882). The program also correctly processes the diacritics of European and Scandinavian languages (letters with grave, acute, tilde, diaeresis etc.). Text in UTF-8 may contain absolutely any characters, even Ancient Egyptian or Chinese hieroglyphs.

Source texts are defined as a set of plain text files or HTML/XML/SGML documents. In the latter case the program can filter the content from the markup. Moreover, you can process only the content within selected paired tags, or exclude that content from processing.

As an additional feature, you can analyse a pair of text sets and compare them by their common or differing elements.

For Russian texts you can search for words given in normal form: following the rules of Russian morphology, the program finds them with all case endings. You can also search by regular expressions.

The output of the program is a word index of all text elements found. The word index may be generated in HTML format, containing the frequency of each text element and links to the original content. It may also be generated as a plain text file. Words in the index are ordered alphabetically, by value or by frequency.

Theoretically, the size of the analysed text set is not limited. It depends only on the available computing resources and the time required. For example, the complete works of Fedor Dostoevsky in 15 volumes are processed in just 11 minutes on my Lenovo netbook (Genuine Intel 1.3 GHz with 3 GB RAM). The source files are 60 MB in size, and the output index contains about 200 thousand different words.

wordTabulator is free and open source software. Its home sites are sourceforge.net and the Russian Virtual Library. The console processing module is written in the Icon programming language, and the graphical UI is developed in Delphi 7.

1.1 Credits

wordTabulator was initially developed with grants from the "Open Society" Institute allocated to the Russian Virtual Library (RVB) project.

Afterwards support was provided by the editorial board of RVB and personally by its technical editor, Vladimir Litvinov. Many thanks to him for creative support and many fruitful ideas!

Everlasting memory and respect to Ralph Griswold, who invented the Icon programming language, in which the console modules of wordTabulator are implemented.

Very useful code for working with the pipe interface on MS Windows was shared with the author by Christo Crause from South Africa.

File search in wordTabulator is implemented with the Delphi component TmFileScan developed by Mats Asplund from Sweden.

1.2 History of changes

October 1997

The first version of wordTabulator is created as a console application (very limited in functionality).

December 1999

Version 1.1 of wordTabulator for Windows is published on the Russian Virtual Library site. The console module was substantially redeveloped and a GUI was added.

January 2000

Version 1.2 of wordTabulator for Windows is published. The setup procedure is fixed so that a desktop shortcut is created. An error in processing folder names that include the Russian letter Ё or white spaces is fixed.

November 2001

Version 2.0 is created. The GUI is redeveloped. A morphology module is added to generate Russian word forms from the normal form. An internal browser is added to view the output word index and the marked context.

February 2002

Version 2.2 published. Some errors fixed.

May 2002

Version 2.2.1 published. The output index conforms to the HTML 4.01 specification.

January 2004

Version 2.2.3 published, in which the functionality of the internal browser was extended. Users can now select and copy text from the browser window in the standard way.

February 2009

The wordTabulator project was created on sourceforge.net.

December 2011

Version 3.4 of wordTabulator for MS Windows published. The GUI and console modules were radically redeveloped. The process lasted a few years, during which many unpublished intermediate versions were created and died. Support for UTF-8 encoding added. True alphabetical ordering of words in the output index added. Search by regular expressions added. Many other enhancements added as well.

March 2012

Version 3.5 published, in which bugs found in the GUI and console module were fixed. New functions were also added.

October 2016

Version 3.6 was published on the Russian Virtual Library site. In this release the GUI was migrated from Delphi to the Lazarus IDE and some minor errors were fixed. Version 3.6 was planned to be the final one, as the whole functionality of wordTabulator was integrated into xMarkup. The question of a GUI has not been resolved since that time. Nevertheless, version 3.6 is final and wordTabulator as a separate project is closed.

September 2020

Version 3.6 was published on sourceforge.net, so that users of wordTabulator would not lose the program.

2. Terms and definitions

I should warn you that the definitions given below are used only within wordTabulator. They may be slightly incorrect or arbitrary from the point of view of professional linguistics. It doesn't matter.

A word or lexeme is a specific word in a specific grammatical form. More formally, a word is a sequence of alphanumeric symbols bounded in the text by punctuation marks or white space. A word may contain a hyphen (-), an apostrophe (') and, in the case of abbreviations, a point (.). The set of symbols permitted within a word may be extended by the user. A word must begin with a letter or a digit. Symbols within a word may be represented by ordinary characters of the source encoding, by named HTML entities or by numeric character references (NCR).

Examples of words:  Ивановѣ,  O'Genry,  step-by-step,  A.D.,  Pushkin,  1837,  καρπῶν.

One kind of word is an abbreviation. These are name initials such as A. S.,  J.-P. or abbreviations such as h.,  p.p.,  p.m.,  U.S.S.R. Such abbreviations always include a point at the end. Other kinds of abbreviations, for example etc. or ex.of.nam.abbr., are not recognised by the program as abbreviations and are processed as a set of separate one-word phrases.

An N-gram or word combination is a sequence of N words within a phrase. For example, Road to hell and paved with good are 3-grams within the phrase "Road to hell paved with good intentions.".
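The N-gram definition can be illustrated with a short Python sketch. This is not the code of wordTabulator (which is written in Icon); the function name ngrams is an assumption made for the example, and words are simply split on white space.

```python
def ngrams(phrase: str, n: int) -> list[str]:
    """Return every run of n consecutive words found in one phrase."""
    words = phrase.split()
    # A phrase of k words contains k - n + 1 N-grams (none if k < n).
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


trigrams = ngrams("Road to hell paved with good intentions", 3)
print(trigrams)
```

The result includes both 3-grams mentioned in the text, together with the three others that the phrase contains.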

A phrase or syntagma is a sequence of words in a sentence bounded by punctuation marks or by any non-alphanumeric symbols (for example, brackets). The set of delimiters for a syntagma may be redefined by the user. Words within a syntagma are delimited by white spaces or tabs. For example, the Latin proverb "Sanum corpore, mens sana" includes two phrases (syntagmas): Sanum corpore and mens sana.
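Splitting a text into syntagmas might look roughly as follows in Python. The delimiter set used here (anything that is not a letter, digit, white space, hyphen or apostrophe) is an assumption for illustration only; the real program treats points inside abbreviations specially and lets the user change the delimiters.

```python
import re

# Assumed delimiter set: any run of characters that cannot belong to a word
# and is not white space. Hyphen and apostrophe are allowed inside words.
DELIMITERS = re.compile(r"[^\w\s'-]+")


def split_phrases(text: str) -> list[str]:
    """Split text into phrases and normalise white space inside each one."""
    parts = DELIMITERS.split(text)
    return [" ".join(part.split()) for part in parts if part.strip()]


phrases = split_phrases('"Sanum corpore, mens sana"')
print(phrases)
```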

3. Algorithm of text processing

All source texts must be in the same encoding; this is a basic rule. Processing is always performed by the following algorithm:

1) The source text is converted to an intermediate representation according to the source encoding and format.

2) Syntagmas are extracted from the text using the defined set of delimiters.

3) If so defined, particular words or N-grams are extracted from each syntagma.

4) The results are checked against the defined limits on length or frequency, or against the search queries.

5) If an exclusion set is defined, a set operation (union, minus or intersect) is performed on the results.

6) The output index is ordered as defined and generated in the specified output format.

7) If so defined, the output index is post-processed by the specified script.
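Steps 2 to 6 above can be sketched in a few lines of Python. This is a simplified illustration, not the actual implementation: the function build_index and its parameters are hypothetical, only the "minus" set operation is shown, and words are lowercased for case-insensitive counting.

```python
import re
from collections import Counter


def build_index(text, min_len=1, exclusions=frozenset()):
    """Return (word, frequency) pairs in alphabetical order."""
    # Step 2: extract syntagmas using an assumed delimiter set.
    phrases = [p for p in re.split(r"[^\w\s'-]+", text) if p.strip()]
    # Step 3: extract words from each syntagma and count them.
    counts = Counter(w.lower() for p in phrases for w in p.split())
    # Step 4: apply a length limit. Step 5: subtract the exclusion set.
    kept = {w: c for w, c in counts.items()
            if len(w) >= min_len and w not in exclusions}
    # Step 6: order the index alphabetically.
    return sorted(kept.items())


index = build_index("The cat saw the dog. The dog ran!",
                    min_len=3, exclusions={"the"})
print(index)
```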

4. Program components

Program wordTabulator includes following separate modules:

5. Creation of text processing project

Work in wordTabulator begins with creating a project. This means that the user defines the source texts to process and the parameters that control how they are processed. When the program starts, a new empty project is created automatically. Now we shall define the source texts.

5.1 Define source texts

On the pane [Source Files] the File Tree initially has two empty folders: [Input Texts Set] and [Exclusions Set]. Let us choose the file type (*.htm) and then press the button [Search & add file tree]. A dialog window opens in which you choose the source folder. Suppose that we want to process the texts of Fedor M. Dostoevsky, which are stored in the folder f:\texts\rvb\dostoevsky. Select this folder and then press [OK].

Search & add file tree

All HTML files from all nested subfolders of the folder [dostoevsky] will be added to the File Tree of input texts, and the original folder structure is preserved. Below, in the black console window, you can see that 1226 files were added.

Input Texts Set

We are not interested in the file "f:\texts\dostoevsky\index.htm", which just contains the table of contents. Remove it from the File Tree: select its title and press the drop button. Don't be afraid, this file will not be removed from your hard disk! It will be removed only from the File Tree. In the same way you can remove other files or folders. If you just want to clear a folder without deleting it, use the clear button. Again, purging a folder has no effect on your hard disk; it only clears the folder in the File Tree.

Files can be added to any folder of the File Tree as many times as needed. To do so, select the required folder in the File Tree and then add files with the buttons [Search & add file tree], [Search & add files] or [Add files]. Navigation in the File Tree should be intuitive. A right mouse click opens the context menu.

Context menu

So, we have defined the source texts to process. No exclusions are defined. We are ready to set the parameters of text processing on the pane [Word Options].

5.2 Define processing rules

Word Options

Define the encoding of the source texts if it differs from windows-1251 (the default encoding for wordTabulator). For Cyrillic texts it may also be KOI8-R or CP866. Please remember that the windows-1252 encoding should be used only for non-Cyrillic texts, for example English or Swedish. And the encoding of all source texts must be the same!

By default the type of processed elements is a word, which may contain one or more letters. To be more exact, not letters but symbols that correspond to letters. By default Cyrillic words may contain the following characters of the windows-1251 charset:

The set of letters can be extended: you can define such characters or their codes in the field [Extended letters]. Codes of ANSI characters may be given as \ddd, where ddd is a decimal code in the range 0..255. You may also use codes in hexadecimal notation, \xhh. For example, the codes \255 and \xFF define the same character.
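The equivalence of the \ddd and \xhh notations can be checked with a small Python sketch. The function decode_extended is hypothetical and only illustrates the notation; it assumes the windows-1251 code page (Python codec name cp1251), in which code 255 is the letter я.

```python
import re

# Matches \xhh (two hexadecimal digits) or \ddd (one to three decimal digits).
ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})|\\(\d{1,3})")


def decode_extended(spec: str, encoding: str = "cp1251") -> str:
    """Replace character codes in spec with the characters they denote."""
    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        if code > 255:
            raise ValueError(f"code {code} is outside the range 0..255")
        # A single byte is decoded in the assumed single-byte code page.
        return bytes([code]).decode(encoding)
    return ESCAPE.sub(replace, spec)


letters = decode_extended(r"\255\xFF")
print(letters)
```

Characters that are not written as codes pass through unchanged, so codes and plain characters can be mixed in one string.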

Additionally, a word may contain (but only in the middle) the following extended characters:

The set of such characters can be extended too: define the characters or their codes in the field [Extended characters].

As mentioned above, characters within a word may also be represented by named HTML entities and by NCR codes (Numeric Character References). In this case the set of "letters" becomes practically unlimited, as wordTabulator accepts such characters without any limitations. Please note the characters І,  Ї,  Ѣ,  Ѳ,  Ѵ, which are old Cyrillic letters abolished after 1918.
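To see what a word written with named entities or NCR codes stands for, you can decode it with Python's standard html module. This only illustrates the notation; it says nothing about how wordTabulator decodes such characters internally.

```python
import html

# Decimal NCR codes for the Cyrillic letters of the word "Ивановѣ".
word = html.unescape("&#1048;&#1074;&#1072;&#1085;&#1086;&#1074;&#1123;")

# A hexadecimal NCR code and a named entity.
yat = html.unescape("&#x463;")
cafe = html.unescape("caf&eacute;")

print(word, yat, cafe)
```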

If needed, you can change the set of stop characters in the field [Delimiters]. These characters define the bounds of phrases extracted from the source text. Note that the point (.), comma (,), colon (:), exclamation mark (!), question mark (?) and round brackets are fixed delimiters and cannot be redefined. However, you can add any other ANSI characters to the delimiters, for example letters or digits.

Now we can go to the pane [Output Index] to define the format of the output word index.

5.3 Define Output Index

Output Index

The output index is the result of text processing and contains the text elements found. The index may have three different formats:

By default the list of items is sorted in alphabetical order. This means that all items are processed case-insensitively and output with an initial capital letter. The order of items corresponds to the Cyrillic alphabet (taking into account the abolished pre-reform Russian letters). Latin words and letters with diacritics are placed after Cyrillic ones. Alphabetical sorting also correctly processes letters defined by NCR codes and named HTML entities.

You can define ascending or descending sort order.

If you want to process words case-sensitively, change the alphabetical order to ordering by word value (the Windows default: sorting by character code values) or by frequency. If the output index is extremely large, you may choose no ordering to speed up processing. To enable case-sensitive processing, go back to the pane [Words Options] and check [match case].

By default the output index is created in a single file, whose name is defined in the field [Output Index]. If your HTML index contains thousands of items, it may be very difficult to view in an HTML browser. In this case it is better to choose output of items "by parts of N" (N=1000 works best).

Output items

In this case many parts (files) are created in the output folder, linked together by navigational hyperlinks. You can easily view the output index part by part, switching from one part to another with a single mouse click.
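The idea of output "by parts of N" can be sketched in Python as follows. The function names and the file naming scheme (index_001.html and so on) are assumptions made for the example; the names that wordTabulator actually gives to the parts are not described here.

```python
def split_into_parts(items: list[str], part_size: int = 1000) -> list[list[str]]:
    """Cut the ordered list of index items into consecutive parts."""
    if part_size < 1:
        raise ValueError("part_size must be positive")
    return [items[i:i + part_size] for i in range(0, len(items), part_size)]


def part_name(number: int) -> str:
    """Hypothetical file name for the part with the given 1-based number."""
    return f"index_{number:03d}.html"


items = [f"word{i}" for i in range(2500)]
parts = split_into_parts(items)
names = [part_name(n) for n in range(1, len(parts) + 1)]
print(len(parts), names)
```

Each part can then link to its neighbours by name, which is what makes switching between parts a single click.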

There are other output modes, for example, only first or last N items.

If needed, you can define post-processing of the output index by a script executed by the program xMarkup. Post-processing means a defined transformation or analysis of the output index, or generation of additional results. A detailed description of post-processing is given in chapter 8.

5.4 Tags Options

Tags Options

For HTML/XML/SGML documents you can additionally define how wordTabulator should process paired tags <tag-name> </tag-name>. By default the content within the paired tags title and script is always skipped. If you want, you can define any other paired tags and specify whether their content should be processed or ignored.
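The effect of skipping the content of paired tags can be shown with Python's standard html.parser module. This is only a sketch of the idea for HTML input; the class TextExtractor and the function extract_text are hypothetical names, and wordTabulator's own filter is implemented differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, ignoring everything inside the skipped paired tags."""

    def __init__(self, skipped=("title", "script")):
        super().__init__()
        self.skipped = set(skipped)
        self.depth = 0      # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skipped:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skipped and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(markup, skipped=("title", "script")):
    parser = TextExtractor(skipped)
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><title>Index</title></head><body>"
        "<p>Crime and <b>Punishment</b></p>"
        "<script>var x = 1;</script></body></html>")
text = extract_text(page)
print(text)
```

Passing a different tuple as skipped corresponds to defining other paired tags whose content should be ignored.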

To automatically generate a list of all paired tags used in the markup of the source texts, press the button [Analysis]. The analysis starts, and after some time you will see an ordered list of tags in a window.

List of tags

You can edit this list, for example remove redundant tags, or replace many tags of the form p class="comment" id=* with the common class p class="comment". You can save the list of tags for future use.

5.5 Define Exclusions Set

If you want to exclude some elements (stop words) from processing, prepare a list of such elements in a separate file. The format of the list doesn't matter: you can write words separated by commas or white spaces, or one word per line. Add the prepared file to the folder [Exclusions Set] on the pane [Source Files].

You can define more complicated processing with the help of exclusions. For example, you can compare two different text sets. In this case the compared texts are added to the folders [Input Texts Set] and [Exclusions Set] respectively. Then, on the pane [Words Options], choose the required processing mode for the text sets in the list box [Exclusions operation]:

Exclusions operation
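The three set operations named in chapter 3 (union, minus, intersect) behave like the corresponding operations on Python sets. The sketch below is an illustration only: the function apply_exclusions is hypothetical, and the operation names are taken from chapter 3 rather than from the labels of the list box.

```python
def apply_exclusions(input_items: set[str], exclusion_items: set[str],
                     operation: str) -> set[str]:
    """Combine the items of the two text sets with the chosen operation."""
    operations = {
        "union": input_items | exclusion_items,      # items found in either set
        "minus": input_items - exclusion_items,      # items only in the input set
        "intersect": input_items & exclusion_items,  # items common to both sets
    }
    return operations[operation]


first = {"war", "peace", "prince"}
second = {"peace", "crime"}

common = apply_exclusions(first, second, "intersect")
only_first = apply_exclusions(first, second, "minus")
print(common, only_first)
```

With a stop-word list in the exclusion set, "minus" gives ordinary stop-word filtering; with a second text set, "intersect" and "minus" give the common and the differing elements.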

5.6 Advanced word search for Russian

If the source texts are in Russian and you want to quickly find words with all their case endings, define the list of searched words in the window [Search templates] on the pane [Words Options], and check [use morphological module] to start the morphological module as a service for wordTabulator.

morphological module

The morphological module (a generator of Russian word forms) was developed by Janna G. Anoshkina (Institute of Russian Language of the Russian Academy of Sciences) and uses a base dictionary prepared on the basis of the "Grammatical Dictionary of the Russian Language" by A.A.Zaliznyak. The dictionary of the morphological module contains the following numbers of basic forms (lemmas):

The total volume of the dictionary is more than 90000 basic forms and nearly 2 billion words. A heuristic method is used for words that are missing from the dictionary.

To get correct results, define the searched words in normal form. Nouns, verbs and personal names should be given in the singular, and nouns in the nominative case.

To switch off the morphological module, uncheck [use morphological module].

5.7 Search by regular expressions

wordTabulator understands search queries defined by regular expressions. To use them, check [regular expressions]. Note that the morphology module must be switched off. In the window [Search templates] you can define several regular expressions, which are processed as a set of alternatives joined by "OR". A regular expression is always matched from the beginning of a word, not anywhere inside it. For example, to find all abbreviations and name initials you can specify:

Regular expressions

The regular expression format is very close to format supported by the UNIX "egrep" program, with modifications as described in the Perl programming language definition. Following is a brief description of the special characters used in regular expressions. In the description, the abbreviation RE means regular expression.

c An ordinary character (not one of the special characters discussed below) is a one-character RE that matches that character.
\c A backslash followed by any special character is a one-character RE that matches the special character itself.
. A period is a one-character RE that matches any character.
[string] A non-empty string enclosed in square brackets is a one-character RE that matches any *one* character of that string. If the first character is "^" (circumflex), the RE matches any character not in the remaining characters of the string. The "-" (minus), when between two other characters, may be used to indicate a range of consecutive ASCII characters (e.g. [0-9] is equivalent to [0123456789]). Other special characters stand for themselves in a bracketed string.
* Matches zero or more occurrences of the RE to its left.
+ Matches one or more occurrences of the RE to its left.
? Matches zero or one occurrences of the RE to its left.
{N} Matches exactly N occurrences of the RE to its left.
{N,} Matches at least N occurrences of the RE to its left.
{N,M} Matches at least N occurrences but at most M occurrences of the RE to its left.
^ A caret at the beginning of an entire RE constrains that RE to match an initial substring of the subject string.
$ A dollar sign at the end of an entire RE constrains that RE to match a final substring of the subject string.
| Alternation: two REs separated by "|" match either a match for the first or a match for the second.
() A RE enclosed in parentheses matches a match for the regular expression (parenthesized groups are used for grouping, and for accessing the matched string subsequently in the match using the \N expression).
\N Where N is a digit in the range 1-9, matches the same string of characters as was matched by a parenthesized RE to the left in the same RE. The sub-expression specified is that beginning with the Nth occurrence of "(" counting from the left. E.g., ^(.*)\1$ matches a string consisting of two consecutive occurrences of the same string.

The following extensions to UNIX REs, as specified in the Perl programming language, are supported.

\w Matches any alphanumeric (including "_").
\W Matches any non-alphanumeric.
\b Matches only at a word-boundary (word defined as a string of alphanumerics as in \w).
\B Matches only non-word-boundaries.
\s Matches any white-space character.
\S Matches any non-white-space character.
\d Matches any digit [0-9].
\D Matches any non-digit.

Symbols \w, \W, \s, \S, \d, \D can be used within [string] REs.

Here are a few simple regular expressions that you can use for search:

(A|Z)         Words that begin with the letter "A" or "Z"
\w*(a|z)$     Words that end with the letter "a" or "z"
\w*(a|z)      Words that include the letter "a" or "z" (at any position)
\w+-\w+       Words that include a hyphen ("good-looking")
\w+ing$       Words that end with the suffix "ing"
[A-Za-z]+$    Words that consist only of Latin letters
[IVX]+$       Roman numerals (from 1 to 39) or other words made of the letters I, V, X
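Matching from the beginning of a word, with several templates joined by "OR", can be imitated in Python with re.match, which also anchors the pattern at the start of the string. Python's regular expression dialect is close to, but not identical with, the one described above, and the function search_words is a hypothetical name used only for this sketch.

```python
import re


def search_words(words, templates):
    """Return the words matched, from their first character, by any template."""
    compiled = [re.compile(t) for t in templates]
    return [w for w in words if any(rx.match(w) for rx in compiled)]


words = ["Zebra", "apple", "good-looking", "reading", "XIV", "ring"]

ing_words = search_words(words, [r"\w+ing$"])
capital_or_roman = search_words(words, [r"(A|Z)", r"[IVX]+$"])
hyphenated = search_words(words, [r"\w+-\w+"])
print(ing_words, capital_or_roman, hyphenated)
```

Because of the anchoring at the beginning, the template ing on its own does not find "reading"; the leading \w+ is needed to cover the start of the word.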

5.8 Preferences

On pane [Preferences] you can set up additional options:

Preferences

With the list box [Interface language] you can choose Russian or English as the GUI language.

The path in the field [Temp directory] defines the folder in which the output word index and temporary files are created by default. With the button on the left you can choose this folder from those existing on your computer.

The path in the field [External text editor] defines the text editor that will be used for viewing and editing source files.

With the list box [HTML-browser] you can choose a browser available on your computer, which will be used for viewing source HTML texts and the output index.

In the field [Output buffer size] you can define the size, in lines, of the console window in which you see the processing log. This window is cleared and reused if the length of the log exceeds the defined size.

Check [ask confirmations] if you want to see a confirmation dialog before drop/clear operations are executed in the File Tree.

Check [autocheck the program's update] if you want wordTabulator to check for new versions automatically at startup.

5.9 Help

On the pane [Help] you can open the user's guide in the default browser or manually check for program updates on the sourceforge site.

Help

6. Run word processing

Let us go back to the pane [Source Files] and start processing by pressing the button [Start word processing]. All project parameters and the list of source files are then automatically saved to the project file (in our case "c:\tmp\wt$proj.wt"). Afterwards the project file can be used to repeat the processing.

word processing

During processing you can see a progress indicator and a timer. Press the button [Stop] or press Ctrl+C to stop processing at any time.

Finish

When processing finishes, you can see the resulting statistics in the console window. So, we are ready to view the output index.

7. Browse results

To browse the output index, press the button for viewing results, which becomes available once word processing has finished. On the pane [Preferences] you can define the HTML browser used to view the output index: choose the appropriate name in the list [HTML-browser]. It may be Microsoft Internet Explorer, Mozilla Firefox, Google Chrome or Apple Safari, provided that they are installed on your computer. By default the "Internal browser" is used. This browser, embedded in wordTabulator, lets you view the marked context for every element of the index.

Output index

The HTML index contains hyperlinks to the source context for every word. These links are numbered by word frequency in descending order (the number in round brackets). When you click a link in the internal browser, you will see the marked context and can use the navigation menu to move from one marked word to another. For example, if we click on the link 1(3) for the word "Александрову":

Context

To return to the output index, just close the context window. Please note that in some cases the internal browser cannot find a word in its context. For example, the item in the output index is "Александр", but in the source context it appears as "<b>А</b>лександр", with a bold initial letter. Unfortunately, the internal browser is also useless for viewing texts in UTF-8 encoding and XML documents.

8. Post-processing of output index

Post-processing is used for final transformation or analysis of the output word index. For example, you can:

Each post-processing scenario is implemented by a script file executed by the utility xMarkup. The full path to the script is specified in the field [Post-processing of index] on the pane [Output index]. To quickly choose a script, use the button [...] to the right of the field. To open the script in the external editor, use the button [Edit]. Writing xMarkup scripts is a separate task, which is not considered here (for details please see the xMarkup user's guide).

Post-processing script

Examples of scripts for post-processing the output index in HTML or text table format can be found in the folders bin/scripts/html and bin/scripts/text. The chosen script must match the format of the output index: it makes no sense to run a script for a text table on an HTML index, or vice versa.

Post-processing starts automatically each time word processing completes successfully. The picture below illustrates the output of the post-processing script count_word_len.html.par, applied to the output index generated from Alexander Pushkin's verses. This script calculates the distribution of word lengths in the index. As you can see, the most frequent word length is 5 characters (14.25% of cases).

Results of post-processing
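The kind of calculation performed by such a script can be illustrated in Python. This is not the xMarkup script count_word_len.html.par itself, whose source is not shown here; the function length_distribution is a hypothetical stand-in that computes the share of each word length in a list of index words.

```python
from collections import Counter


def length_distribution(words):
    """Return {word length: percentage of words having that length}."""
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: 100.0 * n / total for length, n in sorted(counts.items())}


distribution = length_distribution(["буря", "мглою", "небо", "кроет"])
print(distribution)
```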

The next picture depicts a histogram of character frequencies in Alexander Pushkin's verses. This distribution was calculated with the script count_char_freq.html.par.

Results of post-processing

To switch off post-processing, just clear the field [Post-processing of index].

© Sergey Logichev, 1997-2020