Revision as of 06:30, 13 July 2012
created 2007 · complexity basic · author vale.smth · version 7.0
This tip shows how to generate a table of occurrence frequencies for every word in the current buffer, or in selected text. For example, the results may include the following, which shows that "action" was used 12 times, "agree" 7 times, and so on:
<pre>
action 12
agree 7
and 26
</pre>
Using a dictionary
Procedure:
- In Vim, copy the code shown below, then enter <code>:@"</code> to execute it (or put the code in your vimrc).
- Press <code>V</code> then move the cursor to select the lines whose words you want to count.
- Alternatively, select no lines, in which case all words in the buffer will be counted.
- Type <code>:WordFrequency</code> and press Enter.
A new window will open with a scratch buffer showing the word frequencies.
<pre>
function! WordFrequency() range
  " Split the selected lines into words; any run of non-letters is a separator.
  let all = split(join(getline(a:firstline, a:lastline)), '\A\+')
  " Count each word (case-sensitive).
  let frequencies = {}
  for word in all
    let frequencies[word] = get(frequencies, word, 0) + 1
  endfor
  " Show the results in a new scratch buffer, one word<Tab>count pair per line.
  new
  setlocal buftype=nofile bufhidden=hide noswapfile tabstop=20
  for [key,value] in items(frequencies)
    call append('$', key."\t".value)
  endfor
  sort i
endfunction
command! -range=% WordFrequency <line1>,<line2>call WordFrequency()
</pre>
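The counting loop can also be tried on its own, without opening a scratch window. The following is a minimal sketch rather than part of the original tip: <code>CountWords</code> is a hypothetical helper name, and its <code>ignore_case</code> argument (which folds words to lower case with <code>tolower()</code>) is an assumption about what some users may want, since the function above counts "The" and "the" as different words.

```vim
" Hypothetical helper (not part of the original tip): count the words in a
" list of lines and return a dictionary mapping each word to its frequency.
function! CountWords(lines, ignore_case) abort
  let counts = {}
  " Same splitting rule as WordFrequency(): a run of non-letters separates words.
  for word in split(join(a:lines), '\A\+')
    " Assumption: fold case only when the caller asks for it.
    let key = a:ignore_case ? tolower(word) : word
    let counts[key] = get(counts, key, 0) + 1
  endfor
  return counts
endfunction

" Small sample: 'the' appears twice in lower case and once capitalized.
let sample = ['the cat and the hat', 'The end']
" Case-sensitive count of 'the': 2
echo CountWords(sample, 0)['the']
" Case-insensitive count of 'the': 3
echo CountWords(sample, 1)['the']
```

Returning a dictionary instead of writing to a buffer makes it easy to inspect individual counts with <code>:echo</code> before deciding how to display them.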
Using commands
The following alternative demonstrates the amazing power of Ex commands. This process replaces the current buffer with a word frequency table, so you should be working on a copy of your text.
Enter the following commands. In the third line, the "<code>^A</code>" represents Ctrl-A, which needs to be entered by pressing Ctrl-V then Ctrl-A (if you use Ctrl-V for paste, press Ctrl-Q then Ctrl-A):
<pre>
:%s/\_A\+/\t1\r/g
:sort i
:g/^\c\(.\+\)\n\1$/norm! $yiwj@"^Akdd
</pre>
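The first command replaces every sequence of non-alphabetic characters (including newlines, <code>\_A\+</code>) with "<code>\t1\r</code>" (a tab character, the digit <code>1</code>, and a line break). Because <code>\A</code> matches anything that is not an ASCII letter, digits and punctuation also act as separators. The result leaves only words, each followed by a count of 1, with a single word per line.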
The second command sorts the lines, ignoring case. The next step combines all lines containing the same word ignoring case.
The third command uses <code>:g///</code> to flag each line which is followed by another line containing the same text (<code>\1</code>), ignoring case (<code>\c</code>). The given normal-mode command is then executed on each flagged line: <code>$</code> moves to end-of-line; <code>yiw</code> copies the inner word (the count); <code>j</code> goes down one line; <code>@"</code> effectively types the contents of the unnamed register (the count copied from the previous line), and that value is a repeat count for the Ctrl-A, which increments the number on the current line that many times to accumulate the total; <code>k</code> goes up one line; <code>dd</code> deletes that line.
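The merge performed by the third command can also be written as a function over a list, which may be easier to follow than the normal-mode keystrokes. This is a sketch under stated assumptions, not the tip's method: <code>MergeCounts</code> is a hypothetical name, and the input is assumed to be already sorted ignoring case, with every item in the word, Tab, count form produced by the first two commands (no blank lines).

```vim
" Hypothetical helper (not part of the original tip). Input: a list of
" 'word<Tab>count' strings, sorted so that equal words are adjacent.
function! MergeCounts(lines) abort
  let merged = []
  for entry in a:lines
    let [word, n] = split(entry, "\t")
    if !empty(merged) && merged[-1][0] ==? word
      " Same word as the previous entry (ignoring case): add the counts and
      " keep the later spelling, as the Ex command keeps the later line.
      let merged[-1] = [word, merged[-1][1] + str2nr(n)]
    else
      call add(merged, [word, str2nr(n)])
    endif
  endfor
  " Rebuild the 'word<Tab>count' lines.
  return map(merged, 'v:val[0] . "\t" . v:val[1]')
endfunction

" Returns two items: 'and' with count 1 and 'the' with count 3.
echo MergeCounts(["and\t1", "The\t1", "the\t1", "the\t1"])
```

As with the <code>:g</code> command, each pair of adjacent equal words is folded into the later entry, so a run of identical words accumulates into a single total.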
Possible modifications:
- Use a more general substitute pattern for non-English texts: <code>:%s/\%(\K\@!\_.\)\+/\t1\r/g</code>
  This will keep all <code>'iskeyword'</code> characters (except digits) instead of only letters.
- At the end, add a command to sort by the counts: <code>:sort! n /\t/</code>
  Here <code>!</code> reverses the sort order, <code>n</code> sorts numerically, and <code>/\t/</code> makes the sort look only at the text to the right of the first Tab character.
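For the dictionary approach, a comparable ordering by count can be sketched with <code>sort()</code> and a comparison function. The names <code>ByCountDesc</code> and <code>SortedByCount</code> are hypothetical, and the alphabetical tie-break is an assumption added for a predictable result; neither is part of the original tip.

```vim
" Hypothetical comparison function: order [word, count] pairs by descending
" count, breaking ties alphabetically (ignoring case).
function! ByCountDesc(x, y)
  if a:x[1] != a:y[1]
    return a:y[1] - a:x[1]
  endif
  if a:x[0] ==? a:y[0]
    return 0
  endif
  return a:x[0] >? a:y[0] ? 1 : -1
endfunction

" Hypothetical helper: turn a frequencies dictionary into sorted
" 'word<Tab>count' lines, highest count first.
function! SortedByCount(frequencies)
  let pairs = sort(items(a:frequencies), 'ByCountDesc')
  return map(pairs, 'v:val[0] . "\t" . v:val[1]')
endfunction

" Order of the result: and (26), action (12), agree (7).
echo SortedByCount({'action': 12, 'agree': 7, 'and': 26})
```

The returned list could be passed to <code>append()</code> in place of the loop over <code>items(frequencies)</code>, in which case the final <code>sort i</code> would be dropped so the count order is preserved.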