Changes: Remove diacritical signs from characters

m (Suggest new plugin)
 
(7 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{TipProposed
+
{{TipNew
|id=0
+
|id=1675
|previous=0
+
|previous=1674
|next=0
+
|next=1676
 
|created=August 31, 2011
 
|created=August 31, 2011
 
|complexity=basic
 
|complexity=basic
Line 13: Line 13:
 
Characters such as 'á' or 'ç' (with [[wikipedia:diacritic|diacritics]]) may be included in code comments and are successfully processed by many tools. However some tools do not work with these characters.
 
Characters such as 'á' or 'ç' (with [[wikipedia:diacritic|diacritics]]) may be included in code comments and are successfully processed by many tools. However some tools do not work with these characters.
   
Instead of changing the tools, it is common to remove the diacritical signs (e.g. replace <tt>á</tt> with <tt>a</tt>, and <tt>ç</tt> with <tt>c</tt>). This tip provides a script to do the necessary substitutions in a single step.
+
Instead of changing the tools, it is common to remove the diacritical signs (e.g. replace <code>á</code> with <code>a</code>, and <code>ç</code> with <code>c</code>). This tip provides a script to do the necessary substitutions in a single step.
   
 
==Script==
 
==Script==
Create file <tt>~/.vim/plugin/diacritics.vim</tt> (Unix) or <tt>$HOME/vimfiles/plugin/diacritics.vim</tt> (Windows) containing one of the scripts below, then restart Vim. Alternatively, add one of the scripts to your [[vimrc]] and restart Vim.
+
Create file <code>~/.vim/plugin/diacritics.vim</code> (Unix) or <code>$HOME/vimfiles/plugin/diacritics.vim</code> (Windows) containing one of the scripts below, then restart Vim. Alternatively, add one of the scripts to your [[vimrc]] and restart Vim.
   
The following script uses Vim's <tt>tr()</tt> to translate characters with diacritics to characters without. It loads the buffer into a variable in memory and translates all characters in one operation, so it is efficient. However, there is no opportunity to review changes as they occur.
+
The following script uses Vim's <code>tr()</code> to translate characters with diacritics to characters without. It loads the buffer into a variable in memory and translates all characters in one operation, so it is efficient. However, there is no opportunity to review changes as they occur.
 
<pre>
 
<pre>
 
" Remove diacritical signs from characters in specified range of lines.
 
" Remove diacritical signs from characters in specified range of lines.
Line 33: Line 33:
 
</pre>
 
</pre>
   
The following alternative script uses <tt>:s</tt> to search and replace, with an opportunity to confirm each change so it can be reviewed (or press <tt>a</tt> to proceed with all changes). The substitute uses a replacement expression (<tt>\=</tt>) to look up the translation character in a dictionary.
+
The following alternative script uses <code>:s</code> to search and replace, with an opportunity to confirm each change so it can be reviewed (or press <code>a</code> to proceed with all changes). The substitute uses a replacement expression (<code>\=</code>) to look up the translation character in a dictionary.
 
<pre>
 
<pre>
 
" Remove diacritical signs from characters in specified range of lines.
 
" Remove diacritical signs from characters in specified range of lines.
Line 54: Line 54:
 
</pre>
 
</pre>
   
Each alternative script above defines the <tt>:RemoveDiacritics</tt> command, and the command accepts a range which defaults to the whole buffer. Some examples follow (type the first couple of letters of the command then press Tab for command completion, or press the up arrow for command history):
+
Each alternative script above defines the <code>:RemoveDiacritics</code> command, and the command accepts a range which defaults to the whole buffer. Some examples follow (type the first couple of letters of the command then press Tab for command completion, or press the up arrow for command history):
 
<pre>
 
<pre>
 
:RemoveDiacritics " whole buffer
 
:RemoveDiacritics " whole buffer
Line 67: Line 67:
   
 
==Comments==
 
==Comments==
Clever! I'd recommend a :command instead of a mapping for something like this, however. You could even set up the command to give it a range which you could then pass to the substitute command. When doing this, it is possible to default to a range of '%'. See {{help|:command-range}}.
 
   
The separation of a string of characters and their diacritics may run into problems in multibyte encodings. In Vim, strlen and strpart act on BYTE indices, not on CHARACTER indices. A better solution would probably set up lists directly instead of creating a string first, e.g. <pre>let diacChars = ['á','â','ã','à','ç','é','ê','í','ó','ô','õ','ü','ú']</pre>. If this is too awkward to input, you could make a delimited string and use the {{help|prefix=no|split()}} function, e.g. <pre>let diacChars=split("á â ã à ç é ê í ó ô õ ü ú")</pre>
+
Sorry for being away for so long.
--[[User:Fritzophrenic|Fritzophrenic]] 20:00, August 31, 2011 (UTC)
 
:: You could use something like this <pre>let diacChars="áâãàçéêíóôõüú"</pre>and make a list out of it: <pre>let diacCharsList=split(diacChars, '\zs')</pre>[[User:Chrisbra|Chrisbra]] 11:51, September 4, 2011 (UTC)
 
   
It would be so much easier to just loop from the start line to the end line and just setline() with tr( getline( lineNumber ), diacChars, replacementChars ). --December 6, 2011
+
Fritzophrenic, Chrisbra, JohnBot and JohnBeckett: thank you very much for improving the script and the tip!
   
===Explanations===
+
@Fritzophrenic, thanks for suggesting the command; I have started using this approach for similar tasks. I wasn't aware of the problems with multibyte encodings, but it is certainly better to make the script robust. In my opinion it is easier to maintain the diacritic characters and their replacements in a single string (it is easier to type, and the line is shorter, so it is less likely that someone adds a new diacritic character and forgets to add its replacement), so I agree with Chrisbra's approach: <pre>let diacCharsList=split(diacChars, '\zs')</pre>
I've finally got around to examining this tip, which is most useful and interesting, thanks! I have replaced the script and added another, and am writing this to explain my changes. I will remove all comments fairly soon when assigning this a tip number, but the text will be available in this history if the author returns (last edit was in August 2011).
 
   
It was good of the author to credit [http://stackoverflow.com/questions/765894/can-i-substitute-multiple-items-in-a-single-regular-expression-in-vim-or-perl Stackoverflow] where there is an example of using a dictionary to do a multiple substitute (but not related to diacritics), however we don't need a permanently displayed record of that so I have removed it.
+
I didn't understand the changes to the References section, but that is probably because I'm new to wiki editing. I'd be glad if someone can explain that to me :)
   
As Ben suggests above, the original script has a bug in that it uses <tt>strpart()</tt> (which operates on ''bytes'') in an attempt to index characters. I guess it must have worked on the author's system, but it completely fails on mine (where I use {{tt|1=set encoding=utf-8}}). Trying to use the old script gives "E716: Key not present in Dictionary: á" and investigation shows that the dictionary contains junk keys and values because it uses bytes taken from random places in the string.
+
Here are the original intentions:
   
Also the anon's comment above about using <tt>tr()</tt> gives a much more efficient solution, although it just does the replacements with no chance to confirm what one is doing (which you probably should do when using a big substitute on a program, unless you diff afterwards). I made a big (and very artificial) test file and on a slowish system, the substitute version took around 1 or 2 seconds to do 52,000 substitutions on 2000 lines (very acceptable, and amazingly fast really). The <tt>tr()</tt> version is about instantaneous.
+
1) http://en.wikipedia.org/wiki/Diacritic
  +
As English is not my first language, it took me some time to find the right keyword when I was searching for a way to remove the diacritical signs. When I posted the tip, I thought that if it had already existed and I had found it, my first reaction would have been "what is diacritical??", and my first guess would probably have been that it is Vim parlance.
   
I used a command (which accepts a range) for both versions as that is a lot easier than yet another mapping, and the command is available with tab completion and in command history. [[User:JohnBeckett|JohnBeckett]] 02:02, April 6, 2012 (UTC)
+
2) http://stackoverflow.com/questions/765894/can-i-substitute-multiple-items-in-a-single-regular-expression-in-vim-or-perl
  +
After I gave up searching for an existing way of replacing the diacritic characters and decided to create the script, I spent some time thinking about how to write it. I had the idea of performing all the changes with a single <code>:s</code>, but was unable to figure out how. I therefore copied the technique from another person, and referencing that page was intended to give him credit for the implementation.
  +
I thought that it could also be useful for understanding the implementation. The comment line <pre>"exe ":%s/[ãáâ]/\={'ã':'a','á':'a','â':'a'}[submatch(0)]/gIc"</pre> was also meant to explain the unusual line to anyone attempting to change or improve it.
   
===Title===
+
Thanks
What are we going to do about the current title:
+
-- marcmontu - 12:37, Friday, July 13, 2012 (UTC)
:Remove Diacritic signs/marks from characters (replace to regular character)
+
:We remove comments after a decent amount of time has passed so readers can see useful material more clearly. My April 6 comments can be seen by following the history tab that used to be visible at the top of each page (thanks Wikia!), or more simply [http://vim.wikia.com/wiki/Remove_diacritical_signs_from_characters?oldid=33051 here]. I put the Wikipedia link in the first line of the tip: "(with [[wikipedia:diacritic|diacritics]])". There's nothing particularly helpful on the Stackoverflow page, and whereas you might have got an idea from there, it's pretty hard to tell where the idea originally came from. We can't give credits (although it is in the history) because just about every sentence of a useful tip uses ideas gleaned from reading the Vim mailing lists, or other tips, or other websites. Likewise, other places with Vim info can't give proper credits either. You might like to copy the "Read more" section from below to your user page ([[User:Marcmontu]]) because it's a bit out of place here. [[User:JohnBeckett|JohnBeckett]] ([[User talk:JohnBeckett|talk]]) 03:39, July 14, 2012 (UTC)
How about:
+
::OMG the "read more" section is something added by Wikia, and has existed since 2010! Perhaps we had it disabled here and something has enabled it? Or a recent upgrade of something has changed the way it looks, so I wasn't aware that it is an automated nuisance from Wikia? For my curiosity, I'm adding a magic word that may or may not remove it from this page. [[User:JohnBeckett|JohnBeckett]] ([[User talk:JohnBeckett|talk]]) 09:04, July 19, 2012 (UTC)
:Remove diacritical signs from characters
+
__NORELATEDARTICLES__
and make a redirect:
+
:Replace diacritics with regular characters
+
If you have Linux and glibc, can you please try my [https://github.com/kubahorak/diacritic plugin Diacritic]? It's more universal, because it uses <code>iconv</code> command to transliterate.
[[User:JohnBeckett|JohnBeckett]] 02:02, April 6, 2012 (UTC)
+
--[[User:Haisaul|Haisaul]] ([[User talk:Haisaul|talk]]) 14:35, September 24, 2014 (UTC)

Latest revision as of 14:35, September 24, 2014

Tip 1675

created August 31, 2011 · complexity basic · author Marcmontu · version 7.0


Characters such as 'á' or 'ç' (with diacritics) may be included in code comments and are successfully processed by many tools. However some tools do not work with these characters.

Instead of changing the tools, it is common to remove the diacritical signs (e.g. replace á with a, and ç with c). This tip provides a script to do the necessary substitutions in a single step.

Script

Create file ~/.vim/plugin/diacritics.vim (Unix) or $HOME/vimfiles/plugin/diacritics.vim (Windows) containing one of the scripts below, then restart Vim. Alternatively, add one of the scripts to your vimrc and restart Vim.

The following script uses Vim's tr() to translate characters with diacritics to characters without. It joins the lines of the range into a single string in memory and translates all characters in one operation, so it is efficient. However, there is no opportunity to review changes as they occur.

" Remove diacritical signs from characters in specified range of lines.
" Examples of characters replaced: á -> a, ç -> c, Á -> A, Ç -> C.
function! s:RemoveDiacritics(line1, line2)
  let diacs = 'áâãàçéêíóôõüú'  " lowercase diacritical signs
  let repls = 'aaaaceeiooouu'  " corresponding replacements
  let diacs .= toupper(diacs)
  let repls .= toupper(repls)
  let all = join(getline(a:line1, a:line2), "\n")
  " Pass keepempty=1 to split() so an empty first or last line is not dropped.
  call setline(a:line1, split(tr(all, diacs, repls), "\n", 1))
endfunction
command! -range=% RemoveDiacritics call s:RemoveDiacritics(<line1>, <line2>)
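
The translation step can be tried on a plain string before running the command on a buffer. The following is only a sketch: StripDiacritics() is a hypothetical helper name that does not appear in the tip, and it assumes that 'encoding' is utf-8 and that the script file itself is saved as UTF-8.

```vim
scriptencoding utf-8
" Hypothetical helper (not part of the tip): translate a single string.
" Assumes 'encoding' is utf-8 so tr() and toupper() work on whole characters.
function! StripDiacritics(text)
  let diacs = 'áâãàçéêíóôõüú'  " lowercase characters with diacritics
  let repls = 'aaaaceeiooouu'  " corresponding replacements, same length
  let diacs .= toupper(diacs)
  let repls .= toupper(repls)
  return tr(a:text, diacs, repls)
endfunction

" Echoes: Acao numero tres
echo StripDiacritics('Ação número três')
```

Because tr() maps characters by position, the two strings must contain the same number of characters; adding a character to one without adding its replacement to the other makes tr() fail.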

The following alternative script uses :s to search and replace, with an opportunity to confirm each change so it can be reviewed (or press a to proceed with all changes). The substitute uses a replacement expression (\=) to look up the translation character in a dictionary.

" Remove diacritical signs from characters in specified range of lines.
" Examples of characters replaced: á -> a, ç -> c, Á -> A, Ç -> C.
" Uses substitute so changes can be confirmed.
function! s:RemoveDiacritics(line1, line2)
  let diacs = 'áâãàçéêíóôõüú'  " lowercase diacritical signs
  let repls = 'aaaaceeiooouu'  " corresponding replacements
  let diacs .= toupper(diacs)
  let repls .= toupper(repls)
  let diaclist = split(diacs, '\zs')
  let repllist = split(repls, '\zs')
  let trans = {}
  for i in range(len(diaclist))
    let trans[diaclist[i]] = repllist[i]
  endfor
  execute a:line1.','.a:line2 . 's/['.diacs.']/\=trans[submatch(0)]/gIce'
endfunction
command! -range=% RemoveDiacritics call s:RemoveDiacritics(<line1>, <line2>)
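
The script above builds its dictionary from split(diacs, '\zs') because byte-oriented functions such as strlen() do not count characters in a multibyte encoding. The sketch below illustrates the difference and applies the same dictionary lookup with the substitute() function instead of :s. TranslateWithTable() is a hypothetical name used only for illustration, and the example assumes 'encoding' is utf-8.

```vim
scriptencoding utf-8
" Illustrative sketch; TranslateWithTable() is a hypothetical name, not part of the tip.
" Assumes 'encoding' is utf-8.
function! TranslateWithTable(text, diacs, repls)
  " One list item per character, however many bytes each character uses.
  let diaclist = split(a:diacs, '\zs')
  let repllist = split(a:repls, '\zs')
  let trans = {}
  for i in range(len(diaclist))
    let trans[diaclist[i]] = repllist[i]
  endfor
  " \C forces case-sensitive matching, like the I flag of :s.
  return substitute(a:text, '\C[' . a:diacs . ']', '\=trans[submatch(0)]', 'g')
endfunction

" strlen() counts bytes: 4 under utf-8.
echo strlen('áç')
" strchars() counts characters: 2.
echo strchars('áç')
" Echoes: maca
echo TranslateWithTable('maçã', 'çã', 'ca')
```

Unlike the command in the tip, this sketch never prompts for confirmation; it only shows how the replacement expression looks up each matched character.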

Each alternative script above defines the :RemoveDiacritics command, and the command accepts a range which defaults to the whole buffer. Some examples follow (type the first couple of letters of the command then press Tab for command completion, or press the up arrow for command history):

:RemoveDiacritics               " whole buffer
:.RemoveDiacritics              " current line
:'<,'>RemoveDiacritics          " last selected range of lines

See ranges for more information.

References

Comments

Sorry for being away for so long.

Fritzophrenic, Chrisbra, JohnBot and JohnBeckett: thank you very much for improving the script and the tip!

@Fritzophrenic, thanks for suggesting the command; I have started using this approach for similar tasks. I wasn't aware of the problems with multibyte encodings, but it is certainly better to make the script robust. In my opinion it is easier to maintain the diacritic characters and their replacements in a single string (it is easier to type, and the line is shorter, so it is less likely that someone adds a new diacritic character and forgets to add its replacement), so I agree with Chrisbra's approach:
let diacCharsList=split(diacChars, '\zs')

I didn't understand the changes to the References section, but that is probably because I'm new to wiki editing. I'd be glad if someone can explain that to me :)

Here are the original intentions:

1) http://en.wikipedia.org/wiki/Diacritic As English is not my first language, it took me some time to find the right keyword when I was searching for a way to remove the diacritical signs. When I posted the tip, I thought that if it had already existed and I had found it, my first reaction would have been "what is diacritical??", and my first guess would probably have been that it is Vim parlance.

2) http://stackoverflow.com/questions/765894/can-i-substitute-multiple-items-in-a-single-regular-expression-in-vim-or-perl After I gave up searching for an existing way of replacing the diacritic characters and decided to create the script, I spent some time thinking about how to write it. I had the idea of performing all the changes with a single :s, but was unable to figure out how. I therefore copied the technique from another person, and referencing that page was intended to give him credit for the implementation.

I thought that it could also be useful for understanding the implementation. The comment line
"exe ":%s/[ãáâ]/\={'ã':'a','á':'a','â':'a'}[submatch(0)]/gIc"
was also meant to explain the unusual line to anyone attempting to change or improve it.

Thanks -- marcmontu - 12:37, Friday, July 13, 2012 (UTC)

We remove comments after a decent amount of time has passed so readers can see useful material more clearly. My April 6 comments can be seen by following the history tab that used to be visible at the top of each page (thanks Wikia!), or more simply here. I put the Wikipedia link in the first line of the tip: "(with diacritics)". There's nothing particularly helpful on the Stackoverflow page, and whereas you might have got an idea from there, it's pretty hard to tell where the idea originally came from. We can't give credits (although it is in the history) because just about every sentence of a useful tip uses ideas gleaned from reading the Vim mailing lists, or other tips, or other websites. Likewise, other places with Vim info can't give proper credits either. You might like to copy the "Read more" section from below to your user page (User:Marcmontu) because it's a bit out of place here. JohnBeckett (talk) 03:39, July 14, 2012 (UTC)
OMG the "read more" section is something added by Wikia, and has existed since 2010! Perhaps we had it disabled here and something has enabled it? Or a recent upgrade of something has changed the way it looks, so I wasn't aware that it is an automated nuisance from Wikia? For my curiosity, I'm adding a magic word that may or may not remove it from this page. JohnBeckett (talk) 09:04, July 19, 2012 (UTC)


If you have Linux and glibc, can you please try my plugin Diacritic? It's more universal, because it uses the iconv command to transliterate. --Haisaul (talk) 14:35, September 24, 2014 (UTC)
