Vim Tips Wiki

Changes: Remove diacritical signs from characters

(Fix script and add another, add explanation, what should title be?)

Revision as of 04:36, April 7, 2012

Proposed tip Please edit this page to improve it, or add your comments below (do not use the discussion page).

Please use new tips to discuss whether this page should be a permanent tip, or whether it should be merged to an existing tip.
created August 31, 2011 · complexity basic · author Marcmontu · version 7.0

Characters such as 'á' or 'ç' (with diacritics) may be included in code comments and are successfully processed by many tools. However, some tools cannot handle these characters.

Instead of changing the tools, it is common to remove the diacritical signs (e.g. replace á with a, and ç with c). This tip provides a script to do the necessary substitutions in a single step.

Script

Create file ~/.vim/plugin/diacritics.vim (Unix) or $HOME/vimfiles/plugin/diacritics.vim (Windows) containing one of the scripts below, then restart Vim. Alternatively, add one of the scripts to your vimrc and restart Vim.

The following script uses Vim's tr() to translate characters with diacritics to characters without. It joins the lines of the range into a single string in memory and translates all characters in one operation, so it is efficient. However, there is no opportunity to review changes as they occur.

" Remove diacritical signs from characters in specified range of lines.
" Examples of characters replaced: á -> a, ç -> c, Á -> A, Ç -> C.
function! s:RemoveDiacritics(line1, line2)
  let diacs = 'áâãàçéêíóôõüú'  " lowercase diacritical signs
  let repls = 'aaaaceeiooouu'  " corresponding replacements
  let diacs .= toupper(diacs)
  let repls .= toupper(repls)
  let all = join(getline(a:line1, a:line2), "\n")
  " keepempty (third argument) keeps blank first/last lines so lines stay aligned
  call setline(a:line1, split(tr(all, diacs, repls), "\n", 1))
endfunction
command! -range=% RemoveDiacritics call s:RemoveDiacritics(<line1>, <line2>)
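
To see what tr() does on its own, here is a minimal sketch that translates a single string instead of buffer lines. It assumes 'encoding' is utf-8 and that the accented letters are stored precomposed; the sample word is only an illustration.

```vim
" tr(src, from, to) replaces each character of src that occurs in 'from'
" with the character at the same position in 'to'.
" 'from' and 'to' must contain the same number of characters.
echo tr('ação', 'çã', 'ca')
" acao
```

Characters that are not listed in the second argument (here 'a' and 'o') pass through unchanged, which is why the script can safely run tr() over whole lines of code.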

The following alternative script uses :s to search and replace, with an opportunity to confirm each change so it can be reviewed (or press a to proceed with all changes). The substitute uses a replacement expression (\=) to look up the translation character in a dictionary.

" Remove diacritical signs from characters in specified range of lines.
" Examples of characters replaced: á -> a, ç -> c, Á -> A, Ç -> C.
" Uses substitute so changes can be confirmed.
function! s:RemoveDiacritics(line1, line2)
  let diacs = 'áâãàçéêíóôõüú'  " lowercase diacritical signs
  let repls = 'aaaaceeiooouu'  " corresponding replacements
  let diacs .= toupper(diacs)
  let repls .= toupper(repls)
  let diaclist = split(diacs, '\zs')
  let repllist = split(repls, '\zs')
  let trans = {}
  for i in range(len(diaclist))
    let trans[diaclist[i]] = repllist[i]
  endfor
  execute a:line1.','.a:line2 . 's/['.diacs.']/\=trans[submatch(0)]/gIce'
endfunction
command! -range=% RemoveDiacritics call s:RemoveDiacritics(<line1>, <line2>)
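
The dictionary lookup in the replacement expression can be tried in isolation with the substitute() function, which accepts the same \= form as :s. This is only a sketch: g:demo_trans is a hypothetical two-entry table, and 'encoding' is assumed to be utf-8.

```vim
" Hypothetical translation table, for illustration only.
let g:demo_trans = {'ç': 'c', 'ã': 'a'}

" For every match of the character class, \= evaluates the expression that
" follows it; submatch(0) is the matched text, used here as the dictionary key.
echo substitute('maçã', '[çã]', '\=g:demo_trans[submatch(0)]', 'g')
" maca
```

In the script above, the flags gIce have the following meanings: g replaces every match on a line, I keeps the match case-sensitive regardless of 'ignorecase', c asks for confirmation, and e suppresses the error when nothing matches.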

Each script above defines the :RemoveDiacritics command, which accepts a range that defaults to the whole buffer. Some examples follow (type the first few letters of the command then press Tab for command completion, or press the up arrow for command history). The text after each " is an annotation only; because the command is defined without -bar, it should not be typed as part of the command:

:RemoveDiacritics               " whole buffer
:.RemoveDiacritics              " current line
:'<,'>RemoveDiacritics          " last selected range of lines

See ranges for more information.
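
A few more range forms that should work with either script, shown as a sketch. The line numbers and mark names are arbitrary examples, and the comments are placed on their own lines so that each command can be typed exactly as shown.

```vim
" Lines 10 to 20 inclusive.
:10,20RemoveDiacritics

" From the current line to the end of the buffer.
:.,$RemoveDiacritics

" Between marks a and b (set beforehand with ma and mb).
:'a,'bRemoveDiacritics
```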

Comments

Clever! I'd recommend a :command instead of a mapping for something like this, however. You could even set up the command to give it a range which you could then pass to the substitute command. When doing this, it is possible to default to a range of '%'. See :help :command-range.

The separation of a string of characters and their diacritics may run into problems in multibyte encodings. In Vim, strlen and strpart act on BYTE indices, not on CHARACTER indices. A better solution would probably set up lists directly instead of creating a string first, e.g.
let diacChars = ['á','â','ã','à','ç','é','ê','í','ó','ô','õ','ü','ú']
If this is too awkward to input, you could make a delimited string and use the split() function, e.g.
let diacChars = split("á â ã à ç é ê í ó ô õ ü ú")

--Fritzophrenic 20:00, August 31, 2011 (UTC)
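
The byte versus character difference described in this comment can be checked directly. The following is a small sketch, assuming 'encoding' is utf-8 and that 'á' is stored as the single precomposed character U+00E1.

```vim
" strlen() counts bytes: 'á' occupies two bytes in UTF-8.
echo strlen('á')
" 2

" Splitting on '\zs' yields one list item per character, so len() of the
" resulting list counts characters instead of bytes.
echo len(split('á', '\zs'))
" 1
```

This is why indexing the string with strpart() can land in the middle of a character, while a list built with split() or written out directly cannot.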

You could use something like this:
let diacChars = 'áâãàçéêíóôõüú'
and make a list out of it:
let diacCharsList = split(diacChars, '\zs')
Chrisbra 11:51, September 4, 2011 (UTC)

It would be so much easier to just loop from the start line to the end line and just setline() with tr( getline( lineNumber ), diacChars, replacementChars ). --December 6, 2011
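
The line-by-line approach suggested in this comment might look like the sketch below. The function and command names (RemoveDiacriticsByLine) are hypothetical, chosen so that they do not clash with the scripts in the tip; the character lists are copied from those scripts.

```vim
" Translate each line of the range separately with tr().
function! s:RemoveDiacriticsByLine(line1, line2)
  let diacs = 'áâãàçéêíóôõüú'  " lowercase characters with diacritics
  let repls = 'aaaaceeiooouu'  " corresponding replacements
  let diacs .= toupper(diacs)
  let repls .= toupper(repls)
  for lnum in range(a:line1, a:line2)
    call setline(lnum, tr(getline(lnum), diacs, repls))
  endfor
endfunction
command! -range=% RemoveDiacriticsByLine call s:RemoveDiacriticsByLine(<line1>, <line2>)
```

Because every line is read and written on its own, no joining or splitting is needed, and blank lines need no special handling.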

Explanations

I've finally got around to examining this tip, which is most useful and interesting, thanks! I have replaced the script and added another, and am writing this to explain my changes. I will remove all comments fairly soon when assigning this a tip number, but the text will be available in this history if the author returns (last edit was in August 2011).

It was good of the author to credit Stack Overflow, where there is an example of using a dictionary to do a multiple substitute (not related to diacritics). However, we don't need a permanently displayed record of that, so I have removed it.

As Ben suggests above, the original script has a bug in that it uses strpart() (which operates on bytes) in an attempt to index characters. I guess it must have worked on the author's system, but it completely fails on mine (where I use set encoding=utf-8). Trying to use the old script gives "E716: Key not present in Dictionary: á" and investigation shows that the dictionary contains junk keys and values because it uses bytes taken from random places in the string.

Also, the anon's comment above about using tr() gives a much more efficient solution, although it just does the replacements with no chance to confirm what one is doing (which you probably should do when using a big substitute on a program, unless you diff afterwards). I made a big (and very artificial) test file, and on a slowish system the substitute version took around 1 or 2 seconds to do 52,000 substitutions on 2000 lines (very acceptable, and amazingly fast really). The tr() version is nearly instantaneous.
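
For anyone who wants to repeat a comparison like this, one possible way to time a command is sketched below. It is not the method used for the figures above; it assumes a Vim built with the +reltime feature, and g:demo_start is a hypothetical variable name.

```vim
" Record the start time, run the command on the whole buffer,
" then report the elapsed time in seconds.
let g:demo_start = reltime()
RemoveDiacritics
echo 'Elapsed seconds: ' . reltimestr(reltime(g:demo_start))
```

With the substitute version, the measurement includes the time spent answering the confirmation prompt, so answer a at the first prompt (or drop the c flag temporarily) to get a comparable figure.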

I used a command (which accepts a range) for both versions as that is a lot easier than yet another mapping, and the command is available with tab completion and in command history. JohnBeckett 02:02, April 6, 2012 (UTC)

Title

What are we going to do about the current title:

Remove Diacritic signs/marks from characters (replace to regular character)

How about:

Remove diacritical signs from characters

and make a redirect:

Replace diacritics with regular characters

JohnBeckett 02:02, April 6, 2012 (UTC)
