Vim Tips Wiki
Revision as of 08:20, 17 April 2008

Tip 246

created May 10, 2002 · complexity basic · author Tony Mechelynck · version 6.0


What to do

  • The following is an example. Modify it to suit your work environment.
if has("multi_byte")
  if &termencoding == ""
    let &termencoding = &encoding
  endif
  set encoding=utf-8
  setglobal fileencoding=utf-8 bomb
  set fileencodings=ucs-bom,utf-8,latin1
endif

What the above does

  • has("multi_byte") checks if you have the right options compiled-in. If you haven't got what it takes, it's no use trying to use Unicode.
  • 'termencoding' defines the encoding of your terminal: how your keyboard encodes what you type and, in console Vim, what the display understands. Here we save the value corresponding to your locale before changing 'encoding' (see below).
  • 'encoding' sets how Vim represents characters internally. Setting it to utf-8 lets Vim represent any Unicode character.
  • 'fileencoding' sets the encoding for a particular file (local to buffer); :setglobal sets the default value. An empty value can also be used: it defaults to the same as 'encoding'. Or you may want to set one of the ucs encodings, which might make the same disk file bigger or smaller depending on your particular mix of characters. Note that utf-8 is a sequence of single bytes and so has only one byte order, while the ucs encodings can be big-endian or little-endian (high byte first or low byte first), so if you use one of them you will probably need to set 'bomb' (see below).
  • 'bomb' (boolean): if set, vim will put a "byte order mark" (or BOM for short) at the start of Unicode files. This option is irrelevant for non-Unicode files (iso-8859, etc.). This BOM is the codepoint U+FEFF, which is represented on disk as follows:
    • UTF-8: EF BB BF
    • UTF-16be: FE FF
    • UTF-16le: FF FE
    • UTF-32be: 00 00 FE FF
    • UTF-32le: FF FE 00 00
That is, the BOM allows an easy determination of which Unicode encoding and which endianness are being used (assuming that a file in UTF-16le won't start with a NUL character, since FF FE 00 00 would otherwise be indistinguishable from the UTF-32le BOM).
  • 'fileencodings' lists the encodings Vim tries, in order, to set 'fileencoding' (local to buffer) when reading an existing file. The first one that matches will be used. Ucs-bom is "ucs with byte-order-mark"; it must come before utf-8 in the list, otherwise the BOM of a UTF-8 file is not recognized.
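
To check what the settings above did for a given file, a few Ex commands can help. This is only a sketch, meant to be typed one command at a time; latin1 and utf-16le are example values, not recommendations, and the re-read will refuse to run if the buffer has unsaved changes.

```vim
" Show which encoding Vim settled on for the current buffer,
" and whether it will write a BOM
setlocal fileencoding? bomb?

" Re-read the current file, forcing an encoding instead of
" relying on 'fileencodings' (latin1 is just an example)
edit ++enc=latin1

" Convert on the next write: UTF-16 little-endian with a BOM
setlocal fileencoding=utf-16le bomb
write
```

After the last write, the file on disk should begin with the bytes FF FE, as in the table above.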

Additional remarks

  • In "replace" mode, one Unicode character (one or more data bytes) replaces one Unicode character (which need not use the same number of bytes).
  • In "normal" mode, ga shows the character under the cursor as text, decimal, octal and hex; g8 shows which byte or bytes are used to represent it.
  • In "insert" or "replace" mode,
    • any character defined on your keyboard can be entered the usual way (even with dead keys if you have them, e.g. French circumflex, German umlaut, etc.);
    • any character which has a "digraph" (there are a huge lot of them, see :dig after setting enc=utf-8) can be entered with a Ctrl-K prefix;
    • any Unicode character at all can be entered by its hexadecimal codepoint with a Ctrl-V prefix, either <Ctrl-V> u aaaa or <Ctrl-V> U bbbbbbbb, with 0 <= aaaa <= FFFF, or 0 <= bbbbbbbb <= 7FFFFFFF.
    • If you have sourced mswin.vim (which I don't recommend) then <Ctrl-V> has been remapped to the "paste" operation; in this case you need to use <Ctrl-Q> instead.
  • Unicode can be used to create html "body text", at least for Gecko browsers (Netscape 6+, Firefox, SeaMonkey, ...) and probably for IE; but on my machine it doesn't display properly as "title text" (i.e., between <title></title> tags in the <head> part).
  • Gvim will display it properly if you have the fonts for it, provided that you set 'guifont' to some fixed-width font which has the glyphs you want to use (Courier New is OK for French, German, Greek, Russian and more, but I'm not sure about Hebrew or Arabic; its glyphs are of a more "fixed" width than those of, e.g. Lucida Console: the latter can be awkward if you need bold Cyrillic writing).
  • Currently, gvim displays any Unicode codepoint above U+FFFF as a question mark (double-width for CJK); this is a known bug that is being worked on.
    • Edit: This is fixed by patch 7.1.116 dated 2007 Sep 17 20:39. If your version includes this patch and still displays all characters above U+FFFF as question marks, please post in the vim_multibyte group at Google Groups -- Thanks in advance. Tonymec 03:29, 18 September 2007 (UTC)
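
As an illustration of the remarks above, here is a minimal sketch of a function that echoes the codepoint of the character under the cursor in U+XXXX form. The built-in ga command already shows the same information; the function name ShowCodepoint and the <Leader>cp mapping are made up for this example, and it assumes Vim 7 or later (for printf()) with 'encoding' set to utf-8.

```vim
" Hypothetical helper: echo the codepoint of the character under the cursor.
" Assumes 'encoding' is utf-8, so that char2nr() returns a Unicode codepoint.
function! ShowCodepoint() abort
  " \%Nc matches at byte column N; the '.' then takes one whole character,
  " however many bytes it occupies
  let l:char = matchstr(getline('.'), '\%' . col('.') . 'c.')
  if l:char ==# ''
    echo 'no character under cursor'
  else
    echo printf('U+%04X', char2nr(l:char))
  endif
endfunction

" Hypothetical mapping; pick any key sequence you do not already use
nnoremap <Leader>cp :call ShowCodepoint()<CR>
```

The resulting hex digits are what you would type after <Ctrl-V> u (or <Ctrl-V> U for codepoints above FFFF) to enter the same character.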

See also

  • :help :dig
  • :help i_CTRL-V_digit
  • Entering special characters

Comments

On a Windows XP system I had trouble making vim enter Greek input text in Unicode files using the Hellenic keyboard layout by setting termencoding to iso8859-7. Setting it to cp1253 solved the problem.


Can someone suggest a nice way for typing the unicode diacritical marks? For acute accent, I use:

imap <M-a> <C-V>u0301
map <C-a> a<C-V>u0301<Esc>

If you, like me, are a Far East (esp. CJK) user and need symbols like the dash to be double-width, you need the option:

set ambiwidth=double
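
In a vimrc shared between several machines, the option could be guarded as in the sketch below. The guard conditions are an assumption on my part, based on 'ambiwidth' only taking effect when 'encoding' is a Unicode encoding; adjust them to your setup.

```vim
" Only set 'ambiwidth' if this Vim supports the option
" and the internal encoding is utf-8
if exists('+ambiwidth') && &encoding ==# 'utf-8'
  set ambiwidth=double
endif
```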

About accents: see :help digraph.txt and in particular :help digraphs-default.

An a with grave (à) is Ctrl-K a ! (control-K, small-a-for-alfa, exclamation-mark).

A spacing grave accent (`) is Ctrl-K ' ! (control-K, apostrophe, exclamation-mark).
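
To try a digraph without leaving the command line, something like the following sketch should work, assuming 'encoding' is utf-8 and the current buffer is modifiable:

```vim
" Append à after the cursor, exactly as if you had typed: a Ctrl-K a !
execute "normal! a\<C-K>a!"

" List every digraph available with the current 'encoding'
digraphs
```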