Vim Tips Wiki
Detect encoding from the charset specified in HTML files
Latest revision as of 06:07, 13 July 2012

Tip 1074

created 2005 · complexity advanced · author Wu Yongwei · version 6.0


When the files you edit use several different legacy encodings, the Vim fileencodings option cannot help much, because it only tries a fixed list of candidate encodings in order. One workaround is to record the encoding inside the file itself (see VimTip911). HTML files, however, often state their encoding already, especially non-Latin1 Web pages, for example:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" >

The following code can be put in vimrc to detect and use such an encoding specification:

if has('autocmd')
  function! ConvertHtmlEncoding(encoding)
    if a:encoding ==? 'gb2312'
      return 'cp936' " GB2312 imprecisely means CP936 in HTML
    elseif a:encoding ==? 'iso-8859-1'
      return 'latin1' " The canonical encoding name in Vim
    elseif a:encoding ==? 'utf8'
      return 'utf-8' " Other encoding aliases should follow here
    else
      return a:encoding
    endif
  endfunction

  function! DetectHtmlEncoding()
    if &filetype != 'html'
      return
    endif
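    " normal! runs the built-in commands even if the keys are remapped
    normal! m`
    normal! gg
    if search('\c<meta[ \t\n]\+http-equiv=\("\?\)Content-Type\1[ \t\n]\+content="text/html;[ \t\n]*charset=[-A-Za-z0-9_]\+"[ \t\n]*>') != 0
      let reg_bak=@"
      normal! y$
      " \c and \s* mirror the case and whitespace tolerance of the search above
      let charset=matchstr(@", '\ctext/html;\s*charset=\zs[-A-Za-z0-9_]\+')
      let charset=ConvertHtmlEncoding(charset)
      normal! ``
      let @"=reg_bak
      if &fileencodings == ''
        let auto_encodings=',' . &encoding . ','
      else
        let auto_encodings=',' . &fileencodings . ','
      endif
      " Reload only if a charset was extracted and the current encoding
      " is one of the automatically tried candidates
      if charset != '' && charset !=? &fileencoding &&
            \auto_encodings =~ ',' . &fileencoding . ','
        silent! exec 'e ++enc=' . charset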
      endif
    else
      normal ``
    endif
  endfunction

  " Detect charset encoding in an HTML file
  au BufReadPost *.htm* nested call DetectHtmlEncoding()
endif
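
The comment "Other encoding aliases should follow here" suggests the alias list is meant to grow. As a minimal sketch, assuming a dictionary is acceptable in place of the if/elseif chain (this needs Vim 7 or later), the mapping could be kept in a table. The names HtmlCharsetToVimEncoding and s:html_charset_aliases are hypothetical, and the windows-1252 entry is an illustration that is not part of the original tip; check any added name against :help encoding-names.

```vim
" Hypothetical alias table. Only the first three entries come from the tip;
" the last one is an illustrative assumption.
let s:html_charset_aliases = {
      \ 'gb2312': 'cp936',
      \ 'iso-8859-1': 'latin1',
      \ 'utf8': 'utf-8',
      \ 'windows-1252': 'cp1252',
      \ }

" Return the Vim encoding name for an HTML charset name.
" Unknown names are returned unchanged, as in ConvertHtmlEncoding().
function! HtmlCharsetToVimEncoding(charset)
  return get(s:html_charset_aliases, tolower(a:charset), a:charset)
endfunction
```

The lookup lowercases the charset first, which gives the same case-insensitive behaviour as the ==? comparisons in ConvertHtmlEncoding().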

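Note that the autocommand is marked nested. DetectHtmlEncoding() reloads the file with :e ++enc=, and without nested that reload would not trigger the usual autocommands again. Nesting keeps syntax highlighting working and preserves the remembered cursor position.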
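
If the vimrc is sourced more than once, the autocommand above is registered again each time. A minimal sketch of one way to avoid that, assuming the group name HtmlCharsetDetect (hypothetical, not part of the original tip), is to wrap the same nested autocommand in an augroup that is cleared first:

```vim
" Hypothetical group name; the autocommand itself is the one from the tip.
augroup HtmlCharsetDetect
  " Remove earlier definitions in this group before adding the new one
  autocmd!
  autocmd BufReadPost *.htm* nested call DetectHtmlEncoding()
augroup END
```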

It is recommended to use set encoding=utf-8 in order to ensure successful encoding conversion.
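
As a sketch of the related vimrc settings: the fileencodings value below is an assumption chosen for illustration, not something the tip prescribes. With latin1 as the last candidate, a legacy-encoded page is first read as latin1, which is in the automatic list, so DetectHtmlEncoding() is then allowed to reload it with the charset found in the file.

```vim
" Use UTF-8 internally so text converted from any charset can be represented
set encoding=utf-8

" Assumed candidate list: BOM check, then UTF-8, then latin1 as a fallback
set fileencodings=ucs-bom,utf-8,latin1
```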

Plugins

* AutoFenc.vim
* charset.vim
* FencView.vim

Comments

The following source code form is common for generated pages:

<meta content="text/html …" http-equiv="Content-Type" >

This form will not be recognised, because the search pattern expects http-equiv to come before content.
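
A minimal sketch of an order-independent extraction follows. It is not the tip's method: instead of matching the attributes in a fixed order, it takes the first meta tag that mentions charset= anywhere and reads the name after it. The function name HtmlMetaCharset is hypothetical, and the looser match also accepts the short form with a bare charset attribute.

```vim
" Hypothetical helper: return the charset named in the first <meta> tag of
" a:text that contains charset=, or '' if there is none.
function! HtmlMetaCharset(text)
  " \_[^>]* matches any run of characters except '>', including line breaks
  let tag = matchstr(a:text, '\c<meta\_[^>]*charset=\_[^>]*>')
  " Skip an optional opening quote, then capture the charset name
  return matchstr(tag, '\ccharset=["'']\?\zs[-A-Za-z0-9_]\+')
endfunction
```

For example, :echo HtmlMetaCharset('<meta content="text/html; charset=gb2312" http-equiv="Content-Type" >') extracts gb2312 even though content comes first.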

It would be reasonable to limit the search to the document head, expressed as an absolute number of characters to scan. With this restriction, pages whose head contains a lot of comments and white space would be left alone. I do not think this is much of a problem.
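
One way to sketch such a limit uses the stopline argument of search(). Note that this bounds the scan by line count rather than by characters, which is a simplification of the suggestion above. The function name FindMetaCharsetLine, the loose pattern and the limit passed in are all assumptions for illustration.

```vim
" Hypothetical helper: return the line number of the first <meta ... charset=
" that starts within the first a:max_lines lines, or 0 if there is none.
function! FindMetaCharsetLine(max_lines)
  let view = winsaveview()
  call cursor(1, 1)
  " c: accept a match at the cursor, W: do not wrap around the end of file
  let lnum = search('\c<meta\_[^>]*charset=', 'cW', a:max_lines)
  " Put the cursor and window back where they were
  call winrestview(view)
  return lnum
endfunction
```

A caller could skip the detection entirely when FindMetaCharsetLine(50) returns 0, so that long documents are not scanned to the end.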


Version vim7.3_v7 or higher of the :TOhtml plugin distributed with Vim includes an autoload function you could call that does a much more complete HTML-charset to Vim encoding conversion. --Fritzophrenic 16:30, November 15, 2010 (UTC)

This is now done in the AutoFenc.vim plugin mentioned above. For an example, see the plugin code. --Fritzophrenic 22:24, April 4, 2011 (UTC)