public class CharsetUtil extends Object
Utility class to guess the encoding of a given text file.
Unicode files encoded in UTF-16 (little or big endian), or UTF-8 files with a Byte Order Mark (BOM), are correctly discovered. For UTF-8 files with no BOM, the charset should also be discovered if the buffer is large enough.
A 4 KB byte buffer is usually sufficient to guess the encoding.
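One common way to recognize UTF-8 without a BOM is to check whether the sampled bytes decode as well-formed UTF-8. The sketch below illustrates that idea with the standard `CharsetDecoder`. It is not taken from `CharsetUtil`; the class `Utf8Check` and the method `looksLikeUtf8` are hypothetical names. It assumes the buffer does not end in the middle of a multi-byte sequence, and pure ASCII input also passes because ASCII is valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of CharsetUtil.
final class Utf8Check {

    /** Returns true if the whole buffer decodes as well-formed UTF-8. */
    static boolean looksLikeUtf8(byte[] buffer) {
        // REPORT makes the decoder throw instead of silently replacing bad bytes.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(buffer));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));   // prints "true"
        System.out.println(looksLikeUtf8(latin1)); // prints "false"
    }
}
```

A larger sample makes it more likely that at least one non-ASCII byte is seen, which is why a bigger buffer improves the guess.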
Usage:
    // guess the encoding
    CharsetUtil util = new CharsetUtil(file);
    Charset guessedCharset = util.getCharset();
    // create a reader with the guessed charset
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), guessedCharset));
    // read the file content
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
Author: Guillaume Laforge
Constructor | Description |
---|---|
CharsetUtil(File file) | Creates a CharsetUtil for guessing the encoding of a file. |
CharsetUtil(InputStream inputStream) | Creates a CharsetUtil for guessing from an input stream. |
Modifier and Type | Method | Description |
---|---|---|
Charset | getCharset() | |
Charset | getDefaultCharset() | Retrieves the default Charset. |
boolean | getEnforce8Bit() | Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding. |
boolean | hasUTF16BEBom() | Has a Byte Order Mark for UTF-16 Big Endian (utf-16 and ucs-2). |
boolean | hasUTF16LEBom() | Has a Byte Order Mark for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le). |
boolean | hasUTF8Bom() | Has a Byte Order Mark for UTF-8 (used by Microsoft's Notepad and other editors). |
void | init(InputStream inputStream) | Initializes this CharsetUtil for guessing from an input stream. |
void | setDefaultCharset(Charset defaultCharset) | Defines the default Charset used in case the buffer represents an 8-bit Charset. |
void | setEnforce8Bit(boolean enforce) | If US-ASCII is recognized, forces the default encoding to be returned rather than US-ASCII. |
public CharsetUtil(File file) throws IOException
Parameters:
file - the file whose encoding we want to know.
Throws:
IOException
public CharsetUtil(InputStream inputStream) throws IOException
Creates a CharsetUtil for guessing from an input stream.
Parameters:
inputStream - the input stream.
Throws:
IOException
public void init(InputStream inputStream) throws IOException
Initializes this CharsetUtil for guessing from an input stream.
Parameters:
inputStream - the input stream.
Throws:
IOException
public void setDefaultCharset(Charset defaultCharset)
Defines the default Charset used in case the buffer represents an 8-bit Charset.
Parameters:
defaultCharset - the default Charset to be returned by guessEncoding() if an 8-bit Charset is encountered.
public Charset getCharset()
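public void setEnforce8Bit(boolean enforce)
If US-ASCII is recognized, forces the default charset to be returned rather than US-ASCII.
Parameters:
enforce - a boolean specifying whether the default charset is returned instead of US-ASCII.
public boolean getEnforce8Bit()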
public Charset getDefaultCharset()
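The interplay between the enforce8Bit flag and the default charset can be sketched as a small decision rule: when the guess is US-ASCII and the flag is set, the default charset is returned instead. The code below is only an illustration of that rule as described here, not the actual implementation; the class `Enforce8BitSketch` and the method `resolve` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the enforce8Bit rule; not part of CharsetUtil.
final class Enforce8BitSketch {

    /**
     * Returns the default charset when the guess is US-ASCII and
     * enforce8Bit is set; otherwise returns the guess unchanged.
     */
    static Charset resolve(Charset guessed, boolean enforce8Bit, Charset defaultCharset) {
        if (enforce8Bit && StandardCharsets.US_ASCII.equals(guessed)) {
            return defaultCharset;
        }
        return guessed;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1; // assumed default for the demo
        System.out.println(resolve(StandardCharsets.US_ASCII, true, fallback));  // prints "ISO-8859-1"
        System.out.println(resolve(StandardCharsets.US_ASCII, false, fallback)); // prints "US-ASCII"
        System.out.println(resolve(StandardCharsets.UTF_8, true, fallback));     // prints "UTF-8"
    }
}
```

Enforcing an 8-bit charset is useful when a file currently holds only 7-bit characters but may later receive characters in the range 128-255.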
public boolean hasUTF8Bom()
public boolean hasUTF16LEBom()
public boolean hasUTF16BEBom()
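The three BOM checks correspond to well-known byte prefixes: `EF BB BF` for UTF-8, `FF FE` for UTF-16 little endian, and `FE FF` for UTF-16 big endian. The following sketch shows how such checks could be written against a raw byte buffer. The class `BomSniffer` and its method names are hypothetical and only mirror the methods above; they are not the library's code.

```java
// Hypothetical helper for illustration; not part of CharsetUtil.
final class BomSniffer {

    /** UTF-8 BOM: EF BB BF. */
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    /** UTF-16 little endian BOM: FF FE (a UTF-32LE BOM starts with the same two bytes). */
    static boolean hasUtf16LeBom(byte[] b) {
        return b.length >= 2
                && (b[0] & 0xFF) == 0xFF
                && (b[1] & 0xFF) == 0xFE;
    }

    /** UTF-16 big endian BOM: FE FF. */
    static boolean hasUtf16BeBom(byte[] b) {
        return b.length >= 2
                && (b[0] & 0xFF) == 0xFE
                && (b[1] & 0xFF) == 0xFF;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(hasUtf8Bom(sample));    // prints "true"
        System.out.println(hasUtf16LeBom(sample)); // prints "false"
        System.out.println(hasUtf16BeBom(sample)); // prints "false"
    }
}
```

Masking each byte with `0xFF` avoids sign-extension problems, since Java bytes are signed.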
LumisXP 12.3.0.200408 - Copyright © 2006–2020 Lumis EIP Tecnologia da Informação LTDA. All Rights Reserved.