Apache Tika code to detect language from text
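In this code, you first create an input stream for your text. Then, you use the CharsetDetector
class to detect the character encoding of the text. Finally, you use the LanguageIdentifier
class to detect the language of the text. LanguageIdentifier works on decoded text (a String)
rather than on raw bytes, so the bytes must be decoded with the detected encoding first.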
Note that this code assumes that your text is in plain text format. If your text is in a different format, such as HTML or PDF, you will need to use a Tika parser to extract the plain text from the document before detecting the language.
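import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LanguageDetection {
    public static void main(String[] args) throws Exception {
        // Open the text file named by the first command-line argument.
        // BufferedInputStream supports mark/reset, which CharsetDetector relies on.
        try (InputStream stream = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            // Detect the character encoding of the text
            CharsetDetector detector = new CharsetDetector();
            detector.setText(stream);
            CharsetMatch match = detector.detect();
            if (match == null) {
                throw new IllegalStateException("Could not detect the character encoding");
            }
            String charset = match.getName();

            // Decode the text with the detected encoding
            String text = match.getString();

            // Detect the language of the text; LanguageIdentifier takes a String, not a stream
            // (the class is deprecated in newer Tika releases in favor of LanguageDetector)
            LanguageIdentifier identifier = new LanguageIdentifier(text);
            String language = identifier.getLanguage();

            System.out.println("Character Encoding: " + charset);
            System.out.println("Language: " + language);
        }
    }
}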
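
One detail worth illustrating: the stream is read twice, once to inspect the bytes for encoding detection and once to decode the full text. As far as I know, CharsetDetector.setText(InputStream) expects a stream that supports mark/reset for exactly this reason. The sketch below shows that peek-and-rewind pattern using only the JDK, with no Tika involved. The class name StreamRewindSketch and the helpers peek and decode are hypothetical names made up for this illustration, and the charset name is hard-coded where a detector would normally supply it.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of Tika.
public class StreamRewindSketch {

    // Reads up to `limit` bytes from the head of the stream, then rewinds so
    // that a later reader still sees the stream from its first byte.
    public static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] head = new byte[limit];
        int total = 0;
        int n;
        while (total < limit && (n = in.read(head, total, limit - total)) > 0) {
            total += n;
        }
        in.reset();
        return Arrays.copyOf(head, total);
    }

    // Reads the whole stream and decodes it with the given charset name.
    public static String decode(InputStream in, String charsetName) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), Charset.forName(charsetName));
    }

    public static void main(String[] args) throws IOException {
        // Sample text with non-ASCII letters, encoded as ISO-8859-1 bytes.
        byte[] raw = "Gr\u00fc\u00dfe aus K\u00f6ln".getBytes(StandardCharsets.ISO_8859_1);
        InputStream stream = new BufferedInputStream(new ByteArrayInputStream(raw));

        // A detector would inspect these leading bytes to guess the encoding.
        byte[] head = peek(stream, 5);

        // Assumption: the charset name is hard-coded here instead of detected.
        String text = decode(stream, "ISO-8859-1");

        System.out.println("Peeked bytes: " + head.length);
        System.out.println("Decoded text: " + text);
    }
}
```

A plain FileInputStream does not support mark/reset, so wrapping it in a BufferedInputStream (or reading everything into a byte array first) is the usual way to make the second pass possible.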