Apache Tika code to detect language from text

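In this code, you first create an input stream for your text. Then you use the CharsetDetector class to detect the character encoding of the text. Finally, you use the LanguageIdentifier class to detect the language of the text. LanguageIdentifier belongs to the Tika 1.x API; newer releases provide language detection through the separate tika-langdetect module instead.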

Note that this code assumes your input is plain text. If it is in a different format, such as HTML or PDF, use a Tika parser to extract the plain text from the document before detecting the language.
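A minimal sketch of that extraction step, assuming Tika 1.x with both tika-core and tika-parsers on the classpath. The class name ExtractThenDetect, the helper extractText, and the sample HTML are made up for illustration; the Tika facade's parseToString method picks a parser for the input and returns the text it extracts.

```java
import org.apache.tika.Tika;
import org.apache.tika.language.LanguageIdentifier;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ExtractThenDetect {

    // Hypothetical helper: extract plain text from any format Tika can parse.
    // Assumes the tika-parsers module is available; with tika-core alone
    // there is no HTML or PDF parser to delegate to.
    static String extractText(InputStream document) throws Exception {
        return new Tika().parseToString(document);
    }

    public static void main(String[] args) throws Exception {
        // Illustrative HTML input; in practice this would be your own document stream
        String html = "<html><body><p>The quick brown fox jumps over the lazy dog.</p></body></html>";
        InputStream document = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));

        // Strip the markup first, then run language detection on the plain text
        String text = extractText(document);
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        System.out.println("Language: " + identifier.getLanguage());
    }
}
```

Very short inputs like this one give the identifier little to work with, so real documents should produce more dependable results.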

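import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LanguageDetection {

    public static void main(String[] args) throws Exception {
        // Example source: the file named by the first argument; substitute your own stream.
        // CharsetDetector.setText(InputStream) needs mark/reset support, hence the buffering.
        try (InputStream stream = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {

            // Detect the character encoding of the text
            CharsetDetector detector = new CharsetDetector();
            detector.setText(stream);
            CharsetMatch match = detector.detect();
            String charset = match.getName();

            // Decode the text with the detected charset, then detect its language.
            // LanguageIdentifier takes the decoded text as a String, not a stream.
            LanguageIdentifier identifier = new LanguageIdentifier(match.getString());
            String language = identifier.getLanguage();

            System.out.println("Character Encoding: " + charset);
            System.out.println("Language: " + language);
        }
    }

}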
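
Why detect the encoding before the language? The sketch below uses only the JDK, no Tika, to show what happens when the same bytes are decoded with the right and the wrong charset. The class name CharsetMatters, the helper decode, and the sample sentence are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMatters {

    // Hypothetical helper: decode raw bytes using a charset name,
    // such as the one a charset detector reports.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // German sample text ("Grüße aus München"), written with Unicode escapes
        // so the source compiles the same way under any file encoding
        String original = "Gr\u00fc\u00dfe aus M\u00fcnchen";
        byte[] raw = original.getBytes(StandardCharsets.ISO_8859_1);

        // Correct charset: the umlauts survive the round trip
        System.out.println(decode(raw, "ISO-8859-1").equals(original));

        // Wrong charset: the bytes are not valid UTF-8, so the decoder
        // substitutes U+FFFD replacement characters for them
        System.out.println(decode(raw, "UTF-8").contains("\uFFFD"));
    }
}
```

Text decoded with the wrong charset loses exactly the characters that distinguish one language from another, which is why the decoding step comes first.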

Smart Source Blog
