Apache Tika code to detect language from text

The code below first creates an input stream for your text. It then uses the CharsetDetector class to detect the character encoding of the text, and finally uses the LanguageIdentifier class to detect the language of the text.

Note that this code assumes that your text is in plain text format. If your text is in a different format, such as HTML or PDF, you will need to use a Tika parser to extract the plain text from the document before detecting the language. Also note that LanguageIdentifier belongs to the older Tika 1.x API, where it is marked as deprecated, so check which language detection classes your Tika version provides.

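import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.parser.txt.CharsetDetector;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LanguageDetection {

    public static void main(String[] args) throws Exception {
        // Example source: a plain-text file whose path is the first argument.
        // Replace this with your own text input stream.
        byte[] bytes;
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            // Read everything once (Java 9+); a stream cannot be consumed twice
            bytes = stream.readAllBytes();
        }

        // Detect the character encoding of the text
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);
        String charset = detector.detect().getName();

        // Decode the bytes with the detected encoding, then detect the language.
        // LanguageIdentifier takes the text as a String, not as a stream.
        String text = new String(bytes, charset);
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        String language = identifier.getLanguage();

        System.out.println("Character Encoding: " + charset);
        System.out.println("Language: " + language);
    }

}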
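
To see why the character encoding is detected before the language, it may help to look at what happens when the same bytes are decoded with two different charsets. The following is a minimal sketch that uses only the Java standard library, not Tika. The class name CharsetDecodeSketch and the helper decode are made up for this illustration, and the charset names are hard-coded where a real program would use the name reported by a detector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDecodeSketch {

    // Hypothetical helper: decode raw bytes using a charset name,
    // such as the name a charset detector reports.
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The German word "Gruesse" written with u-umlaut and sharp s,
        // spelled with Unicode escapes so the source file stays ASCII.
        String original = "Gr\u00fc\u00dfe";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text
        String correct = decode(bytes, "UTF-8");

        // Decoding with a different charset turns each two-byte letter
        // into two unrelated characters
        String garbled = decode(bytes, "ISO-8859-1");

        System.out.println("Same as original: " + correct.equals(original));
        System.out.println("Length when decoded as UTF-8: " + correct.length());
        System.out.println("Length when decoded as ISO-8859-1: " + garbled.length());
    }
}
```

A language detector that is handed the wrongly decoded string sees character sequences that do not occur in the real text, which is why the decoding step needs the right charset first.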
