Apache Tika to parse and extract text from HTML or PDF documents using Java

In the code below, you first create a File object for your HTML or PDF document. Then you create a Tika AutoDetectParser object, which detects the document format automatically. You also create a Metadata object to store the document metadata, a BodyContentHandler object to receive the extracted text, and a ParseContext object to configure the parser.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaParser {

    public static void main(String[] args) throws Exception {

        // Replace "input.html" with the path to your HTML or PDF document
        File inputFile = new File("input.html");

        // Create a Tika parser object
        AutoDetectParser parser = new AutoDetectParser();

        // Create a metadata object to store the document metadata
        Metadata metadata = new Metadata();

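        // Create a handler to receive the extracted text.
        // The default constructor caps output at 100,000 characters;
        // use new BodyContentHandler(-1) to disable the limit for large documents.
        BodyContentHandler handler = new BodyContentHandler();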

        // Create a context object to configure the parser
        ParseContext context = new ParseContext();

        // Open the file with try-with-resources so the stream is always closed
        try (InputStream stream = new FileInputStream(inputFile)) {
            if (inputFile.getName().endsWith(".html")) {
                // Use the HTML parser directly if the input is an HTML file
                HtmlParser htmlParser = new HtmlParser();
                htmlParser.parse(stream, handler, metadata, context);
            } else {
                // Otherwise, let AutoDetectParser detect the format (for example, PDF)
                parser.parse(stream, handler, metadata, context);
            }
        }

        // Print the extracted text
        System.out.println(handler.toString());
    }
}

If the input file name ends with .html, you create an HtmlParser object to parse the HTML and extract the text. Otherwise, you pass the file to the AutoDetectParser, which detects the format (for example, PDF) and extracts the text. AutoDetectParser can also handle HTML on its own, so the explicit HtmlParser branch is optional.
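The file name check in the listing is case-sensitive and only matches .html, so a file such as REPORT.HTM would fall through to the other branch. As a minimal sketch of a more forgiving check, the class below uses only the Java standard library. The names ParserChoice, Kind, and choose are hypothetical helpers introduced for illustration and are not part of Tika; the decision to also accept .htm is an assumption about what you want to treat as HTML.

```java
import java.util.Locale;

public class ParserChoice {

    // Hypothetical labels for the two branches in the listing above
    public enum Kind { HTML_PARSER, AUTO_DETECT }

    // Decide which branch a file name would take, ignoring case.
    // Assumption: both ".html" and ".htm" count as HTML.
    public static Kind choose(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".html") || lower.endsWith(".htm")) {
            return Kind.HTML_PARSER;
        }
        return Kind.AUTO_DETECT;
    }

    public static void main(String[] args) {
        // Print the branch chosen for a few sample file names
        for (String name : new String[] {"input.html", "REPORT.HTM", "paper.pdf"}) {
            System.out.println(name + " -> " + choose(name));
        }
    }
}
```

In the listing, the condition inside the if statement could then call such a helper instead of endsWith(".html"). Lowercasing with Locale.ROOT keeps the comparison independent of the default locale of the machine.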

Finally, you print the extracted text using the toString() method of the BodyContentHandler.

Note that this code assumes Apache Tika and its dependencies are on your project's classpath, including the parser modules for HTML and PDF.
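If you build with Maven, one way to get these classes is a dependency fragment like the one below. This is a sketch under assumptions: it assumes a Tika 2.x release, and the version number shown is only an example, so use whichever release your project targets.

```xml
<dependencies>
  <!-- Core Tika API: AutoDetectParser, Metadata, ParseContext, BodyContentHandler -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.9.2</version>
  </dependency>
  <!-- Standard parser bundle, which includes the HTML and PDF parsers -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.9.2</version>
  </dependency>
</dependencies>
```

Keep both artifacts on the same version to avoid mismatches between the core API and the parsers.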

Smart Source Blog
