Using Apache Tika to parse and extract text from HTML or PDF documents in Java
In this code, you first create a File object for your HTML or PDF document. Then, you create a Tika AutoDetectParser object to automatically detect the document format. You also create a Metadata object to store the document metadata, a BodyContentHandler object to receive the extracted text, and a ParseContext object to configure the parser.
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
public class TikaParser {
    public static void main(String[] args) throws Exception {
        // Replace "input.html" with the path to your HTML or PDF document
        File inputFile = new File("input.html");
        // Create a Tika parser that detects the document format automatically
        AutoDetectParser parser = new AutoDetectParser();
        // Create a metadata object to store the document metadata
        Metadata metadata = new Metadata();
        // Create a handler to receive the extracted text
        // (the no-argument constructor caps output at 100,000 characters;
        // use new BodyContentHandler(-1) to remove the limit)
        BodyContentHandler handler = new BodyContentHandler();
        // Create a context object to configure the parser
        ParseContext context = new ParseContext();
        // try-with-resources closes the stream even if parsing fails
        try (InputStream stream = new FileInputStream(inputFile)) {
            if (inputFile.getName().endsWith(".html")) {
                // Use the HTML parser directly if the input is an HTML file
                new HtmlParser().parse(stream, handler, metadata, context);
            } else {
                // Otherwise, let AutoDetectParser detect the format (for example, PDF)
                parser.parse(stream, handler, metadata, context);
            }
        }
        // Print the extracted text
        System.out.println(handler.toString());
    }
}
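If the input file name ends in .html, you create an HtmlParser object to parse the HTML and extract the text. Otherwise, the code assumes the input is a PDF file and hands it to the AutoDetectParser, which detects the actual format and extracts the text. Because AutoDetectParser also recognizes HTML, the separate HtmlParser branch is optional.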
Finally, you print the extracted text using the toString() method of the BodyContentHandler.
Note that this code assumes that Apache Tika and its parser dependencies are on your project's classpath (for example, the tika-core and tika-parsers-standard-package artifacts in Tika 2.x).
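The file-name test in the code above, endsWith(".html"), is case-sensitive and does not match the shorter .htm extension, so a file named REPORT.HTML or page.htm would fall through to the other branch. A minimal sketch of a more forgiving check follows. FileTypeHints and isHtmlFile are hypothetical names chosen for illustration; they are not part of Apache Tika, and the sketch uses only the Java standard library.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Apache Tika.
final class FileTypeHints {
    private FileTypeHints() {}

    // Returns true when the file name ends in .html or .htm, ignoring case.
    static boolean isHtmlFile(String fileName) {
        if (fileName == null) {
            return false;
        }
        // Locale.ROOT avoids locale-specific case mapping surprises
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }

    public static void main(String[] args) {
        // Print the result of the check for a few sample names
        for (String name : new String[] {"index.html", "REPORT.HTM", "paper.pdf"}) {
            System.out.println(name + " -> " + isHtmlFile(name));
        }
    }
}
```

In the listing above, the condition could then read FileTypeHints.isHtmlFile(inputFile.getName()). A name-based check is still only a hint, which is one reason to let AutoDetectParser handle every input.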