Using Apache Tika to parse and extract text from HTML or PDF documents in Java
In the code below, you first create a File object for your HTML or PDF document. You then create a Tika AutoDetectParser, which detects the document format automatically, along with a Metadata object to hold the document metadata, a BodyContentHandler to receive the extracted text, and a ParseContext to pass configuration to the parser.
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaParser {
    public static void main(String[] args) throws Exception {
        // Replace "input.html" with the path to your HTML or PDF document
        File inputFile = new File("input.html");
        // Create a Tika parser that detects the document format automatically
        AutoDetectParser parser = new AutoDetectParser();
        // Create a metadata object to store the document metadata
        Metadata metadata = new Metadata();
        // Create a handler to receive the extracted text.
        // The no-argument constructor stops after 100,000 characters;
        // use new BodyContentHandler(-1) to remove that limit.
        BodyContentHandler handler = new BodyContentHandler();
        // Create a context object to configure the parser
        ParseContext context = new ParseContext();
        // try-with-resources closes the stream even if parsing fails
        try (InputStream stream = new FileInputStream(inputFile)) {
            if (inputFile.getName().endsWith(".html")) {
                // Use the HTML parser directly if the input is an HTML file
                HtmlParser htmlParser = new HtmlParser();
                htmlParser.parse(stream, handler, metadata, context);
            } else {
                // Otherwise, assume the input is a PDF file
                parser.parse(stream, handler, metadata, context);
            }
        }
        // Print the extracted text
        System.out.println(handler.toString());
    }
}
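If the input file name ends with .html, you create an HtmlParser to parse the HTML and extract the text. Otherwise, the code assumes the input is a PDF and passes it to the AutoDetectParser, which detects the format and extracts the text. The explicit HTML branch is optional: with the HTML parser on the classpath, AutoDetectParser can recognize HTML on its own and delegate to it.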
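The extension test in the listing is case-sensitive and misses .htm files, and the PDF branch is taken without checking the file at all. The following is a minimal sketch of two more tolerant checks, using only the Java standard library. FormatHints and its method names are hypothetical helpers invented for this illustration. They are not part of Apache Tika and are far simpler than Tika's own detection.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Apache Tika.
final class FormatHints {
    private FormatHints() {}

    /** Returns true if the file name ends with .html or .htm, ignoring case. */
    static boolean hasHtmlExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }

    /** Returns true if the given leading bytes start with the PDF signature "%PDF-". */
    static boolean startsWithPdfSignature(byte[] header) {
        byte[] signature = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (header.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (header[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(hasHtmlExtension("Report.HTML"));
        System.out.println(hasHtmlExtension("paper.pdf"));
        System.out.println(startsWithPdfSignature(pdfHeader));
    }
}
```

In the listing, hasHtmlExtension(inputFile.getName()) could stand in for the endsWith(".html") test. The signature check would need the first few bytes of the file to be read before parsing.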
Finally, you print the extracted text using the toString() method of the BodyContentHandler.
Note that this code assumes the Apache Tika libraries (the core library and its parser modules, with their dependencies) are on your project's classpath.