{"id":77,"date":"2023-03-09T14:39:01","date_gmt":"2023-03-09T14:39:01","guid":{"rendered":"https:\/\/smartsource.com.sg\/blog\/?p=77"},"modified":"2023-03-09T14:39:01","modified_gmt":"2023-03-09T14:39:01","slug":"apache-tika-to-parse-and-extract-text-from-html-or-pdf-documents-using-java","status":"publish","type":"post","link":"https:\/\/smartsource.com.sg\/blog\/index.php\/2023\/03\/09\/apache-tika-to-parse-and-extract-text-from-html-or-pdf-documents-using-java\/","title":{"rendered":"Apache Tika to parse and extract text from HTML or PDF documents using Java"},"content":{"rendered":"\n<p>In this code, you first create a <code>File<\/code> object for your HTML or PDF document. Then, you create a Tika <code>AutoDetectParser<\/code> object to automatically detect the document format. You also create a <code>Metadata<\/code> object to store the document metadata, a <code>BodyContentHandler<\/code> object to receive the extracted text, and a <code>ParseContext<\/code> object to configure the parser.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import java.io.File;\r\nimport java.io.FileInputStream;\r\nimport java.io.InputStream;\r\n\r\nimport org.apache.tika.metadata.Metadata;\r\nimport org.apache.tika.parser.AutoDetectParser;\r\nimport org.apache.tika.parser.ParseContext;\r\nimport org.apache.tika.parser.html.HtmlParser;\r\nimport org.apache.tika.sax.BodyContentHandler;\r\n\r\npublic class TikaParser {\r\n\r\n    public static void main(String&#91;] args) throws Exception {\r\n\r\n        \/\/ Replace \"input.html\" with the path to your HTML or PDF document\r\n        File inputFile = new File(\"input.html\");\r\n\r\n        \/\/ Create a Tika parser object\r\n        AutoDetectParser parser = new AutoDetectParser();\r\n\r\n        \/\/ Create a metadata object to store the document metadata\r\n        Metadata metadata = new Metadata();\r\n\r\n        \/\/ Create a handler to receive the extracted text\r\n        BodyContentHandler handler = new 
BodyContentHandler();\r\n\r\n        \/\/ Create a context object to configure the parser\r\n        ParseContext context = new ParseContext();\r\n\r\n        \/\/ Use HtmlParser directly if the input is an HTML file\r\n        if (inputFile.getName().endsWith(\".html\")) {\r\n            HtmlParser htmlParser = new HtmlParser();\r\n            \/\/ try-with-resources closes the stream once parsing finishes\r\n            try (InputStream stream = new FileInputStream(inputFile)) {\r\n                htmlParser.parse(stream, handler, metadata, context);\r\n            }\r\n        } else {\r\n            \/\/ Otherwise, assume the input is a PDF file and let AutoDetectParser handle it\r\n            try (InputStream stream = new FileInputStream(inputFile)) {\r\n                parser.parse(stream, handler, metadata, context);\r\n            }\r\n        }\r\n\r\n        \/\/ Print the extracted text\r\n        System.out.println(handler.toString());\r\n    }\r\n}\r\n<\/code><\/pre>\n\n\n\n<p>If the input file name ends with <code>.html<\/code>, you create an <code>HtmlParser<\/code> object to parse the HTML and extract the text. Otherwise, the code assumes the input is a PDF file and passes it to the <code>AutoDetectParser<\/code>, which detects the format and extracts the text. In both cases the input stream is opened in a try-with-resources block so that it is closed after parsing.<\/p>\n\n\n\n<p>Finally, you print the extracted text using the <code>toString()<\/code> method of the <code>BodyContentHandler<\/code>.<\/p>\n\n\n\n<p>Note that this code assumes that Apache Tika and all its dependencies are available on the classpath of your Java project.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this code, you first create a File object for your HTML or PDF document. Then, you create a Tika AutoDetectParser object to automatically detect the document format. 
You also&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[19],"tags":[56,57,58],"class_list":["post-77","post","type-post","status-publish","format-standard","hentry","category-tutorials","tag-apache-tika","tag-java","tag-parser"],"_links":{"self":[{"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/posts\/77","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=77"}],"version-history":[{"count":1,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/posts\/77\/revisions"}],"predecessor-version":[{"id":78,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/posts\/77\/revisions\/78"}],"wp:attachment":[{"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=77"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=77"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=77"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}