1. Problem description

I needed to read the contents of a Word file. Apache POI can do this, so I looked up an example online. The code is as follows:

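    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    /**
     * Reads the text content of a .doc file.
     *
     * @param fs input stream of the file to read
     * @return the text content of the file
     * @throws IOException if the file cannot be read
     */
    public static String readDoc(FileInputStream fs) throws IOException {
        // WordExtractor (HWPF) only understands the legacy binary .doc format.
        try (WordExtractor re = new WordExtractor(fs)) {
            return re.getText();
        }
    }

    public static String readDocToStr(File file) throws IOException {
        try (FileInputStream fs = new FileInputStream(file)) {
            return readDoc(fs);
        }
    }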

    public static void main(String[] args) {
        File file = new File("D:\\xxx\\xxx\\1\\file\\2021\\04\\28\\34a58ac4faa4222712a4329ac60f34f9\\34a58ac4faa4222712a4329ac60f34f9.docx");
        try {
            System.out.println(readDocToStr(file));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

The dependencies are as follows:

    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi</artifactId>
      <version>5.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
      <version>5.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-scratchpad</artifactId>
      <version>5.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml-full</artifactId>
      <version>5.0.0</version>
    </dependency>

Running the code above produces the following error:

Exception in thread "main" java.lang.IllegalArgumentException: The document is really a OOXML file
	at org.apache.poi.hwpf.HWPFDocumentCore.verifyAndBuildPOIFS(HWPFDocumentCore.java:126)
	at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:52)

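This is an OOXML error: the file is a .docx (OOXML) document, while WordExtractor belongs to HWPF and only reads the legacy binary .doc (OLE2) format.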

2. Solution

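The error occurs because the .docx document is being parsed with the wrong extractor. Detecting the actual file format first and choosing the extractor accordingly fixes it: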

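    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.poifs.filesystem.FileMagic;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    static String read(InputStream is) throws Exception {
        // FileMagic.valueOf needs a stream that supports mark/reset, so wrap it if necessary.
        InputStream in = FileMagic.prepareToCheckMagic(is);
        // Detect the real format once from the file header instead of trusting the extension.
        FileMagic magic = FileMagic.valueOf(in);
        System.out.println(magic);
        String text = "";
        if (magic == FileMagic.OLE2) {
            // Legacy binary .doc
            try (WordExtractor ex = new WordExtractor(in)) {
                text = ex.getText();
            }
        } else if (magic == FileMagic.OOXML) {
            // .docx
            try (XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(in))) {
                text = extractor.getText();
            }
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        // The file below is actually an OOXML Word file.
        try (InputStream is = new BufferedInputStream(new FileInputStream("D:\\xxx\\xxx\\1\\file\\2021\\04\\28\\34a58ac4faa4222712a4329ac60f34f9\\34a58ac4faa4222712a4329ac60f34f9.docx"))) {
            System.out.println(read(is));
        }
    }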
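
FileMagic decides the format from the leading bytes of the stream, not from the file extension. The same idea can be sketched in plain Java without POI. This is only an illustration under stated assumptions: the class `WordFormatSniffer`, the enum `Kind`, and the methods `classify` and `sniff` are hypothetical names and not part of POI. Only the two signatures are facts: OLE2 compound files start with `D0 CF 11 E0 A1 B1 1A E1`, and ZIP archives (the container used by .docx) start with `50 4B 03 04`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not part of POI: classifies data by its leading bytes. */
public class WordFormatSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound document signature (legacy .doc, .xls, .ppt).
    private static final int[] OLE2_SIG = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature (.docx and other OOXML files are ZIP archives).
    private static final int[] ZIP_SIG = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies the first len bytes of head. */
    static Kind classify(byte[] head, int len) {
        if (startsWith(head, len, OLE2_SIG)) return Kind.OLE2;
        if (startsWith(head, len, ZIP_SIG)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    /** Peeks at up to 8 bytes, then resets the stream so it can still be parsed. */
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream must support mark/reset");
        }
        byte[] head = new byte[8];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;            // stream shorter than 8 bytes
            n += r;
        }
        in.reset();
        return classify(head, n);
    }

    private static boolean startsWith(byte[] data, int len, int[] sig) {
        if (len < sig.length) return false;
        for (int i = 0; i < sig.length; i++) {
            if ((data[i] & 0xFF) != sig[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = {0x50, 0x4B, 0x03, 0x04, 0, 0, 0, 0};
        System.out.println(sniff(new ByteArrayInputStream(zipLike))); // prints ZIP
    }
}
```

A ZIP signature only shows that the file is a ZIP container. Whether it really holds a Word document is still decided when XWPFDocument opens it, so a check like this selects the parser but does not validate the file.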