Parsing .doc and .docx files with Golang to get all the text?

2025-06-09 08:32:23 GO 8087

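Extracting text from .doc and .docx files in Go takes different libraries or tools, because the two formats are structured very differently. The steps and sample code below show how to extract text from both formats with Go.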

1. Parsing `.docx` files

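A .docx file is an XML-based format: a ZIP archive that bundles several XML files and other resources. You can process it with Go's standard archive/zip and encoding/xml packages, or with a dedicated library such as unioffice.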
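To make the ZIP-plus-XML structure concrete, here is a minimal standard-library sketch. It assumes the main document part is stored at word/document.xml (the usual location, though the format technically allows other paths) and matches elements by local name only. The function names textFromDocumentXML and extractDocxText are made up for this illustration; headers, footers, footnotes, and tracked changes are ignored.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"os"
	"strings"
)

// textFromDocumentXML walks the XML token stream and collects the character
// data found inside <w:t> elements. It writes a newline at the end of every
// paragraph (<w:p>), a tab for <w:tab/>, and a newline for <w:br/>.
// Elements are matched by local name only, which is a simplification.
func textFromDocumentXML(data []byte) (string, error) {
	dec := xml.NewDecoder(bytes.NewReader(data))
	var sb strings.Builder
	inText := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			switch t.Name.Local {
			case "t":
				inText = true
			case "tab":
				sb.WriteByte('\t')
			case "br":
				sb.WriteByte('\n')
			}
		case xml.EndElement:
			switch t.Name.Local {
			case "t":
				inText = false
			case "p":
				sb.WriteByte('\n')
			}
		case xml.CharData:
			if inText {
				sb.Write(t)
			}
		}
	}
	return sb.String(), nil
}

// extractDocxText opens the .docx as a ZIP archive and extracts the text of
// word/document.xml (assumed location of the main document part).
func extractDocxText(path string) (string, error) {
	zr, err := zip.OpenReader(path)
	if err != nil {
		return "", err
	}
	defer zr.Close()
	for _, f := range zr.File {
		if f.Name != "word/document.xml" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		defer rc.Close()
		data, err := io.ReadAll(rc)
		if err != nil {
			return "", err
		}
		return textFromDocumentXML(data)
	}
	return "", fmt.Errorf("%s: word/document.xml not found", path)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: docxtext FILE.docx")
		return
	}
	text, err := extractDocxText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

This avoids any third-party dependency, at the cost of handling only the plain body text; a dedicated library is the better choice when tables, styles, or other parts of the document matter.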

Using the `unioffice` library

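unioffice is a Go library for reading and writing Office files (including .docx). Recent releases are commercially licensed and may require a license key to be set before documents can be opened, so check the project's documentation first. The following example reads a .docx file with unioffice: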

Install the unioffice library:

bash
go get github.com/unidoc/unioffice

Example code for reading a .docx file:

go
package main

import (
	"fmt"
	"log"

	"github.com/unidoc/unioffice/document"
)

func main() {
	// Open the .docx file
	doc, err := document.Open("example.docx")
	if err != nil {
		log.Fatalf("error opening document: %s", err)
	}

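	// Iterate through all body paragraphs and print their text.
	// A paragraph's text lives in its runs, so this concatenates the text of
	// each run (assuming the Paragraph.Runs and Run.Text methods of the
	// document package) and ends every paragraph with a newline.
	for _, para := range doc.Paragraphs() {
		for _, run := range para.Runs() {
			fmt.Print(run.Text())
		}
		fmt.Println()
	}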
}

2. Parsing `.doc` files

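A .doc file uses an older binary format that is hard to parse directly, so it usually needs a dedicated library or tool. Go's standard library has no support for .doc and mature third-party Go parsers are scarce, so the practical options are to convert the file first or to extract its text with an external tool.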

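Converting `.doc` to `.docx` with a Python script

If your .doc files can be converted to .docx, you can convert them with a Python script first and then parse the resulting .docx in Go. Note that python-docx only reads .docx files, and pandoc (which pypandoc wraps) does not accept .doc as an input format, so the script below calls LibreOffice in headless mode instead. This assumes LibreOffice is installed and its soffice binary is on PATH.

Python conversion script:

python
import subprocess

# Convert .doc to .docx with LibreOffice in headless mode.
# Assumes the `soffice` binary is on PATH. The converted file is written
# to output_dir under the input file's base name (example.doc -> example.docx).
def convert_doc_to_docx(input_file, output_dir="."):
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "docx",
         "--outdir", output_dir, input_file],
        check=True,  # raise CalledProcessError if the conversion fails
    )

convert_doc_to_docx('example.doc')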
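The conversion step can also be driven from Go directly, which removes the Python dependency. The following is a minimal sketch, assuming LibreOffice is installed and its soffice binary is on PATH (an assumption, not a requirement stated above). The helper names docxPathFor and convertDocToDocx are made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// docxPathFor returns the path where the converted file is expected:
// the input's base name with a .docx extension, inside outDir.
func docxPathFor(input, outDir string) string {
	base := strings.TrimSuffix(filepath.Base(input), filepath.Ext(input))
	return filepath.Join(outDir, base+".docx")
}

// convertDocToDocx shells out to LibreOffice in headless mode and returns
// the expected path of the converted file. It assumes `soffice` is on PATH.
func convertDocToDocx(input, outDir string) (string, error) {
	cmd := exec.Command("soffice", "--headless",
		"--convert-to", "docx", "--outdir", outDir, input)
	// CombinedOutput captures stdout and stderr so failures are explainable.
	if out, err := cmd.CombinedOutput(); err != nil {
		return "", fmt.Errorf("soffice failed: %w: %s", err, out)
	}
	return docxPathFor(input, outDir), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: doc2docx FILE.doc")
		return
	}
	out, err := convertDocToDocx(os.Args[1], os.TempDir())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("converted file:", out)
}
```

The returned path can then be passed to whichever .docx reader you use.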

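Calling an external tool from Go

If you would rather not use Python, you can call a command-line tool such as antiword from Go to extract the text of a .doc file:

Extracting text with antiword: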

go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

func main() {
	// Call antiword to extract text from .doc file
	cmd := exec.Command("antiword", "example.doc")
	var out bytes.Buffer
	cmd.Stdout = &out
	err := cmd.Run()
	if err != nil {
		fmt.Printf("error executing command: %s", err)
		return
	}
	fmt.Println(out.String())
}
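When an external tool fails, stdout alone says little about why. The sketch below is one possible hardening of the call above: it checks that the binary exists, applies a timeout, and includes the tool's stderr in the error. The helper name runTool and the 30-second defaultTimeout are assumptions chosen for illustration.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// defaultTimeout is an arbitrary limit chosen for this sketch.
const defaultTimeout = 30 * time.Second

// runTool runs an external command and returns what it wrote to stdout.
// On failure, the returned error includes the tool's stderr output.
func runTool(name string, args ...string) (string, error) {
	// Fail early with a clear message if the binary is not installed.
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%s not found on PATH: %w", name, err)
	}

	// Kill the process if it runs longer than defaultTimeout.
	ctx, cancel := context.WithTimeout(context.Background(), defaultTimeout)
	defer cancel()

	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, path, args...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("%s failed: %w (stderr: %q)", name, err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: doctext FILE.doc")
		return
	}
	text, err := runTool("antiword", os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

Returning the text and an error, rather than printing inside the helper, lets the caller decide how to report failures or fall back to another extractor.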

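3. Summary

.docx files: the unioffice library makes reading the text straightforward.
.doc files: because the format is binary, either convert the file to .docx first and process it in Go, or extract the text with an external tool such as antiword.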
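Because the two formats need different handling, a program has to decide which path to take for a given file, and file extensions are not always reliable. One option is to look at the first bytes of the file: a .docx is a ZIP archive and starts with the bytes "PK\x03\x04", while a legacy .doc is an OLE compound file and starts with D0 CF 11 E0 A1 B1 1A E1. The sketch below uses made-up names (sniffFormat, sniffFile). These signatures identify only the container, so other ZIP-based or OLE-based files (for example .xlsx or .xls) would match too.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

var (
	// zipMagic is the local file header signature at the start of a ZIP archive.
	zipMagic = []byte{'P', 'K', 0x03, 0x04}
	// oleMagic is the signature of an OLE compound file (legacy Office formats).
	oleMagic = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}
)

// sniffFormat classifies a file by its leading bytes. It returns "docx" for
// a ZIP container, "doc" for an OLE container, and "unknown" otherwise.
func sniffFormat(header []byte) string {
	switch {
	case bytes.HasPrefix(header, zipMagic):
		return "docx"
	case bytes.HasPrefix(header, oleMagic):
		return "doc"
	default:
		return "unknown"
	}
}

// sniffFile reads up to the first 8 bytes of path and classifies them.
func sniffFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	header := make([]byte, 8)
	n, err := io.ReadFull(f, header)
	// Files shorter than 8 bytes are not an error; they are simply "unknown".
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return "", err
	}
	return sniffFormat(header[:n]), nil
}

func main() {
	for _, path := range os.Args[1:] {
		kind, err := sniffFile(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: %s\n", path, kind)
	}
}
```

The result can be used to dispatch to the .docx reader or to the external .doc tool.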

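4. Complete example

The following complete Go example combines the two pieces above: reading text from a .docx file and calling an external tool to handle a .doc file: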

go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os/exec"

	"github.com/unidoc/unioffice/document"
)

func readDocx(filePath string) {
	doc, err := document.Open(filePath)
	if err != nil {
		log.Fatalf("error opening .docx document: %s", err)
	}

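	// A paragraph's text lives in its runs, so concatenate the text of each
	// run (assuming the Paragraph.Runs and Run.Text methods of the document
	// package) and end every paragraph with a newline.
	for _, para := range doc.Paragraphs() {
		for _, run := range para.Runs() {
			fmt.Print(run.Text())
		}
		fmt.Println()
	}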
}

func readDoc(filePath string) {
	cmd := exec.Command("antiword", filePath)
	var out bytes.Buffer
	cmd.Stdout = &out
	err := cmd.Run()
	if err != nil {
		fmt.Printf("error executing command: %s", err)
		return
	}
	fmt.Println(out.String())
}

func main() {
	fmt.Println("Reading .docx file:")
	readDocx("example.docx")

	fmt.Println("\nReading .doc file:")
	readDoc("example.doc")
}

This code shows how to read the text of a .docx file with unioffice and how to extract the text of a .doc file with antiword. Choose whichever approach fits your requirements for each format.
