Removing HTML Tags Using Java

1. Overview

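Sometimes we want to remove all HTML tags from a string that holds an HTML document and keep only the text.

The problem looks simple. However, depending on the requirements, it comes in several variants.

In this tutorial, we'll discuss how to do this in Java.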

2. Using Regex

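Since we already have the HTML in a String variable, what we need is some kind of text processing.

When we face a text processing problem, a regular expression (Regex) is often the first idea that comes to mind.

Removing HTML tags from a string is not a challenge for Regex, since every HTML tag, whether opening or closing, follows the pattern "<…>".

Translated into Regex, that is "<[^>]*>" or "<.*?>".

We should note that Regex quantifiers are greedy by default. That is, the Regex "<.*>" won't work for our problem, because we want to match from '<' to the next '>', not to the last '>' in the line.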
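To make the difference between the three patterns concrete, here is a minimal sketch that applies each of them to the same input. The class name GreedyVsLazyDemo, the helper strip(), and the sample strings are made up for illustration only.

```java
class GreedyVsLazyDemo {

    // Removes every match of the given tag pattern from the input.
    static String strip(String input, String tagRegex) {
        return input.replaceAll(tagRegex, "");
    }

    public static void main(String[] args) {
        String line = "<p>Hello <b>world</b></p>";

        // Greedy: ".*" runs to the LAST '>' in the line, so the text is lost too.
        System.out.println("[" + strip(line, "<.*>") + "]");    // prints "[]"
        // Reluctant: ".*?" stops at the NEXT '>'.
        System.out.println("[" + strip(line, "<.*?>") + "]");   // prints "[Hello world]"
        // Negated character class: cannot run past a '>' at all.
        System.out.println("[" + strip(line, "<[^>]*>") + "]"); // prints "[Hello world]"

        // A tag that spans two lines (hypothetical sample).
        String multiLine = "<a\n  href=\"x\">link</a>";
        // '.' does not match line breaks by default, so the opening tag survives.
        System.out.println(strip(multiLine, "<.*?>").equals("link"));   // prints "false"
        System.out.println(strip(multiLine, "<[^>]*>").equals("link")); // prints "true"
    }
}
```

The last two lines hint at why "<[^>]*>" is usually the safer of the two working patterns: in java.util.regex, '.' skips line terminators unless DOTALL mode is enabled, whereas "[^>]" matches them.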

Now, let's test whether it can remove the tags from an HTML source.

2.1. Removing Tags From example1.html

Before we test removing HTML tags, let's first create an HTML example, say example1.html:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
    <title>This is the page title</title>
</head>
<body>
    <p>
        If the application X doesn't start, the possible causes could be:<br/>
        1. <a href="maven.com">Maven</a> is not installed.<br/>
        2. Not enough disk space.<br/>
        3. Not enough memory.
    </p>
</body>
</html>

Now, let's write a test that uses String.replaceAll() to remove the HTML tags:

String html = ... // load example1.html
String result = html.replaceAll("<[^>]*>", "");
System.out.println(result);
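The snippet above leaves the loading step elided. As one possible way to fill it in, the sketch below reads a file as UTF-8 and applies the same replaceAll() call. The class name RegexTagStripper, both helper names, and the temporary stand-in file are assumptions made for illustration; they are not part of the original test.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RegexTagStripper {

    // Removes everything that looks like a tag: '<', any non-'>' characters, '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Reads the whole file as UTF-8 and strips the tags from its content.
    static String stripTagsFromFile(Path htmlFile) throws IOException {
        String html = new String(Files.readAllBytes(htmlFile), StandardCharsets.UTF_8);
        return stripTags(html);
    }

    public static void main(String[] args) throws IOException {
        // A tiny stand-in document, so the sketch runs without any external file.
        Path file = Files.createTempFile("example", ".html");
        try {
            String sample = "<html><body><p>Hello<br/>world</p></body></html>";
            Files.write(file, sample.getBytes(StandardCharsets.UTF_8));
            System.out.println(stripTagsFromFile(file)); // prints "Helloworld"
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Note that the tags are replaced with an empty string, so text that was separated only by a tag such as <br/> is joined together.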

Running the test against example1.html prints:



    This is the page title


    
        If the application X doesn't start, the possible causes could be:
        1. Maven is not installed.
        2. Not enough disk space.
        3. Not enough memory.

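The output looks quite good, since all HTML tags have been removed.

It does keep the whitespace of the original HTML. However, we can easily remove or skip those blank lines and spaces when we process the extracted text. So far, so good.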
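As a rough illustration of that clean-up step, the following sketch trims every line of the extracted text and drops the lines that end up empty. The class name WhitespaceCleanupDemo and the helper dropBlankLines() are hypothetical, and the sample string is only a shortened stand-in for the output above.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class WhitespaceCleanupDemo {

    // Trims every line and drops the lines that are empty after trimming.
    static String dropBlankLines(String text) {
        return Arrays.stream(text.split("\\R")) // \R matches any line break
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String stripped = "\n\n    This is the page title\n\n    \n"
                + "        1. Maven is not installed.\n";
        System.out.println(dropBlankLines(stripped));
        // prints:
        // This is the page title
        // 1. Maven is not installed.
    }
}
```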

2.2. Removing Tags From example2.html

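As we've just seen, using Regex to remove HTML tags is pretty straightforward. However, this approach can run into problems, because we can't predict what HTML source we'll get.

For example, an HTML document may have <script> or <style> tags, and we may not want their content to appear in the result.

Further, the text in <script>, <style>, or even <body> tags may contain "<" or ">" characters. If that's the case, our Regex approach may fail.

Now, let's look at another HTML example, say example2.html: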

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
    <title>This is the page title</title>
</head>
<script>
    // some interesting script functions
</script>
<body>
    <p>
        If the application X doesn't start, the possible causes could be:<br/>
        1. <a
            id="link"
            href="http://maven.apache.org/">
            Maven
            </a> is not installed.<br/>
        2. Not enough (<1G) disk space.<br/>
        3. Not enough (<64MB) memory.<br/>
    </p>
</body>
</html>

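This time, we have a <script> tag, and "<" characters in the text of the <body> tag.

If we apply the same approach to example2.html, we get the following (blank lines have been removed):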

   This is the page title
    // some interesting script functions    
        If the application X doesn't start, the possible causes could be:
        1. 
            Maven
             is not installed.
        2. Not enough (
        3. Not enough (

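Apparently, we've lost some text because of the "<" characters: the pattern treats everything from "<1G" up to the closing ">" of the following <br/> as one tag. The content of the <script> element also ended up in the result.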
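To see exactly what was swallowed, the sketch below lists the substrings that "<[^>]*>" matches in one of the affected lines. The class name OverMatchDemo and the helper matchedSpans() are made up for this illustration; the input line is taken from example2.html.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverMatchDemo {

    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Collects every substring that the tag pattern would remove.
    static List<String> matchedSpans(String input) {
        List<String> spans = new ArrayList<>();
        Matcher matcher = TAG.matcher(input);
        while (matcher.find()) {
            spans.add(matcher.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        String line = "2. Not enough (<1G) disk space.<br/>";

        // The '<' in "(<1G)" opens a "tag" that only ends at the '>' of "<br/>".
        System.out.println(matchedSpans(line));               // prints "[<1G) disk space.<br/>]"
        System.out.println(TAG.matcher(line).replaceAll("")); // prints "2. Not enough ("
    }
}
```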

Therefore, using Regex to process XML or HTML is fragile. Instead, we can let an HTML parser do the job.

Next, we'll look at a few easy-to-use HTML libraries for extracting text.

3. Using Jsoup

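Jsoup is a popular HTML parser. To extract the text from an HTML document, we can simply call Jsoup.parse(htmlString).text().

First, we need to add the Jsoup library to the classpath. For example, let's say we're using Maven to manage the project's dependencies: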

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>

Now, let's test it with our example2.html:

String html = ... // load example2.html
System.out.println(Jsoup.parse(html).text());

If we run this method, it prints:

This is the page title If the application X doesn't start, the possible causes could be: 1. Maven is not installed. 2. Not enough (<1G) disk space. 3. Not enough (<64MB) memory.

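As the output shows, Jsoup has successfully extracted the text from the HTML document. Also, the text in the <script> element has been ignored.

Additionally, by default, Jsoup removes all text formatting and whitespace, such as line breaks.

However, if required, we can also ask Jsoup to preserve the line breaks.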

4. Using HTMLCleaner

HTMLCleaner is another HTML parser. Its goal is to make "ill-formed and dirty" HTML from the Web suitable for further processing.

First, let's add the HTMLCleaner dependency to our pom.xml:

<dependency>
    <groupId>net.sourceforge.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>2.25</version>
</dependency>

We can set various options to control HTMLCleaner's parsing behavior.

Here, as an example, let's tell HTMLCleaner to skip the <script> element when it parses example2.html:

String html = ... // load example2.html
CleanerProperties props = new CleanerProperties();
props.setPruneTags("script");
String result = new HtmlCleaner(props).clean(html).getText().toString();
System.out.println(result);

If we run the test, HTMLCleaner produces this output:

    This is the page title


    
        If the application X doesn't start, the possible causes could be:
        1. 
            Maven
             is not installed.
        2. Not enough (<1G) disk space.
        3. Not enough (<64MB) memory.

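As we can see, the content of the <script> element has been ignored.

Also, it converts <br/> tags into line breaks in the extracted text. This can be helpful if the formatting matters.

On the other hand, HTMLCleaner keeps the whitespace of the original HTML source. So, for example, the text "1. Maven is not installed" is split across three lines.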

5. Using Jericho

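Finally, we'll look at another HTML parser: Jericho. It has a nice feature: it can render HTML markup as text with simple formatting. We'll see it in action shortly.

As usual, let's first add the Jericho dependency to the pom.xml: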

<dependency>
    <groupId>net.htmlparser.jericho</groupId>
    <artifactId>jericho-html</artifactId>
    <version>3.4</version>
</dependency>

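In our example2.html, we have a hyperlink "Maven (http://maven.apache.org/)". Now, let's say we'd like to have both the link URL and the link text in the result.

To do that, we can create a Renderer object and use the includeHyperlinkURLs option: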

String html = ... // load example2.html
Source htmlSource = new Source(html);
Segment segment = new Segment(htmlSource, 0, htmlSource.length());
Renderer htmlRender = new Renderer(segment).setIncludeHyperlinkURLs(true);
System.out.println(htmlRender);

Next, let's run the test and check the output:

If the application X doesn't start, the possible causes could be:
1. Maven <http://maven.apache.org/> is not installed.
2. Not enough (<1G) disk space.
3. Not enough (<64MB) memory.

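As we can see in the result above, the text has been nicely formatted. Also, the text in the <title> element is ignored by default.

The link URL is included as well. Apart from rendering links (<a>), Jericho supports rendering other HTML tags, for example <hr/>, <br/>, bullet lists (<ul> and <li>), and so on.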

6. Conclusion

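In this article, we've looked at different ways to remove HTML tags and extract the text from HTML.

We should note that using Regex to process XML/HTML is not good practice.

As always, the complete source code for this article is available on GitHub.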