How to sanitize HTML in Java

Whenever our web application receives text that will be rendered as HTML, we must sanitize that text to prevent XSS (cross-site scripting) attacks.
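To make the risk concrete, here is a minimal sketch of what happens when user input is placed into a page without any sanitization. The renderComment helper and the sample input are hypothetical and only serve as an illustration; they are not part of the project built in this article.

```java
public class UnsanitizedRenderingDemo {

    // Hypothetical rendering helper: naively concatenates user input into markup.
    static String renderComment(String userInput) {
        return "<div class=\"comment\">" + userInput + "</div>";
    }

    public static void main(String[] args) {
        String malicious = "Nice post!<script>alert('xss');</script>";

        // The <script> element reaches the page untouched,
        // so a browser rendering this markup would execute it.
        System.out.println(renderComment(malicious));
    }
}
```

The script element ends up verbatim in the generated markup, which is exactly what a sanitization step is meant to prevent.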

OWASP provides a library that does exactly this: the OWASP Java HTML Sanitizer.

Summary

  1. Set up the project
  2. Define the sanitization policies
  3. Write tests
  4. Conclusion

Set up the project

We will use the OWASP Java HTML Sanitizer, along with JUnit 5 and AssertJ for the tests.

Here is our Maven pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.blebail.blog.sample</groupId>
    <artifactId>java-sanitize-html</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>java-sanitize-html</name>
    <url>https://github.com/baptistelebail/samples/tree/master/java-sanitize-html</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <!-- HTML Sanitizer -->
        <dependency>
            <groupId>com.googlecode.owasp-java-html-sanitizer</groupId>
            <artifactId>owasp-java-html-sanitizer</artifactId>
            <version>20191001.1</version>
        </dependency>

        <!-- JUnit -->
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter</artifactId>
            <version>5.6.0</version>
            <scope>test</scope>
        </dependency>

        <!-- AssertJ -->
        <dependency>
            <groupId>org.assertj</groupId>
            <artifactId>assertj-core</artifactId>
            <version>3.15.0</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.22.2</version>
            </plugin>
        </plugins>
    </build>
</project>

Define the sanitization policies

Let’s say our web application is a community website, with articles and comments.

Our application will process and render HTML in three use cases:

  1. a user submits a comment
  2. one of the publishers submits an article
  3. a user updates their profile description (for which we provide a <my-element> component the user can use)

We will need three sanitization policies, one for each use case:

  1. a strict policy removing any HTML from the comment
  2. a policy allowing common HTML elements found in articles (headings, paragraphs, images, links, …)
  3. a custom policy allowing only our element <my-element>

We create the SanitizationPolicy interface:

public interface SanitizationPolicy {

    /**
     * Sanitizes the string according to the policy
     * @param input the input string to be sanitized
     * @return the sanitized string
     */
    String sanitize(String input);
}

Its single method takes an input string and returns the sanitized result.

OWASP HTML Sanitizer represents a sanitization policy as a PolicyFactory, and gives us two ways to obtain one: we can build a policy manually with the HtmlPolicyBuilder, or we can use the pre-made policies exposed as Sanitizers.* constants. Policies can be combined with the and() method.
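As a rough sketch of both approaches, the class below builds one policy by hand and then combines it with a pre-made one. The class name, the constant names and the choice of allowed elements are assumptions made for this illustration; only HtmlPolicyBuilder, PolicyFactory and Sanitizers come from the library.

```java
import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

public class PolicyFactoryExamples {

    // Hand-built policy (illustrative choice): paragraphs, plus links restricted to https URLs.
    static final PolicyFactory PARAGRAPHS_AND_LINKS = new HtmlPolicyBuilder()
            .allowElements("p", "a")
            .allowUrlProtocols("https")
            .allowAttributes("href").onElements("a")
            .toFactory();

    // Pre-made inline formatting policy combined with the hand-built one via and().
    static final PolicyFactory COMBINED = Sanitizers.FORMATTING.and(PARAGRAPHS_AND_LINKS);

    public static void main(String[] args) {
        String input = "<p>Hello <b>world</b></p><script>alert('xss');</script>";

        // Print what each policy keeps from the same input.
        System.out.println(PARAGRAPHS_AND_LINKS.sanitize(input));
        System.out.println(COMBINED.sanitize(input));
    }
}
```

Neither policy allows <script>, so it is dropped in both cases; the combined policy additionally keeps the inline formatting elements covered by Sanitizers.FORMATTING.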

We will implement the SanitizationPolicy interface with an enum, which will have three values:

  1. STRICT
  2. ARTICLE
  3. CUSTOM

Each value is associated with a specific PolicyFactory.

We create the HtmlSanitizationPolicy enum:

import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

public enum HtmlSanitizationPolicy implements SanitizationPolicy {

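    // Empty policy: no element is allowed, so all markup is removed and only text is kept
    STRICT(new HtmlPolicyBuilder()
            .toFactory()),

    // Pre-made policies combined: block elements, inline formatting, styles, images and links
    ARTICLE(Sanitizers.BLOCKS
            .and(Sanitizers.FORMATTING)
            .and(Sanitizers.STYLES)
            .and(Sanitizers.IMAGES)
            .and(Sanitizers.LINKS)),

    // Hand-built policy: only our <my-element> component is allowed
    CUSTOM(new HtmlPolicyBuilder()
            .allowElements("my-element")
            .toFactory());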

    private final PolicyFactory policyFactory;

    HtmlSanitizationPolicy(PolicyFactory policyFactory) {
        this.policyFactory = policyFactory;
    }

    @Override
    public String sanitize(String input) {
        return policyFactory.sanitize(input);
    }
}

We can now sanitize any text, according to our policies, with HtmlSanitizationPolicy.<POLICY>.sanitize(...).
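One possible way to wire this into the application is to route each kind of content to its policy through the SanitizationPolicy interface. The sketch below is only an illustration: ContentType and ContentSanitizer are hypothetical names, the interface is redeclared so the example compiles on its own, and the policies passed in main are stand-ins that merely label their input. In the real application they would be HtmlSanitizationPolicy.STRICT, ARTICLE and CUSTOM.

```java
import java.util.EnumMap;
import java.util.Map;

public class ContentSanitizerDemo {

    // Mirrors the article's SanitizationPolicy interface so the sketch is self-contained.
    interface SanitizationPolicy {
        String sanitize(String input);
    }

    // Hypothetical names for the three use cases described earlier.
    enum ContentType { COMMENT, ARTICLE, PROFILE_DESCRIPTION }

    // Routes each kind of content to the policy registered for it.
    static final class ContentSanitizer {
        private final Map<ContentType, SanitizationPolicy> policies = new EnumMap<>(ContentType.class);

        ContentSanitizer(SanitizationPolicy comment, SanitizationPolicy article, SanitizationPolicy profile) {
            policies.put(ContentType.COMMENT, comment);
            policies.put(ContentType.ARTICLE, article);
            policies.put(ContentType.PROFILE_DESCRIPTION, profile);
        }

        String sanitize(ContentType type, String input) {
            return policies.get(type).sanitize(input);
        }
    }

    public static void main(String[] args) {
        // Stand-in policies that only label their input, to make the routing visible.
        ContentSanitizer sanitizer = new ContentSanitizer(
                input -> "[strict] " + input,
                input -> "[article] " + input,
                input -> "[custom] " + input);

        // prints "[strict] hello"
        System.out.println(sanitizer.sanitize(ContentType.COMMENT, "hello"));
    }
}
```

Because callers depend only on the interface, the policies can be swapped or stubbed in tests without touching the code that renders the content.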

Write tests

We will validate our policies with a few tests:

  1. links (<a>) and JavaScript (<script>) are not allowed with our STRICT policy
  2. JavaScript (<script>) is not allowed with our ARTICLE policy but common article elements (<p>, <strong>, style="...", <img>, <a>, <h1>, <h2>, ...) are.
    We will use a JUnit 5 ParameterizedTest and test a few examples with @ValueSource
  3. links (<a>) and JavaScript (<script>) are not allowed with our CUSTOM policy but <my-element> is

We create HtmlSanitizationPolicyTest in src/test/java:

import org.junit.jupiter.api.Test;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

import static org.assertj.core.api.Assertions.assertThat;

public final class HtmlSanitizationPolicyTest {

    @Test
    public void shouldNotAllowLinksOrJavaScriptOnStrictPolicy() {
        String text = "Text with <a href=\"https://example.com\">a link</a> " +
                "and<script>alert('javascript');</script>";

        String sanitized = HtmlSanitizationPolicy.STRICT.sanitize(text);

        assertThat(sanitized).isEqualTo("Text with a link and");
    }

    @Test
    public void shouldNotAllowJavaScriptOnArticlePolicy() {
        String text = "Text with <a href=\"https://example.com\" rel=\"nofollow\">a link</a> " +
                "and<script>alert('javascript');</script>";

        String sanitized = HtmlSanitizationPolicy.ARTICLE.sanitize(text);

        assertThat(sanitized).isEqualTo("Text with <a href=\"https://example.com\" rel=\"nofollow\">a link</a> and");
    }

    @ParameterizedTest
    @ValueSource(strings = {
            "A <h1>Title</h1> and a <p>paragraph</p>",
            "<strong>Strong</strong> and <em>emphasized</em>",
            "Code with <span style=\"color:red\">style</span>",
            "An <img src=\"https://example.com/img.jpg\" width=\"200\" />",
            "A <a href=\"https://example.com\" rel=\"nofollow\">link</a>"
    })
    public void shouldAllowCommonArticleElementsOnArticlePolicy(String text) {
        String sanitized = HtmlSanitizationPolicy.ARTICLE.sanitize(text);

        assertThat(sanitized).isEqualTo(text);
    }

    @Test
    public void shouldNotAllowLinksOrJavaScriptOnCustomPolicy() {
        String text = "Text with <a href=\"https://example.com\" rel=\"nofollow\">a link</a> " +
                "and<script>alert('javascript');</script> and <my-element>Mine</my-element>";

        String sanitized = HtmlSanitizationPolicy.CUSTOM.sanitize(text);

        assertThat(sanitized).isEqualTo("Text with a link and and <my-element>Mine</my-element>");
    }

    @Test
    public void shouldAllowMyElementOnCustomPolicy() {
        String text = "Text with <my-element>Mine</my-element>";

        String sanitized = HtmlSanitizationPolicy.CUSTOM.sanitize(text);

        assertThat(sanitized).isEqualTo(text);
    }
}

Conclusion

OWASP HTML Sanitizer lets us create HTML sanitization policies very quickly. Its pre-made policies cover the most common needs, and its builder API goes much further when we want to customize our policies, down to which attributes are allowed on which elements.

(The full project sources are available at https://github.com/baptistelebail/samples/tree/master/java-sanitize-html)