Sanitizing HTML

frderek · August 27, 2020, 2:08pm

This is my first post here so please be gentle.
I am not exactly a beginner at PHP: I have been maintaining and writing scripts for PHP websites for 13 years now. For the same reason, I am too long in the tooth to relearn it all in an object-oriented style, so please forgive the “procedural” code. I am, however, pretty much a beginner when it comes to file handling and advanced string manipulation.

Here’s the situation:
There is a “newsletter” page on a site I maintain that I wish the newsletter author to be able to maintain without my having to intervene. The plan is to have her produce a newsletter.html file using word-processing software with which she is familiar (and save/export it as HTML). I will write a simple script to upload that into an /uploads directory on the server. Thus far, simple.

The problem: MS Word especially, and LibreOffice to a lesser extent, try to control the layout, fonts, etc., whereas I want the page to obey the CSS that is common to the site. Below is a test document created in LibreOffice (sorry, it can’t be attached, so it has to be inline).

Basically, what I want to do is something like this:
<?php
//ini_set('display_errors',1);
//ini_set('display_startup_errors',1);
//error_reporting(-1);
require_once 'myfunctions.php';
/*require_once '…/public_connect.php';*/

//output header
page_header("News");
?>
<body>

<?php
require_once 'nav.php';
?>
<article>
    <img alt="U3A logo" src="images/U3ALogo.png" width="200">
    <br>
    <?php
    // file_get_contents() is the built-in; it returns false if the file cannot be read
    $dirty = file_get_contents("/uploads/newsletter.html");
    $clean = sanitize($dirty);
    echo $clean;
    ?>
</article>
<?php
require_once 'footer.php';
?>
</body>
</html>

Where the function ‘sanitize’ does the following:

1. Ignore (delete) everything from the beginning of the string up to and including the <body> tag
2. Remove any 'class=', 'style=', etc. from the h1, h2, h3, p, blockquote, etc. opening tags, to leave a plain <p>, <h1>, etc. opening tag (which will obey the CSS)
3. Get rid of all the <span> and </span> nonsense
4. Ignore (delete) the </body> tag and everything after it

That ought to leave me with simple HTML that will pass the W3C checker. I hope that makes sense!
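The steps above assume the uploaded file was read successfully. As a minimal sketch of a guarded read, assuming a hypothetical helper named `load_newsletter()` and an `uploads` directory sitting next to the script (neither is from the original post), something like this would avoid passing `false` into the sanitizer when the file is missing:

```php
<?php
// Hypothetical helper (an assumption, not part of the original post):
// read the uploaded file, returning null instead of false when it is
// missing or unreadable.
function load_newsletter(string $path): ?string {
    if (!is_file($path) || !is_readable($path)) {
        return null;
    }
    $contents = file_get_contents($path);
    return $contents === false ? null : $contents;
}

// Assumed location: an "uploads" directory alongside this script.
// Fall back to a short placeholder when nothing has been uploaded yet.
$dirty = load_newsletter(__DIR__ . '/uploads/newsletter.html')
    ?? '<p>No newsletter has been uploaded yet.</p>';
```

Building the path from `__DIR__` avoids relying on a leading `/`, which PHP would resolve against the filesystem root, not the web root.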

Herewith the test HTML from LibreOffice so that you can see what I’m getting at:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
	<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
	<title></title>
	<meta name="generator" content="LibreOffice 6.4.5.2 (Linux)"/>
	<meta name="created" content="2020-08-26T13:35:38.547153674"/>
	<meta name="changed" content="2020-08-27T15:03:47.012662539"/>
	<style type="text/css">
		@page { size: 8.27in 11.69in; margin: 0.79in }
		p { margin-bottom: 0.1in; background: transparent; line-height: 115%; background: transparent }
		h1 { margin-bottom: 0.08in; background: transparent; background: transparent; page-break-after: avoid }
		h1.western { font-family: "Liberation Sans", sans-serif; font-size: 18pt; font-weight: bold }
		h1.cjk { font-family: "FreeSans"; font-size: 18pt; font-weight: bold }
		h1.ctl { font-family: "FreeSans"; font-size: 18pt; font-weight: bold }
		h2 { margin-top: 0.14in; margin-bottom: 0.08in; background: transparent; background: transparent; page-break-after: avoid }
		h2.western { font-family: "Liberation Sans", sans-serif; font-size: 16pt; font-weight: bold }
		h2.cjk { font-family: "FreeSans"; font-size: 16pt; font-weight: bold }
		h2.ctl { font-family: "FreeSans"; font-size: 16pt; font-weight: bold }
		h3 { margin-top: 0.1in; margin-bottom: 0.08in; background: transparent; background: transparent; page-break-after: avoid }
		h3.western { font-family: "Liberation Sans", sans-serif; font-size: 14pt; font-weight: bold }
		h3.cjk { font-family: "FreeSans"; font-size: 14pt; font-weight: bold }
		h3.ctl { font-family: "FreeSans"; font-size: 14pt; font-weight: bold }
		blockquote { margin-left: 0.39in; margin-right: 0.39in; background: transparent; background: transparent }
		a:link { color: #000080; so-language: zxx; text-decoration: underline }
		a:visited { color: #800000; so-language: zxx; text-decoration: underline }
	</style>
</head>
<body lang="en-GB" link="#000080" vlink="#800000" dir="ltr"><h1 class="western">
Heading 1</h1>
<h2 class="western">Heading 2</h2>
<h3 class="western">Heading 3</h3>
<blockquote><i>This is a quotation</i></blockquote>
<p>And this is Text Body. <b>I wonder</b> what HTML <i>will do with</i>
it? 
</p>
<p>What about <u>underlined text and</u><span style="text-decoration: none">
what about </span><strike><span style="text-decoration: none">struck
through text</span></strike><span style="text-decoration: none">? </span>
</p>
<p><span style="text-decoration: none"><sup>Superscript</sup></span><span style="text-decoration: none">
and perhaps </span><span style="text-decoration: none"><sub>subscript
</sub></span><span style="text-decoration: none">text would be good
to have, I think.</span></p>
<p style="margin-bottom: 0in; line-height: 100%"><br/>

</p>
</body>
</html>

GiTLEZ · September 15, 2020, 12:24am

Based on your example HTML, this should work, plus some additional whitespace editing.

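<?php

function _sanitize_html( string $html ): string {
	// Ignore everything up to and including the <body ...> tag
	$offset = stripos( $html, '<body' );
	if ( $offset !== false ) {
		$p1 = strpos( $html, '>', $offset );
		if ( $p1 !== false ) {
			// + 1 skips the closing '>' of the <body ...> tag itself
			$html = substr( $html, $p1 + 1 );
		}
	}

	// Ignore </body> and everything after it
	$end = stripos( $html, '</body>' );
	if ( $end !== false ) {
		$html = substr( $html, 0, $end );
	}

	// Get rid of all the <span> and </span> nonsense
	$html = preg_replace( '/<\/?span\b[^>]*>/i', '', $html );

	// Get rid of element attributes (class=, style=, ...). Note that this
	// also drops href and src, so links and images would need special handling.
	$html = preg_replace( '/<([a-z][a-z0-9]*)\b[^>]*>/i', '<$1>', $html );

	// Trim leading and trailing whitespace left over from the cuts
	return trim( $html );
}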
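
An alternative to the regular expressions is to let PHP's DOM extension parse the export and then walk the tree. The following is only a sketch under some assumptions: the function name `sanitize_with_dom()` and the choice to keep `href` on links are mine, not something from the thread, and the input is assumed to declare its charset in a `<meta>` tag, as the LibreOffice sample does.

```php
<?php
// Hypothetical DOM-based variant of the same four steps (name is an assumption).
function sanitize_with_dom(string $html): string {
    $doc = new DOMDocument();
    // Word-processor exports are rarely valid HTML, so collect libxml
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Steps 1 and 4: only the contents of <body> are kept.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $xpath = new DOMXPath($doc);

    // Step 3: unwrap every <span>, moving its children up one level.
    foreach (iterator_to_array($xpath->query('.//span', $body)) as $span) {
        while ($span->firstChild !== null) {
            $span->parentNode->insertBefore($span->firstChild, $span);
        }
        $span->parentNode->removeChild($span);
    }

    // Step 2: remove every attribute, except href on links (assumed wanted).
    foreach ($xpath->query('.//*', $body) as $element) {
        foreach (iterator_to_array($element->attributes) as $attribute) {
            $keep = $element->nodeName === 'a' && $attribute->nodeName === 'href';
            if (!$keep) {
                $element->removeAttributeNode($attribute);
            }
        }
    }

    // Serialize the children of <body>, not the <body> element itself.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return trim($out);
}
```

The trade-off is that the parser repairs malformed markup on its own terms, so the output may differ slightly from the input in whitespace and tag closing, but nested or oddly formatted tags are less likely to slip through than with pattern matching.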

Cheers