This is my first post here so please be gentle.
I am not exactly a beginner at PHP: I have been maintaining and writing scripts for PHP websites for 13 years now. For the same reason, I am too long in the tooth to relearn it all the object-oriented way, so please forgive the “procedural” style. I am, however, pretty much a beginner when it comes to file handling and advanced string manipulation.
Here’s the situation:
There is a “newsletter” page on a site I maintain that I would like the newsletter author to be able to update without my having to intervene. The plan is to have her produce a newsletter.html file using word-processing software with which she is familiar (and save/export it as HTML). I will write a simple script to upload that into an /uploads directory on the server. Thus far, simple.
The problem: MS Word especially, and LibreOffice to a lesser extent, try to control the layout, fonts, etc., whereas I want the page to obey the site-wide CSS. Below is a test document created in LibreOffice (sorry, it can’t be attached, so it has to be inline).
Basically, what I want to do is something like this:
<?php
//ini_set('display_errors',1);
//ini_set('display_startup_errors',1);
//error_reporting(-1);
require_once 'myfunctions.php';
/*require_once '…/public_connect.php';*/
//output header
page_header("News");
?>
<body>
<?php
require_once 'nav.php';
?>
<article>
<img alt="U3A logo" src="images/U3ALogo.png" width="200">
<br>
<?php
// read the uploaded newsletter and strip the word processor's formatting
$dirty = file_get_contents("/uploads/newsletter.html");
$clean = sanitize($dirty);
echo $clean;
?>
</article>
<?php
require_once 'footer.php';
?>
</body>
</html>
Where the function ‘sanitize’ does the following:
1. Ignore (delete) everything from the beginning of the string up to and including the <body> tag
2. Remove any 'class=', 'style=' etc. from the h1, h2, h3, p, blockquote etc. opening tags, leaving a plain <p>, <h1> etc. opening tag (which will obey the CSS)
3. Ignore (delete) the </body> tag and everything after it
4. Get rid of all the <span> and </span> nonsense
That ought to leave me with simple HTML that will pass the W3C checker. I hope that makes sense!
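To make the four steps concrete, here is a rough procedural sketch of what such a sanitize function might look like. It is only a starting point under some assumptions: PHP 7 or later, a list of block-level tags that I picked myself (adjust to taste), and regular expressions that assume no attribute value contains a ">" character. It only tidies layout markup; it is not a security filter, so it should only be fed files from a trusted author.

```php
<?php
// Sketch only: cleans word-processor HTML so it obeys the site CSS.
// The tag list in step 2 is an assumption, not a complete inventory.
function sanitize(string $html): string
{
    // Step 1: drop everything up to and including the opening <body ...> tag.
    if (preg_match('/<body\b[^>]*>/i', $html, $m, PREG_OFFSET_CAPTURE)) {
        $html = substr($html, $m[0][1] + strlen($m[0][0]));
    }

    // Step 3: drop the closing </body> tag and everything after it.
    $end = stripos($html, '</body');
    if ($end !== false) {
        $html = substr($html, 0, $end);
    }

    // Step 4: remove <span ...> and </span> tags, keeping the text between them.
    $html = preg_replace('/<\/?span\b[^>]*>/i', '', $html) ?? $html;

    // Step 2: reduce the listed opening tags to their bare form, e.g.
    // <h1 class="western"> becomes <h1>. Tags not listed (a, img, ...) keep
    // their attributes so links and images still work.
    $tags = 'h[1-6]|p|blockquote|ul|ol|li|table|tr|td|th|div';
    $html = preg_replace('/<(' . $tags . ')\b[^>]*>/i', '<$1>', $html) ?? $html;

    return trim($html);
}
```

In the page it would be called exactly as in the script above, i.e. the contents of the uploaded file go in and the cleaned string is echoed inside the article element. A DOM parser would be more robust against odd markup than regular expressions, but for the fairly regular output of a word processor this may well be enough.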
Herewith the test HTML from LibreOffice so that you can see what I’m getting at:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
<title></title>
<meta name="generator" content="LibreOffice 6.4.5.2 (Linux)"/>
<meta name="created" content="2020-08-26T13:35:38.547153674"/>
<meta name="changed" content="2020-08-27T15:03:47.012662539"/>
<style type="text/css">
@page { size: 8.27in 11.69in; margin: 0.79in }
p { margin-bottom: 0.1in; background: transparent; line-height: 115%; background: transparent }
h1 { margin-bottom: 0.08in; background: transparent; background: transparent; page-break-after: avoid }
h1.western { font-family: "Liberation Sans", sans-serif; font-size: 18pt; font-weight: bold }
h1.cjk { font-family: "FreeSans"; font-size: 18pt; font-weight: bold }
h1.ctl { font-family: "FreeSans"; font-size: 18pt; font-weight: bold }
h2 { margin-top: 0.14in; margin-bottom: 0.08in; background: transparent; background: transparent; page-break-after: avoid }
h2.western { font-family: "Liberation Sans", sans-serif; font-size: 16pt; font-weight: bold }
h2.cjk { font-family: "FreeSans"; font-size: 16pt; font-weight: bold }
h2.ctl { font-family: "FreeSans"; font-size: 16pt; font-weight: bold }
h3 { margin-top: 0.1in; margin-bottom: 0.08in; background: transparent; background: transparent; page-break-after: avoid }
h3.western { font-family: "Liberation Sans", sans-serif; font-size: 14pt; font-weight: bold }
h3.cjk { font-family: "FreeSans"; font-size: 14pt; font-weight: bold }
h3.ctl { font-family: "FreeSans"; font-size: 14pt; font-weight: bold }
blockquote { margin-left: 0.39in; margin-right: 0.39in; background: transparent; background: transparent }
a:link { color: #000080; so-language: zxx; text-decoration: underline }
a:visited { color: #800000; so-language: zxx; text-decoration: underline }
</style>
</head>
<body lang="en-GB" link="#000080" vlink="#800000" dir="ltr"><h1 class="western">
Heading 1</h1>
<h2 class="western">Heading 2</h2>
<h3 class="western">Heading 3</h3>
<blockquote><i>This is a quotation</i></blockquote>
<p>And this is Text Body. <b>I wonder</b> what HTML <i>will do with</i>
it?
</p>
<p>What about <u>underlined text and</u><span style="text-decoration: none">
what about </span><strike><span style="text-decoration: none">struck
through text</span></strike><span style="text-decoration: none">? </span>
</p>
<p><span style="text-decoration: none"><sup>Superscript</sup></span><span style="text-decoration: none">
and perhaps </span><span style="text-decoration: none"><sub>subscript
</sub></span><span style="text-decoration: none">text would be good
to have, I think.</span></p>
<p style="margin-bottom: 0in; line-height: 100%"><br/>
</p>
</body>
</html>