[ILUG-Webdev] PHP - converting HTML entities outside tags

Kae Verens kae at verens.com
Wed Aug 25 07:14:22 IST 2004


Lee Hosty wrote:
> I'm using HTMLFilter (http://linux.duke.edu/projects/mini/htmlfilter/) to
> safely allow certain HTML tags as user input to be displayed later. I use
> htmlspecialchars() to display this text in a textarea during editing, so
> everything displays as valid XHTML at this stage, without any user
> confusion.
> 
> However if the user inputs a HTML entity (&, " or ' for example), it gets
> saved as is to the DB - and displays fine as XHTML in a textarea - but
> when output to a browser at the viewing stage (as opposed to the editing
> stage) - these bare entities are not valid XHTML and need converting. I
> can do this either at the saving to DB stage or just before outputting to
> browser.
> 
> However I can't blindly convert all HTML entities found to their relative
> values anymore using htmlspecialchars(), as some of the entities may be
> inside the tags that the user has input, and I don't want these converted.
> 
> ie. user inputs <a href="whatever.html" target='new_target'>"my amazin'
> links & stuff"</a>
> 
> needs to be converted to <a href="whatever.html"
> target='new_target'>&quot;my amazin&#039; links &amp; stuff&quot;</a>
> 
> Any ideas? I'm new to PHP and would rather not re-invent any wheels.

The method we (my company) use is to not allow the user to enter HTML at 
all - convert /all/ special characters to their entities. Besides - how 
many ordinary users do you know that can write HTML?

So - we convert all special characters to their entities when saving, 
then when outputting, we turn a set of agreed formatting markers into 
HTML. Some of them are:
  *bold*
  /italic/
  _underscore_
  [http://alink.com/|link's title]
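
To give a rough idea of how the output step can work, here is a minimal 
sketch. It is not our actual converter: the function name format_markup() 
and the choice of <strong>, <em> and <u> are assumptions made for 
illustration, and only http/https links are accepted.

```php
<?php
// Hypothetical helper (illustration only): escape everything first,
// then turn the agreed markers into tags.
function format_markup($text) {
    // Escape first, so no user-supplied HTML survives.
    $html = htmlspecialchars($text, ENT_QUOTES);

    // Marker character => tag name (the tag choice is an assumption).
    $markers = array('*' => 'strong', '/' => 'em', '_' => 'u');
    foreach ($markers as $mark => $tag) {
        $m = preg_quote($mark, '~');
        // The opening marker must follow whitespace or the start of the
        // text, so slashes and underscores inside URLs are left alone.
        $pattern = '~(?<!\S)' . $m
                 . '([^\s' . $m . '](?:[^' . $m . ']*[^\s' . $m . '])?)'
                 . $m . '(?!\w)~';
        $html = preg_replace($pattern, "<$tag>\$1</$tag>", $html);
    }

    // [http://example.com/|title] becomes a link. The URL was escaped
    // above, so it cannot break out of the href attribute.
    $html = preg_replace(
        '~\[(https?://[^|\]\s]+)\|([^\]]+)\]~',
        '<a href="$1">$2</a>',
        $html
    );
    return $html;
}
```

With this, format_markup("[http://alink.com/|link's title]") should give 
<a href="http://alink.com/">link&#039;s title</a>, and any HTML the user 
types comes out escaped rather than interpreted.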

I'm afraid we /did/ re-invent the wheel in that case, but only because I 
started writing that converter well before I heard of similar scripts 
such as Textile (http://www.textism.com/tools/textile/).

What you could do is convert all quotes and ampersands, then reconvert 
the ones surrounded by '<' and '>'.

In PHP (not tested):
  // escape &, " and ' only, so '<' and '>' still mark the tags
  $txt=str_replace(array('&','"',"'"),array('&amp;','&quot;','&#039;'),$original);
  // put the quotes back inside anything that looks like a tag
  $txt=preg_replace_callback('/<[^>]*>/',function($m){
    return str_replace(array('&quot;','&#039;'),array('"',"'"),$m[0]);
  },$txt);

Ampersands inside the tags are left as &amp; on purpose - plain 
ampersands are illegal in XHTML, even in attribute values, and should 
only appear on their own when contained in a CDATA block.
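
A different way to get the same result is to split the input on the tags 
and escape only the pieces in between. The sketch below assumes the tags 
have already been vetted (by HTMLFilter, say) and that a tag is anything 
matching <[^>]*>. The function name escape_outside_tags() is made up for 
the example.

```php
<?php
// Hypothetical helper (illustration only): escape the text that sits
// outside tags and leave the tags themselves untouched.
function escape_outside_tags($input) {
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates
    // text, tag, text, tag, ... so even indexes hold plain text.
    $parts = preg_split('/(<[^>]*>)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 == 0) {
            $parts[$i] = htmlspecialchars($part, ENT_QUOTES);
        }
    }
    return implode('', $parts);
}
```

Run on Lee's example, this leaves the <a ...> and </a> tags as typed and 
turns the text between them into &quot;my amazin&#039; links &amp; 
stuff&quot;. Text that already contains an entity such as &amp; will be 
escaped a second time. Passing false as the fourth (double_encode) 
argument of htmlspecialchars() avoids that, if it matters for your data.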

Kae


