[ILUG-Webdev] PHP - converting HTML entities outside tags
Kae Verens
kae at verens.com
Wed Aug 25 07:14:22 IST 2004
Lee Hosty wrote:
> I'm using HTMLFilter (http://linux.duke.edu/projects/mini/htmlfilter/) to
> safely allow certain HTML tags as user input to be displayed later. I use
> htmlspecialchars() to display this text in a textarea during editing, so
> everything displays as valid XHTML at this stage, without any user
> confusion.
>
> However if the user inputs a HTML entity (&, " or ' for example), it gets
> saved as is to the DB - and displays fine as XHTML in a textarea - but
> when output to a browser at the viewing stage (as opposed to the editing
> stage) - these bare entities are not valid XHTML and need converting. I
> can do this either at the saving to DB stage or just before outputting to
> browser.
>
> However I can't blindly convert all HTML entities found to their relative
> values anymore using htmlspecialchars(), as some of the entities may be
> inside the tags that the user has input, and I don't want these converted.
>
> ie. user inputs <a href="whatever.html" target='new_target'>"my amazin'
> links & stuff"</a>
>
> needs to be converted to <a href="whatever.html"
> target='new_target'>&quot;my amazin&#039; links &amp; stuff&quot;</a>
>
> Any ideas? I'm new to PHP and would rather not re-invent any wheels.
The method we (my company) use is to not allow the user to enter HTML at
all - convert /all/ special characters to HTML entities. Besides - how many
ordinary users do you know that can write HTML?
So - we convert all special characters to their entities, then when
outputting, we convert a set of agreed formatting tags into HTML. Some of
them are:
*bold*
/italic/
_underscore_
[http://alink.com/|link's title]
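As a rough illustration of the scheme above, here is a minimal sketch. It is not our actual converter: the function name render_markup(), the choice of <strong>, <em> and <u> as output tags, and the restriction of links to http/https URLs are all assumptions made for the example. Links are parked behind placeholders first so that the slashes in a URL are not mistaken for /italic/ markers.

```php
<?php
// Hypothetical helper, written for illustration only.
function render_markup(string $input): string
{
    // NUL bytes are used as placeholder delimiters below, so drop any
    // that arrive in the input, then escape everything.
    $txt = htmlspecialchars(str_replace("\x00", '', $input), ENT_QUOTES, 'UTF-8');

    // [url|title] -> placeholder; the finished <a> element is stored
    // in $links and put back at the end.
    $links = [];
    $txt = preg_replace_callback(
        '~\[(https?://[^|\]\s]+)\|([^\]]+)\]~',
        function (array $m) use (&$links): string {
            $key = "\x00" . count($links) . "\x00";
            $links[$key] = '<a href="' . $m[1] . '">' . $m[2] . '</a>';
            return $key;
        },
        $txt
    );

    // *bold*, /italic/ and _underscore_ in a single pass, so the tags
    // generated here are never rescanned for markers. A marker only
    // counts when it is not glued to a word character or a slash.
    $tags = ['*' => 'strong', '/' => 'em', '_' => 'u'];
    $txt = preg_replace_callback(
        '~(?<![\w/])([*/_])(\w[^*/_]*?)\1(?![\w/])~',
        function (array $m) use ($tags): string {
            $tag = $tags[$m[1]];
            return "<$tag>" . $m[2] . "</$tag>";
        },
        $txt
    );

    // Swap the placeholders back for the real links.
    return $links ? strtr($txt, $links) : $txt;
}
```

Because everything is escaped before any formatting is applied, HTML typed by the user comes out as visible text rather than markup.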
I'm afraid we /did/ re-invent the wheel in that case, but only because I
started writing that converter well before I heard of similar scripts
such as Textile (http://www.textism.com/tools/textile/).
What you could do is convert all quotes and ampersands, then reconvert
the ones surrounded by '<' and '>'.
In PHP (not tested):
$txt = htmlspecialchars($original, ENT_QUOTES);
// htmlspecialchars() escapes the tag delimiters too, so after it a tag
// runs from &lt; to &gt;. Undo the escaping inside each such span.
$txt = preg_replace_callback('/&lt;.*?&gt;/s', function ($m) {
    $tag = str_replace(array('&lt;', '&gt;', '&quot;', '&#039;'),
                       array('<', '>', '"', "'"), $m[0]);
    return str_replace('&amp;', '&', $tag);
}, $txt);
The final str_replace() (reconverting ampersands) should be reconsidered -
plain ampersands are illegal in XHTML, and should only appear on their own
when contained in a CDATA block.
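An alternative to escaping everything and then undoing it inside tags is to split the input on the tags and escape only the text between them. The sketch below is one way that might look. The function name escape_outside_tags() is made up for the example, and it assumes the input has already been through a filter such as HTMLFilter, so that every '<' really does start a tag.

```php
<?php
// Hypothetical helper, written for illustration only.
function escape_outside_tags(string $html): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured tags land on the odd
    // indexes and the text between them on the even ones.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Text between tags: escape &, <, >, " and '.
            $parts[$i] = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        }
        // Tags (odd indexes) are passed through untouched.
    }

    return implode('', $parts);
}

$in = '<a href="whatever.html" target=\'new_target\'>"my amazin\' links & stuff"</a>';
echo escape_outside_tags($in), "\n";
```

Since the tags are left exactly as entered, a bare ampersand inside an attribute value (a query string in an href, say) would still need separate handling.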
Kae