Benchmarking HTML Parsing: PHP 8.4 vs WP_HTML_Tag_Processor

Parsing and manipulating HTML efficiently is a task developers face regularly. With PHP 8.4’s introduction of native HTML5 support and WordPress 6.2’s new HTML Tag Processor, let’s dive into a performance comparison between these two approaches.

Background

WordPress 6.2 introduced the WP_HTML_Tag_Processor as a solution to a long-standing issue: the lack of reliable HTML5 parsing. Many developers were resorting to regex-based solutions (we’ve all read the famous Stack Overflow answers: “Don’t parse HTML with regex!”).

PHP 8.4 brought native HTML5 support to its DOM implementation.

The Test

I ran a benchmark comparing both approaches, processing a sample HTML structure 100,000 times.
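For readers who want to reproduce this kind of measurement, a minimal timing harness might look like the sketch below. It is not the code used for the numbers in this post (that listing appears at the end); the function name time_iterations and the warm-up count are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: time a callable over a fixed number of iterations.
 * Returns the total time in seconds and the average time per call in milliseconds.
 */
function time_iterations( callable $operation, int $iterations, int $warmup = 100 ): array {
    // Warm-up calls are not timed; they let class loading and caches settle first.
    for ( $i = 0; $i < $warmup; $i++ ) {
        $operation();
    }

    $start = hrtime( true ); // Monotonic clock, in nanoseconds.
    for ( $i = 0; $i < $iterations; $i++ ) {
        $operation();
    }
    $totalSeconds = ( hrtime( true ) - $start ) / 1e9;

    return [
        'total_s' => $totalSeconds,
        'avg_ms'  => $totalSeconds / $iterations * 1000,
    ];
}

// Usage with a stand-in workload; swap in the parsing code under test.
$result = time_iterations( static fn () => strlen( str_repeat( 'a', 100 ) ), 10000 );
printf( "Total: %.4f s, average: %.4f ms\n", $result['total_s'], $result['avg_ms'] );
```

Using hrtime() instead of microtime() avoids jumps from system clock adjustments, and the untimed warm-up keeps one-off setup costs out of the measurement.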

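The test selects the last top-level article with the query selector main > article:last-of-type and checks whether it carries the special class. The Tag Processor has no selector support, so the benchmark emulates the same lookup by scanning the article tags and bookmarking the last one that is not nested.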

Here’s the sample HTML used in the test:

<main><article>First Article</article><article class="featured">Second Article</article><article class="featured special">Third Article</article><div class="container"><article class="nested featured">Nested Article</article></div></main>
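To make the selector concrete, here is a small sketch of what it matches in this sample, assuming PHP 8.4+ so that Dom\HTMLDocument is available. The variable names are mine. The child combinator (>) is what keeps the nested article out of the result.

```php
<?php
// Requires PHP 8.4+ (Dom\HTMLDocument does not exist in earlier versions).
$html = '<main><article>First Article</article>'
    . '<article class="featured">Second Article</article>'
    . '<article class="featured special">Third Article</article>'
    . '<div class="container"><article class="nested featured">Nested Article</article></div>'
    . '</main>';

$document = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR );

// Without the child combinator, the nested article also counts as a "last of type".
$anyLevel = $document->querySelectorAll( 'article:last-of-type' );
echo $anyLevel->length, "\n"; // 2

// Restricting to direct children of <main> leaves only the third article.
$directChildren = $document->querySelectorAll( 'main > article:last-of-type' );
echo $directChildren->length, "\n"; // 1

$lastArticle = $document->querySelector( 'main > article:last-of-type' );
echo $lastArticle->textContent, "\n"; // Third Article
var_dump( $lastArticle->classList->contains( 'special' ) ); // bool(true)
```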

Benchmarks:

Implementation | Total Time (s) | Avg Time per Operation (ms)
PHP 8.4 DOM | 0.6825 | 0.0068
WP_HTML_Tag_Processor | 2.8700 | 0.0287

Finding: PHP 8.4’s DOM implementation took about 76.22% less time than WordPress’s HTML Tag Processor, which makes it roughly 4.2 times faster!
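The 76.22% figure is the share of time saved relative to the Tag Processor. The same two timings can also be read as a speedup ratio, as this short arithmetic check shows (the variable names are mine, the numbers come from the table above):

```php
<?php
// Total times in seconds for 100,000 iterations, copied from the results table.
$domSeconds = 0.6825;
$wpSeconds  = 2.8700;

// Share of the Tag Processor's time that the native DOM saves.
$timeReduction = ( $wpSeconds - $domSeconds ) / $wpSeconds * 100;

// How many times faster the native DOM runs.
$speedup = $wpSeconds / $domSeconds;

printf( "Time reduction: %.2f%%\n", $timeReduction ); // Time reduction: 76.22%
printf( "Speedup: %.2fx\n", $speedup );               // Speedup: 4.21x
```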

What This Means

While WordPress’s HTML Tag Processor serves its purpose well, especially in maintaining backward compatibility, PHP 8.4’s native DOM implementation shows impressive performance gains. The 76% performance difference suggests that WordPress could benefit from adopting native DOM operations when available while maintaining its current implementation for backward compatibility.
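One way to picture "native DOM when available, current implementation otherwise" is a runtime feature check. The sketch below is only an illustration of that idea, not anything WordPress core does today: element_has_class is a hypothetical helper, and the fallback is passed in as a callable so that existing code (for example a WP_HTML_Tag_Processor loop like the one in the benchmark) can be plugged in.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress or PHP): prefer the native
 * PHP 8.4 DOM when it exists, otherwise defer to a caller-supplied fallback.
 */
function element_has_class( string $html, string $selector, string $class, callable $fallback ): bool {
    if ( class_exists( 'Dom\HTMLDocument' ) ) {
        $document = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR );
        $element  = $document->querySelector( $selector );

        return null !== $element && $element->classList->contains( $class );
    }

    // Older PHP: hand the raw HTML to the existing implementation.
    return (bool) $fallback( $html );
}

$html = '<main><article>First</article><article class="special">Last</article></main>';

// Stand-in fallback for illustration only; a real one would inspect the HTML.
$has_special = element_has_class(
    $html,
    'main > article:last-of-type',
    'special',
    static fn ( string $html ): bool => true
);

var_dump( $has_special ); // bool(true)
```

Keeping the fallback behind a callable means the calling code does not change when the PHP version does; only the branch taken inside the helper differs.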

For developers starting new projects on PHP 8.4+, the choice remains difficult until WordPress core is updated to use the new DOM API. In the meantime, WordPress engineers need to balance performance against compatibility requirements.

Benchmark code

Here is the code used for this benchmark:

<?php
/**
 * DOM Operations Benchmark Test
 *
 * Comparing PHP 8.4 DOM vs WP_HTML_Tag_Processor
 */
class DOMBenchmark {
    private const SAMPLE_HTML = <<<HTML
    <main>
    <article>First Article</article>
    <article class="featured">Second Article</article>
    <article class="featured special">Third Article</article>
    <div class="container">
    <article class="nested featured">Nested Article</article>
    </div>
    </main>
    HTML;
    private const ITERATIONS = 100000;
    private $results = [];

    /**
     * PHP 8.4 DOM Implementation
     */
    public function benchmarkPHP84DOM() {
        $startTime = microtime( true );
        $successEl = [];
        for ( $i = 0; $i < self::ITERATIONS; $i ++ ) {
            try {
                $dom = \Dom\HTMLDocument::createFromString(
                    self::SAMPLE_HTML,
                    LIBXML_NOERROR
                );
                // Select the last <article> that is a direct child of <main>
                $lastArticle = $dom->querySelector( 'main > article:last-of-type' );
                $successEl[] = $lastArticle?->classList->contains( 'special' );
            } catch ( \Throwable $e ) {
                // On PHP < 8.4 the missing Dom\HTMLDocument class raises an Error, not an Exception
                $this->results['PHP84_DOM']['error'] = $e->getMessage();
                return;
            }
        }
        $endTime = microtime( true );
        $this->results['PHP84_DOM'] = [
            'time'       => ( $endTime - $startTime ),
            'iterations' => self::ITERATIONS,
            'success'    => $successEl
        ];
    }

    /**
     * WP_HTML_Tag_Processor Implementation
     */
    public function benchmarkWordPressDOM() {
        // Load the WordPress classes before starting the timer so file I/O is not measured
        require_once __DIR__ . '/class-wp-html-tag-processor.php';
        require_once __DIR__ . '/class-wp-html-attribute-token.php';
        require_once __DIR__ . '/class-wp-html-decoder.php';
        require_once __DIR__ . '/class-wp-html-span.php';
        $startTime = microtime( true );
        $success   = [];
        for ( $i = 0; $i < self::ITERATIONS; $i ++ ) {
            // Use WP_HTML_Tag_Processor instead of DOMDocument
            $processor = new WP_HTML_Tag_Processor( self::SAMPLE_HTML );
            $bookmark  = 'maybe-last-article';
            while ( $processor->next_tag( 'article' ) ) {
                // Skip the nested article so the bookmark ends up on the last
                // top-level article, without testing for the special class here.
                if ( ! $processor->has_class( 'nested' ) ) {
                    if ( $processor->has_bookmark( $bookmark ) ) {
                        $processor->release_bookmark( $bookmark );
                    }
                    $processor->set_bookmark( $bookmark );
                }
            }
            $processor->seek( $bookmark );
            $success[] = $processor->has_class( 'special' );
        }
        $endTime = microtime( true );
        $this->results['WP_DOM'] = [
            'time'       => ( $endTime - $startTime ),
            'iterations' => self::ITERATIONS,
            'success'    => $success
        ];
    }

    /**
     * Run the benchmark
     */
    public function run() {
        echo "Starting DOM Operations Benchmark...\n";
        // Run PHP 8.4 DOM benchmark
        echo "Running PHP 8.4 DOM benchmark...\n";
        $this->benchmarkPHP84DOM();
        // Run WordPress DOM benchmark
        echo "Running WordPress DOM benchmark...\n";
        $this->benchmarkWordPressDOM();
        // Display results
        $this->displayResults();
    }

    /**
     * Display benchmark results
     */
    private function displayResults() {
        echo "\nBenchmark Results:\n";
        echo str_repeat( "-", 50 ) . "\n";
        foreach ( $this->results as $type => $data ) {
            if ( isset( $data['error'] ) ) {
                echo "$type: Error - {$data['error']}\n";
                continue;
            }
            // Every iteration must have found the article with the special class
            if ( array_unique( $data['success'] ) !== [ true ] ) {
                echo "$type: Not every iteration matched the special article\n";
                continue;
            }
            $timePerOperation = ( $data['time'] / $data['iterations'] ) * 1000; // Convert to milliseconds
            echo "$type:\n";
            echo "Total Time: " . number_format( $data['time'], 4 ) . " seconds\n";
            echo "Iterations: {$data['iterations']}\n";
            echo "Average Time per Operation: " . number_format( $timePerOperation, 4 ) . " ms\n";
            echo str_repeat( "-", 50 ) . "\n";
        }
        // Skip the comparison if either benchmark failed to produce a timing
        if ( ! isset( $this->results['PHP84_DOM']['time'], $this->results['WP_DOM']['time'] ) ) {
            return;
        }
        // Report which is faster, as a share of the slower implementation's time
        $php84Time     = $this->results['PHP84_DOM']['time'];
        $wpTime        = $this->results['WP_DOM']['time'];
        $percentFaster = ( abs( $wpTime - $php84Time ) / max( $wpTime, $php84Time ) ) * 100;
        $percentFaster = number_format( $percentFaster, 2 );
        $faster        = $php84Time < $wpTime ? 'PHP 8.4 DOM' : 'WordPress DOM';
        echo "$faster takes ~$percentFaster% less time\n";
    }
}

// Run the benchmark
$benchmark = new DOMBenchmark();
$benchmark->run();