Working With DOM in PHP - Looking at a PHP HTML Parser

So, let’s assume you’ve got a PHP project where you’re scraping pages and trying to parse fields out of the DOM.  Up till now, I’ve just used regular expressions because they’re easy.  I avoided trying to parse HTML as XML using SimpleXML because there are just too many cases where it would fail due to invalid tags.

Well, I feel like an idiot.  It turns out there’s a great extension built into PHP to do just that, and it’s the DOM extension.  Using this, parsing HTML with PHP is just as easy as accessing the DOM using jQuery (hint: very easy).

Let’s say we’ve got a page sitting on our local drive already.  For this example, I’ll use the homepage of this blog.  We’re going to parse out all the links.  I’ve saved the page as index.html and, in the same directory, I’ve created the parser script.

<?php
$dom = new DOMDocument;

// set parser options before loading the document
$dom->preserveWhiteSpace = false;

// you can use loadHTML if you already have your string in memory
$dom->loadHTMLFile( "index.html" );

// grab all the A tags
// returns a DOMNodeList
$tags = $dom->getElementsByTagName( 'a' );

// length is already an integer, so there's no need to wrap it in count()
echo "Total length: " . $tags->length . "\n";

// you can actually iterate over the tags returned -
// I'm not sure why they don't say that more explicitly
foreach($tags as $t)
{
	// each of these is a DOMElement object
	// the value is what's inside the tag
	// the attributes can also be accessed
	printf( "%-50s%s   \n", $t->nodeValue, $t->getAttribute('href') );
}

Here’s a glimpse of the output:

vim                 http://www.rustyrazorblade.com/category/vim/  
virtual box         http://www.rustyrazorblade.com/category/virtual-box/   
vmware              http://www.rustyrazorblade.com/category/vmware/
weird               http://www.rustyrazorblade.com/category/weird/   
wikipedia           http://www.rustyrazorblade.com/category/wikipedia/
windows             http://www.rustyrazorblade.com/category/windows/  
xcode               http://www.rustyrazorblade.com/category/xcode/ 
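
The comment in the script mentions loadHTML for markup that’s already in memory. Here’s a minimal sketch of that, assuming a made-up and deliberately sloppy HTML string (the snippet and its hrefs are invented for illustration, not taken from the real page). The parser may raise warnings on real-world markup, so this version asks libxml to collect them quietly instead of printing them.

```php
<?php
// Hypothetical input: note the unclosed <p> and <b> tags.
$html = '<html><body><p>Unclosed paragraph <b>bold'
      . '<a href="/category/vim/">vim</a>'
      . '<a href="/category/xcode/">xcode</a></body></html>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Build a map of link text => href.
$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[trim($a->nodeValue)] = $a->getAttribute('href');
}

print_r($links);
```

Even with the unclosed tags, both links should still come back, which is exactly the case where treating the page as strict XML tends to fall over.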


Here’s another great reference I originally used to get started:

You can take this a bit further if you want to use the PHP cURL extension.  Additionally, if you’re interested in using the advanced curl_multi_exec functionality, check out my previous post.
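
As a rough sketch of how the two pieces could fit together: fetch the page with cURL, then hand the body to the DOM extension. The function names fetch_html and extract_links are my own for this example, and the timeout and redirect settings are just reasonable guesses, not anything required.

```php
<?php
// Fetch a page body with cURL. Returns null if the request fails.
// (fetch_html is a hypothetical helper name; requires the curl extension.)
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // seconds; an arbitrary choice
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Parse an HTML string and return a list of [link text, href] pairs.
// (extract_links is a hypothetical helper name.)
function extract_links(string $html): array
{
    if ($html === '') {
        return []; // loadHTML rejects empty input
    }

    // Keep parser warnings from invalid markup out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = [trim($a->nodeValue), $a->getAttribute('href')];
    }
    return $links;
}

// Usage (needs network access; the URL is only an example):
// $html = fetch_html('http://www.rustyrazorblade.com/');
// foreach (extract_links($html ?? '') as [$text, $href]) {
//     printf("%-50s%s\n", $text, $href);
// }
```

Keeping the fetch and the parse in separate functions means the parsing half can be tried against a saved file or a string without touching the network.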

Edit: cynope on reddit suggested phpQuery. I haven’t used it yet but it looks pretty cool. If I get a chance to try it I’ll post a follow-up.
