
Posted

PHP Buddies,

What I am trying to do is learn to build a simple web crawler.
At first, I will feed it a URL to start with.
It will then fetch that page and extract all the links into a single array.
Then it will fetch each of those linked pages and likewise extract all their links into a single array. It will keep doing this until it reaches its maximum link depth.
Here is how I coded it:

<?php 
include('simple_html_dom.php'); 

$current_link_crawling_level = 0; 
$link_crawling_level_max = 2;   // was missing its semicolon

if($current_link_crawling_level == $link_crawling_level_max)
{
    exit(); 
}
else
{
    $url = 'https://www.yahoo.com'; 
    $curl = curl_init($url); 
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
    $html = curl_exec($curl);   // NOTE: this returns a string, not a DOM object

    $current_link_crawling_level++;

    // to fetch all hyperlinks from the webpage
    $links = array(); 
    foreach($html->find('a') as $a)   // this is the line that fatals: find() on a string
    { 
        $link = $a->href;   // was an undefined $value
        $links[] = $link; 
        echo "Value: $link<br />\n"; 
        print_r($links); 

        $curl = curl_init($link); 
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
        $html = curl_exec($curl); 

        // to fetch all hyperlinks from this page
        $links = array(); 
        foreach($html->find('a') as $a) 
        { 
            $links[] = $a->href; 
            echo "Value: {$a->href}<br />\n";
            print_r($links); 
            $current_link_crawling_level++;
        } 
    }   // this closing brace for the outer foreach was missing
}
?>
	

I have a feeling I got confused and messed up the foreach loops. Nested too much. Is that the case? Hint at where I went wrong.
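For what it's worth, the fetch → extract → recurse loop described at the top can be written without nesting copies of the same foreach by using recursion: each level of depth is one recursive call. Here is a minimal sketch of that idea, with the assumption that PHP's built-in cURL and DOMDocument are used instead of simple_html_dom; `fetch_page()`, `extract_links()` and `crawl()` are made-up helper names, not library functions:

```php
<?php
// Sketch of a depth-limited crawler. Recursion replaces the nested
// foreach loops: each depth level is one recursive call.

function fetch_page($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    $body = curl_exec($curl);   // string on success, false on failure
    curl_close($curl);
    return $body;
}

function extract_links($html_string) {
    $links = array();
    if (!is_string($html_string) || $html_string === '') {
        return $links;          // fetch failed or returned nothing
    }
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);   // silence warnings on messy HTML
    if ($dom->loadHTML($html_string)) {
        foreach ($dom->getElementsByTagName('a') as $a) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

function crawl($url, $depth, $max_depth) {
    if ($depth >= $max_depth) {
        return;                 // reached the maximum link depth
    }
    $links = extract_links(fetch_page($url));
    print_r($links);
    foreach ($links as $link) {
        crawl($link, $depth + 1, $max_depth);
    }
}

// crawl('https://www.yahoo.com', 0, 2);
```

A real crawler would also need to skip already-visited URLs and resolve relative links, but the recursion shows the shape of the depth limit.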

Unable to test the script as I have to first sort out this error:
Fatal error: Uncaught Error: Call to a member function find() on string in C:\xampp\h
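For anyone hitting the same thing: that fatal error happens because curl_exec() returns the page body as a plain string, and find() only exists on the object that simple_html_dom's str_get_html()/file_get_html() produce. A minimal illustration of the distinction, shown with PHP's built-in DOMDocument so it runs without the library:

```php
<?php
// curl_exec() hands back raw HTML as a string; strings have no find()
// method, so the string must be parsed into a DOM object first.

$response_string = '<html><body><a href="/about">About</a></body></html>';
var_dump(is_string($response_string));   // bool(true) -- calling ->find() here fatals

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($response_string);        // parse the string into a DOM
$first = $dom->getElementsByTagName('a')->item(0);
echo $first->getAttribute('href'), "\n"; // prints "/about"
```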

After that, I will be able to test it. Anyway, just from looking at the script, do you think I got it right or not?

Thanks

Posted

I just replaced:

//$html = file_get_html('http://nimishprabhu.com');

with:

$url = 'https://www.yahoo.com'; 
$curl = curl_init($url); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
$html = curl_exec($curl); 

That is all!
That should not result in that error! :eek:

Posted

UPDATE:

I have been given this sample code just now ...

Possible solution with str_get_html:
$url = 'https://www.yahoo.com'; 
$curl = curl_init($url); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
$response_string = curl_exec($curl); 
$html = str_get_html($response_string);

//to fetch all hyperlinks from a webpage 
$links = array(); 
foreach($html->find('a') as $a) { 
    $links[] = $a->href; 
} 
print_r($links); 
echo "<br />"; 

Gonna experiment with it.
Just sharing it here for other future newbies! :)

Posted

I am told:
"file_get_html() is a special function from the simple_html_dom library. If you open the source code for simple_html_dom, you will see that file_get_html() does a lot of things that your cURL replacement does not. That's why you get your error."

Anyway, folks, I really don't want to keep using the limited file_get_html(), so let's replace it with cURL. I tried my best at a cURL version here. What about you? Care to show how to fix this thing?

Posted

PHP Buddies,

Look at these two updates. They both succeed in fetching the PHP manual page but fail to fetch the Yahoo homepage. Why is that?
The second script is the same as the first except for one small change. Look at the commented-out parts in script 2 to see the difference; the added code comes right after the commented-out part.

SCRIPT 1

<?php 
//HALF WORKING
include('simple_html_dom.php'); 

$url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; // WORKS ON URL
//$url = 'https://yahoo.com'; // FAILS ON URL

$curl = curl_init($url); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
$response_string = curl_exec($curl); 
$html = str_get_html($response_string);

//to fetch all hyperlinks from a webpage 
$links = array(); 
foreach($html->find('a') as $a) { 
    $links[] = $a->href; 
} 
print_r($links); 
echo "<br />"; 

?>

SCRIPT 2

<?php 
//HALF WORKING
include('simple_html_dom.php'); 

$url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; // WORKS ON URL
//$url = 'https://yahoo.com'; // FAILS ON URL

$curl = curl_init($url); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
$response_string = curl_exec($curl); 

/*
$html = str_get_html($response_string);
//to fetch all hyperlinks from a webpage 
$links = array(); 
foreach($html->find('a') as $a) { 
    $links[] = $a->href; 
} 
print_r($links); 
echo "<br />"; 
*/

// Hide HTML warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument;
// pass the raw string from curl_exec(), not a str_get_html() object
if($dom->loadHTML($response_string, LIBXML_NOWARNING)){
    // echo links and their anchor text
    echo '<pre>';
    echo "Link\tAnchor\n";
    foreach($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $anchor = $link->nodeValue;
        echo $href,"\t",$anchor,"\n";
    }
    echo '</pre>';
}else{
    echo "Failed to load html.";
}
?>
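Two plausible reasons Yahoo fails where php.net works (both are guesses, not confirmed from these scripts alone): simple_html_dom's str_get_html() silently returns false for responses larger than its MAX_FILE_SIZE constant (600000 bytes by default), and a portal homepage can exceed that; and some sites respond differently, or not at all, to requests that carry no User-Agent header. A sketch that sends a browser-like User-Agent along with the same cURL options as above (`make_crawler_handle()` is a made-up helper name):

```php
<?php
// Hypothetical helper: the scripts' cURL setup plus a User-Agent header.
// Sending no User-Agent is one possible reason a fetch works for one
// site but fails for another.

function make_crawler_handle($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_USERAGENT,
        'Mozilla/5.0 (compatible; ExampleCrawler/0.1)');
    return $curl;
}

// $response_string = curl_exec(make_crawler_handle('https://yahoo.com'));
// echo strlen($response_string), " bytes\n"; // compare against MAX_FILE_SIZE
```

If the response length comes back over 600000 bytes, the str_get_html() path would return false and the later find() call would fatal, which matches the observed behaviour.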
	

Don't forget my previous post!

Cheers!
