saversites Posted May 20, 2018

PHP Buddies,

What I am trying to do is learn to build a simple web crawler. At first I will feed it a URL to start with. It will then fetch that page and extract all the links into a single array. Then it will fetch each of those linked pages and likewise extract all their links into a single array. It will keep doing this until it reaches its maximum link depth. Here is how I coded it:

<?php
include('simple_html_dom.php');

$current_link_crawling_level = 0;
$link_crawling_level_max = 2;

if($current_link_crawling_level == $link_crawling_level_max) {
    exit();
} else {
    $url = 'https://www.yahoo.com';
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    $html = curl_exec($curl);

    $current_link_crawling_level++;

    // to fetch all hyperlinks from the webpage
    $links = array();
    foreach($html->find('a') as $a) {
        $links[] = $a->href;
        echo "Value: $value<br />\n";
        print_r($links);

        $url = '$value';
        $curl = curl_init($value);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
        $html = curl_exec($curl);

        // to fetch all hyperlinks from the webpage
        $links = array();
        foreach($html->find('a') as $a) {
            $links[] = $a->href;
            echo "Value: $value<br />\n";
            print_r($links);
            $current_link_crawling_level++;
        }

        echo "Value: $value<br />\n";
        print_r($links);
    }
}
?>

I have a feeling I got confused and messed up the foreach loops. Nested too much. Is that the case? Give me a hint where I went wrong. I am unable to test the script until I first sort out this error:

Fatal error: Uncaught Error: Call to a member function find() on string in C:\xampp\h

After that, I will be able to test it. Anyway, just looking at the script, do you think I got it right or not?
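A hedged aside for other readers: one way to avoid nesting a foreach per depth level is to flatten the crawl into a single loop over a queue of (url, depth) pairs. This is a hypothetical sketch, not the poster's code; the `$fetch_links` callback is injectable here so the loop logic can be demonstrated without the network, whereas a real crawler would fetch each page with cURL inside that callback.

```php
<?php
// Hypothetical sketch: a depth-limited crawl flattened into one loop
// over a queue, instead of one nested foreach per level.
function crawl(string $start_url, int $max_depth, callable $fetch_links): array
{
    $queue = array(array($start_url, 0)); // pairs of (url, depth)
    $seen  = array($start_url => true);   // avoid re-visiting a URL
    $found = array();

    while (!empty($queue)) {
        list($url, $depth) = array_shift($queue);
        $found[] = $url;
        if ($depth >= $max_depth) {
            continue; // at maximum link depth: record it, but do not expand
        }
        foreach ($fetch_links($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = array($link, $depth + 1);
            }
        }
    }
    return $found;
}

// Fake "web" standing in for fetched pages, purely for demonstration.
$fake_web = array(
    'a' => array('b', 'c'),
    'b' => array('d'),
    'd' => array('e'),
);
$result = crawl('a', 2, function ($url) use ($fake_web) {
    return isset($fake_web[$url]) ? $fake_web[$url] : array();
});
print_r($result); // a, b, c, d -- 'e' is beyond depth 2, so it is skipped
```

The queue also makes it easy to add a visited-set, which the nested-loop version has no natural place for.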
Thanks
saversites (Author) Posted May 20, 2018

I just replaced:

//$html = file_get_html('http://nimishprabhu.com');

with:

$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$html = curl_exec($curl);

That is all! That should not result in that error! :eek:
saversites (Author) Posted May 20, 2018

UPDATE: I have been given this sample code just now ... a possible solution with str_get_html:

$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);

$html = str_get_html($response_string);

// to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";

Gonna experiment with it. Just sharing it here for other future newbies!
saversites (Author) Posted May 20, 2018

I am told: "file_get_html is a special function from the simple_html_dom library. If you open the source code for simple_html_dom you will see that file_get_html() does a lot of things that your curl replacement does not. That's why you get your error."

Anyway, folks, I really don't want to be using this limited-capacity file_get_html(), so let's replace it with cURL. I tried my best at giving cURL a shot here. What about you? Care to show how to fix this thing?
saversites (Author) Posted May 20, 2018

I did a search of the PHP manual for str_get_html to be sure what the function does, but I am shown no results. And so I ask: just what does it do?
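For other readers: str_get_html() is not a built-in PHP function, which is why the manual search finds nothing. It comes from the simple_html_dom library and parses an HTML *string* into the same kind of object that file_get_html() builds from a URL or file. A rough equivalent using only PHP's bundled DOM extension, with a hardcoded HTML string for illustration, might look like this:

```php
<?php
// Parse an HTML string and pull out the href of every <a> tag,
// using PHP's built-in DOMDocument instead of simple_html_dom.
$html_string = '<html><body><a href="/one">One</a> <a href="/two">Two</a></body></html>';

libxml_use_internal_errors(true);   // silence warnings on sloppy real-world HTML
$dom = new DOMDocument();
$dom->loadHTML($html_string);

$links = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}
print_r($links); // Array ( [0] => /one [1] => /two )
```

The upside of DOMDocument is that it ships with PHP; the downside is that it lacks simple_html_dom's CSS-selector-style find() syntax.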
saversites (Author) Posted May 20, 2018

PHP Buddies,

Look at these 2 updates. They both succeed in fetching the PHP manual page but fail to fetch the Yahoo homepage. Why is that? The 2nd script is like the 1st one except for a small change. Look at the commented-out parts in script 2 to see the difference. The added code comes after the commented-out part.

SCRIPT 1

<?php
// HALF WORKING
include('simple_html_dom.php');

$url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; // WORKS ON THIS URL
//$url = 'https://yahoo.com'; // FAILS ON THIS URL

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);

$html = str_get_html($response_string);

// to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";
?>

SCRIPT 2

<?php
// HALF WORKING
include('simple_html_dom.php');

$url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; // WORKS ON THIS URL
//$url = 'https://yahoo.com'; // FAILS ON THIS URL

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);

$html = str_get_html($response_string);

/*
// to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";
*/

// Hide HTML warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument;
if($dom->loadHTML($html, LIBXML_NOWARNING)){
    // echo links and their anchor text
    echo '<pre>';
    echo "Link\tAnchor\n";
    foreach($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $anchor = $link->nodeValue;
        echo $href, "\t", $anchor, "\n";
    }
    echo '</pre>';
}else{
    echo "Failed to load html.";
}
?>

Don't forget my previous post! Cheers!
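A hedged guess for other readers as to why yahoo.com fails while php.net works: large sites often reject or alter responses for requests that carry no User-Agent header, and may serve compressed bodies that then fail to parse. Sending a browser-like User-Agent and letting cURL negotiate decompression (CURLOPT_ENCODING set to '') is worth trying. build_curl_opts() below is a hypothetical helper, used only to keep the options in one place; the network call itself is left commented out.

```php
<?php
// Hypothetical helper collecting the cURL options from the scripts
// above, plus a User-Agent and automatic decompression.
function build_curl_opts(): array
{
    return array(
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_SSL_VERIFYPEER => 0,   // as in the original scripts
        CURLOPT_SSL_VERIFYHOST => 0,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; MyCrawler/0.1)',
        CURLOPT_ENCODING       => '',  // '' = accept any encoding cURL supports
    );
}

$curl = curl_init('https://www.yahoo.com');
curl_setopt_array($curl, build_curl_opts());
// $response_string = curl_exec($curl);  // real network call, left commented out
```

If this still fails, dumping curl_error($curl) and the HTTP status from curl_getinfo($curl) after the curl_exec() call would narrow down whether the problem is the transfer or the parsing.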