How to get the content (data) from another site with PHP
March 28, 2008 7 comments
Grabbing the content of another site was the task I had to do. I thought it would be easy, but things did not go exactly as I planned… Fortunately, I succeeded! Learn how to do it too, and collect information from other websites on the internet.
The mission: get the content of a site
Today, at work, I had the task of "pulling" some data from one site and "grafting" it into a section of a website that I am helping to develop. As soon as I was given the task, I remembered a time when a friend did the same thing (and, at the time, he explained to me, more or less, how he did it). I thought it would be very easy to accomplish, but in time I was "reminded" that I am not as good as Gevã… hehe
The solution: native functions of PHP and regular expressions
I went researching on the web, desperately looking for reference material on regular expressions (which, as you will see, are one of the pillars of getting content from another site). Searching here, looking there, and chatting with Gevã, I came to understand regular expressions better. Before that, using a little "gambiarra" (a quick hack), I managed to achieve my goal.
The first thing to do is fetch the entire contents of the page you want to "manipulate". For this you can use, for example, the function file_get_contents(). Storing the site content in a variable looks like this (I'll use as an example the same site I had to work with):
    $url = file_get_contents('http://www.bcb.gov.br/');
Done: the variable $url now contains, as a string, the entire contents of the BCB home page.
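One detail worth keeping in mind: file_get_contents() returns false when the page cannot be read. A minimal sketch of a safer fetch, assuming a helper name (fetchPage) that I made up for illustration:

```php
<?php
// Hypothetical helper (not part of PHP): fetch a page and fail loudly
// instead of silently returning false.
function fetchPage(string $source): string
{
    // The @ operator silences the warning; the false check below handles the failure.
    $html = @file_get_contents($source);
    if ($html === false) {
        throw new RuntimeException("Could not read: $source");
    }
    return $html;
}

// Usage with the URL from the article (requires network access):
// $url = fetchPage('http://www.bcb.gov.br/');
```

The same helper also works with a local file path, which is handy for testing against a saved copy of the page.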
Regular Expressions
So far it was easy; the hard part was grabbing only the portion of the page that I wanted… For this, you must use regular expressions!
If you do not know what regular expressions are, here is a brief (and incomplete…) explanation, from Wikipedia:
A regular expression, in information technology, defines a pattern to be used to search for or replace words or groups of words. It is a precise means of searching for specific portions of text.
With regular expressions it is possible to identify pieces of words or groups of words that match a certain pattern, which is "regular".
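To make the idea of "matching a pattern" concrete, here is a small sketch. The sample text and the pattern are invented for illustration; only preg_match() itself is a real PHP function:

```php
<?php
// Made-up sample text; only the pattern-matching idea matters here.
$text = 'USD 1.72 and EUR 2.71';

// Pattern: three uppercase letters, a space, then a decimal number.
$pattern = '/([A-Z]{3}) (\d+\.\d+)/';

// preg_match() stops at the first match and returns 1 if it found one.
$found = preg_match($pattern, $text, $match);

// $match[0] is the whole matched text; $match[1] and $match[2] are the groups.
echo $found ? "{$match[1]} = {$match[2]}" : 'no match';
```

Each pair of parentheses in the pattern is a "group", and PHP hands back what each group captured in its own array position.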
PHP has several native functions for working with regular expressions. You just need to know what each one is for and use it correctly.
Well, the next step towards solving the problem is to identify the pattern you want to extract from the site. In my case, I needed a table with some foreign exchange rates (it is on the right side of the BCB site). Looking at the source code, I saw that the information is in a table (a semantically correct approach!); moreover, this table sits between the HTML comments "<!-- INICIO INDICADORES -->" and "<!-- FIMINDICADORES -->". This is a good thing, since it makes it much easier to "identify the pattern".
Explaining better: the pattern sought, in this case, is everything between the HTML comments "<!-- INICIO INDICADORES -->" and "<!-- FIMINDICADORES -->". The whole table is there, to my delight! :-)
So what we need to do is use a PHP function (in this case I chose preg_match_all()) to search for a regular expression within the variable $url, which contains the entire home page of the site where the table is.
After a long time testing maaaany regular expressions, I came to the conclusion that I would have to resort to a small "mutreta" (a trick). But first, let me show how the code looks so far.
    $url = file_get_contents('http://www.bcb.gov.br/');
    // The s modifier makes the dot match line breaks as well
    preg_match_all('/ORES -->(.*)<!--/s', $url, $content);
Explaining: the first argument of the function is the pattern I am looking for; the second is where I will search; and the third stores, in an array, all the occurrences found. For most cases this would be enough, but in mine a few more lines of code were still needed.
At this point, the variable $content contains an array with the occurrences found. Using print_r(), I discovered the exact position of what I was looking for: $content[0][0].
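To see what that array looks like, here is a sketch run against a made-up stand-in for the page. The sample markup is my invention (not the real BCB source), and I am assuming the pattern is meant to read /ORES -->(.*)<!--/s:

```php
<?php
// Made-up stand-in for the downloaded page; the real markup is not reproduced here.
$url = "<p>header</p>\n<!-- INICIO INDICADORES -->\n"
     . "<table><tr><td>USD</td></tr></table>\n"
     . "<!-- FIMINDICADORES -->\n<p>footer</p>";

// The s modifier lets the dot match newlines too.
preg_match_all('/ORES -->(.*)<!--/s', $url, $content);

// $content[0] holds the full matches (delimiters included);
// $content[1] holds what the parenthesised group captured.
print_r($content);
```

Note that $content[0][0] still carries the "ORES -->" and "<!--" delimiters, while $content[1][0] holds only what lies between them. Also, (.*) is greedy: on a page with more HTML comments after the table, it would run up to the last "<!--", so a non-greedy (.*?) may be the safer choice.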
Making a "mutreta"
To pull off this "mutreta", I put the contents of that array position into another variable (to make it easier to handle).
    $url = file_get_contents('http://www.bcb.gov.br/');
    preg_match_all('/ORES -->(.*)<!--/s', $url, $content);
    // Variable names are case-sensitive in PHP: $content, not $Content
    $display = $content[0][0];
There were parts of what was returned (i.e. the table with the foreign exchange rates) that I did not want to appear on the site (such as some links). So I decided to remove them with the function str_replace(), which replaces portions of strings. It accepts an array as an argument. So for now, nearing the end, the code is this:
    $url = file_get_contents('http://www.bcb.gov.br/');
    preg_match_all('/ORES -->(.*)<!--/s', $url, $content);
    $display = $content[0][0];
    // Strings to strip from the extracted fragment
    $withdraw = array('most currencies', 'Copom minutes', 'more', 'ORES -->', '<!--');
    $display = str_replace($withdraw, '', $display);
In other words, wherever any of the items in the array $withdraw appears in $display, it is replaced by '' (nothing…). Note that the last two elements of the array are unnecessary "pollution", which is there because of my laziness to write a more elaborate regular expression. :-)
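A small sketch of that replacement on its own, using a made-up fragment (the markup below is invented for illustration, not taken from the real page):

```php
<?php
// Made-up fragment resembling a raw match, with leftover delimiters
// and an unwanted link label.
$display = 'ORES --><table><tr><td>USD</td></tr></table><a href="#">more</a><!--';

// Every string in this list is replaced by '' (i.e. removed), in order.
$withdraw = array('more', 'ORES -->', '<!--');

$display = str_replace($withdraw, '', $display);

echo $display;
```

Keep in mind that str_replace() removes only the exact text given: in this sketch the word "more" disappears, but the now-empty link tag around it stays in the output.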
Finally…
After that, just display on the screen what is "left over" from the cut of the BCB home page content.
    $url = file_get_contents('http://www.bcb.gov.br/');
    preg_match_all('/ORES -->(.*)<!--/s', $url, $content);
    $display = $content[0][0];
    $withdraw = array('most currencies', 'Copom minutes', 'more', 'ORES -->', '<!--');
    $display = str_replace($withdraw, '', $display);
    echo $display;
What if the server does not allow the file_get_contents() function?
There are many servers that, for various reasons (mainly "security"), do not allow file_get_contents() to open remote pages (typically because the allow_url_fopen setting is turned off). In these cases, it is possible to load any external page into a variable using the following code (afterwards, look up the explanations in the official PHP manual):
    $ch = curl_init();
    $timeout = 0;
    curl_setopt($ch, CURLOPT_URL, 'O_SITE_QUE_VOCE_QUER');
    // Return the page as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $content = curl_exec($ch);
    curl_close($ch);
Then the content of the page 'O_SITE_QUE_VOCE_QUER' (a placeholder meaning "the site you want") will be in the variable $content.
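Like file_get_contents(), curl_exec() returns false when the request fails. Here is a hedged sketch that wraps the same calls in a function with a basic error check. The function name, the 10-second default timeout, and the redirect option are my assumptions, not something the original code had:

```php
<?php
// Hypothetical helper wrapping the cURL calls with basic error handling.
function fetchWithCurl(string $address, int $timeout = 10): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $address);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);  // seconds to wait while connecting
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow HTTP redirects

    $content = curl_exec($ch);
    $error = curl_error($ch);  // read the error text before the handle goes away

    if ($content === false) {
        throw new RuntimeException("cURL request failed: $error");
    }
    return $content;
}

// Usage (requires the cURL extension and network access):
// $content = fetchWithCurl('http://www.bcb.gov.br/');
```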
Conclusion
So, folks, to extract part of the content of a web site (using PHP), the steps are:
- Find out exactly which page holds the content you need;
- Load the site content into a variable;
- Extract the parts you want using regular expressions;
- If necessary, trim a few more things from the result of the regular expression;
- Display the final result on the screen.
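The steps above can be sketched end to end as follows. This is not the code I used on the real site: the helper name extractBetween, the START/END markers, and the sample page are all invented for illustration, and the pattern uses a capturing group so that no delimiter "pollution" has to be cleaned up afterwards:

```php
<?php
// Hypothetical helper: return the text between two literal markers,
// or null when the markers are not found.
function extractBetween(string $html, string $start, string $end): ?string
{
    // preg_quote() escapes regex metacharacters in the markers; .*? is non-greedy.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// Steps 1-2: in real use this string would come from file_get_contents() or cURL.
$page = '<p>top</p><!-- START --><table><tr><td>USD</td></tr></table>'
      . '<a>more</a><!-- END --><p>bottom</p>';

// Step 3: extract the part between the markers.
$display = extractBetween($page, '<!-- START -->', '<!-- END -->');

if ($display === null) {
    echo 'Content not found';
} else {
    // Step 4: trim unwanted pieces. Step 5: display the result.
    $display = str_replace(array('<a>more</a>'), '', $display);
    echo $display;
}
```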
The real trick in this case is knowing how to deal with regular expressions; something, I can tell you, that you only learn by doing! And doing a lot! Read the references at the end of this article and look for more material on the Internet.
Another important thing: keep in mind that, since you are taking content from another site, if that site changes its structure you will most likely have to change the regular expression, too.
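One way to notice such a change early is to check the return value of preg_match_all(), which is the number of matches found (or false on error). A minimal sketch, using a made-up page that no longer has the comment markers and assuming the pattern reads /ORES -->(.*)<!--/s:

```php
<?php
// Made-up page that no longer contains the old comment markers.
$page = '<html><body><div class="rates">USD</div></body></html>';

// preg_match_all() returns the number of full matches, or false on error.
$count = preg_match_all('/ORES -->(.*)<!--/s', $page, $content);

if ($count === false) {
    $status = 'regex error';
} elseif ($count === 0) {
    $status = 'pattern not found, the page structure may have changed';
} else {
    $status = 'ok';
}

echo $status;
```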
Be aware of one thing: if you, web developer, have not yet needed to use regular expressions, you can be sure that your time will come!
National references
Expressões Regulares - Guia de Consulta Rápida (Regular Expressions - Quick Reference Guide)
This is the guide by Aurélio Marinho Jargas, excellent for learning regular expressions and for consulting in times of need!
Article from Viva o Linux, by Marcelo Santos Araujo, with an introduction to regular expressions.
International references
Site dedicated to regular expressions.
Virtual library of regular expressions.
To test regular expressions in real time!

Excellent, Mr. Tárcio! You got a lemon and made lemonade. The article was very good, congratulations!
Thank you, Mr. Gevã!
50% of the credit is yours! =D
Cheers!
Remember that for pages built with XHTML this is unnecessary, since the data can be extracted in exactly the same way as from an XML document.
@Rafael Eduardo Kassner
Hi, Rafael, how are you?
Very interesting point! Could you give an example so we can learn?
Cheers!
Thanks for the help!
For those who want to dig deeper, here is a great link on regular expressions:
http://guia-er.sourceforge.net/
It was very useful to me!
@Juliano
Thank you for the reminder about Aurélio's site. Even better is to buy the book, which is the reference he wrote, but for online consultation the site is very good!
Thank you for visiting!
Trackback on August 26, 2008