In this article, we’ll see how to make both a simple and a relatively advanced web-crawler (or spider-bot) in PHP. The simpler one will just output all the links it finds on a webpage, while the advanced one will add the titles, keywords and descriptions to a conceptual database (conceptual meaning that no SQL database is used in this article). Compared to Google, even our advanced web-crawler is really just a simple web-crawler, since it doesn’t use any AI agent. We’ll go through a total of three iterations before concluding the article – each one of them with an explanation.
Note: Throughout this article, I’ll use the words spider-bot and web-crawler interchangeably. Some people may use them in a different context but in this article, both words mean essentially the same thing.
There are lots of things you can do to improve these spider-bots and make them more advanced – add functionality like maintaining a popularity index, and implement some anti-spam features such as penalizing websites with no content or websites using “click-bait” strategies like stuffing in keywords that have nothing to do with the content of the page. You could also try to generate the keywords and description from the page itself, which is something GoogleBot does all the time.
A Simple SpiderBot: The simple version will be non-recursive and will simply print all the links it finds on a web-page. Note that all of our main logic will happen in the followLink function!
- Program:
<?php
function followLink($url) {
    $options = array(        // options used when creating the stream context
        'http' => array(
            'method'     => "GET",
            'user_agent' => "gfgBot/0.1"
        )
    );
    // Create context for communication
    $context = stream_context_create($options);

    // Create a new HTML DomDocument for web-scraping
    $doc = new DomDocument();
    @$doc->loadHTML(file_get_contents($url, false, $context));

    // Get all the anchor nodes in the DOM
    $links = $doc->getElementsByTagName('a');

    // Iterate through all the anchor nodes
    // found in the document
    foreach ($links as $i)
        echo $i->getAttribute('href') . '<br/>';
}
followLink("http://example.com/");
?>
- Output:
https://www.iana.org/domains/example
Now, this was no good – we get only one link. That’s because there is only one link on the site example.com, and since we are not recursing, we don’t follow the link that we got. You could run followLink("http://apple.com") if you want to see it in fuller action. If, however, you use neveropen.com, then you may get an error, since neveropen will block our request (for security reasons, of course).
Explanation:
- Line 3: We create an $options array. You don’t have to understand much about it, other than that it will be required for context creation. Note that the user-agent name is gfgBot – you can change it to whatever you like. You could even use GoogleBot to fool a website into thinking that your crawler is Google’s spider-bot, as long as the site uses this method to identify the bot. (A sketch after this explanation shows a few more context options you could set, such as a timeout.)
- Line 10: We create a context for communication. You need a context for just about anything: to tell a story you need a context, to create a window in OpenGL you need a context, and the same goes for HTML5 Canvas and for PHP network communication! Sorry if I got out of “context”, but I had to do that.
- Line 13: We create a DomDocument, which is basically a data structure for DOM handling, generally used for HTML and XML files.
- Line 14: We load the HTML by providing the contents of the document. This process may generate some warnings (for example, on malformed or modern HTML5 markup the parser doesn’t recognise), so we suppress them all with the @ operator.
- Line 17: We collect all the anchor nodes found in the DOM (a node list that we can iterate over like an array).
- Line 21: We print all the links that those anchor nodes refer to.
- Line 24: We crawl the website example.com. It has only one link, which is printed.
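If you want the crawler to be a bit more polite and robust, the same $options array can carry a few more HTTP context options, as hinted at in the Line 3 note above. The snippet below is only a sketch: the option names (timeout, follow_location, max_redirects, header) are standard PHP stream-context options, but the particular values are arbitrary choices that you can tune.
<?php
// Sketch: the same context as before, with a few extra (optional) HTTP options.
$options = array(
    'http' => array(
        'method'          => "GET",
        'user_agent'      => "gfgBot/0.1",
        'timeout'         => 10,   // give up after 10 seconds
        'follow_location' => 1,    // follow HTTP redirects...
        'max_redirects'   => 5,    // ...but not forever
        'header'          => "Accept-Language: en\r\n"
    )
);
$context = stream_context_create($options);

// Used exactly like before:
$html = @file_get_contents("http://example.com/", false, $context);
?>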
A slightly more complicated Spider-Bot: In the previous code, we had a basic spider-bot, and it was good, but it was more of a scraper than a crawler (for the difference between a scraper and a crawler, see this article). We weren’t recursing – we weren’t “following” the links that we got. So in this iteration, we’ll do just that, and we’ll also assume we have a database in which we’d insert the links (for indexing). Any link will be inserted into the database via the insertIntoDatabase function!
- Program:
<?php
// List of all the links we have crawled!
$crawledLinks = array();

function followLink($url, $depth = 0) {

    // Access the global list of crawled links
    global $crawledLinks;
    $crawling = array();

    // Give up to prevent any seemingly infinite loop
    if ($depth > 5) {
        echo "<div style='color:red;'>The Crawler is giving up!</div>";
        return;
    }

    $options = array(
        'http' => array(
            'method'     => "GET",
            'user_agent' => "gfgBot/0.1"
        )
    );
    $context = stream_context_create($options);

    $doc = new DomDocument();
    @$doc->loadHTML(file_get_contents($url, false, $context));
    $links = $doc->getElementsByTagName('a');

    foreach ($links as $i) {
        $link = $i->getAttribute('href');
        if (ignoreLink($link)) continue;

        $link = convertLink($url, $link);

        if (!in_array($link, $crawledLinks)) {
            $crawledLinks[] = $link;
            $crawling[]     = $link;
            insertIntoDatabase($link, $depth);
        }
    }

    foreach ($crawling as $crawlURL) {
        echo("<span style='color:grey;margin-left:" . (10 * $depth) . ";'>"
            . "[+] Crawling <u>$crawlURL</u></span><br/>");
        followLink($crawlURL, $depth + 1);
    }

    if (count($crawling) == 0)
        echo("<span style='color:red;margin-left:" . (10 * $depth) . ";'>"
            . "[!] Didn't Find any Links in <u>$url!</u></span><br/>");
}

// Converts a Relative URL to an Absolute URL
// No conversion is done if it is already an Absolute URL
function convertLink($site, $path) {
    if (substr_compare($path, "//", 0, 2) == 0)
        // Protocol-relative URL: prepend the scheme of the original site
        return parse_url($site)['scheme'] . ':' . $path;
    else if (substr_compare($path, "http", 0, 4) == 0
          or substr_compare($path, "www.", 0, 4) == 0)
        return $path;               // Absolutely an Absolute URL!!
    else
        return $site . '/' . $path; // Treat it as a path relative to the site
}

// Whether or not we want to ignore the link
function ignoreLink($url) {
    return $url[0] == "#"
        or substr($url, 0, 11) == "javascript:";
}

// Print a message and insert into the array/database!
function insertIntoDatabase($link, $depth) {
    global $crawledLinks;
    echo("<span style='margin-left:" . (10 * $depth) . "'>"
        . "Inserting new Link:- <span style='color:green'>$link"
        . "</span></span><br/>");
    $crawledLinks[] = $link;
}

// Start the crawl from (what claims to be)
// the smallest website in the world
followLink("http://guimp.com/");
?>
- Output:
Inserting new Link:- http://guimp.com//home.html
[+] Crawling http://guimp.com//home.html
Inserting new Link:- http://www.guimp.com
Inserting new Link:- http://guimp.com//home.html/pong.html
Inserting new Link:- http://guimp.com//home.html/blog.html
[+] Crawling http://www.guimp.com
Inserting new Link:- http://www.guimp.com/home.html
[+] Crawling http://www.guimp.com/home.html
Inserting new Link:- http://www.guimp.com/home.html/pong.html
Inserting new Link:- http://www.guimp.com/home.html/blog.html
[+] Crawling http://www.guimp.com/home.html/pong.html
[!] Didn't Find any Links in http://www.guimp.com/home.html/pong.html!
[+] Crawling http://www.guimp.com/home.html/blog.html
[!] Didn't Find any Links in http://www.guimp.com/home.html/blog.html!
[+] Crawling http://guimp.com//home.html/pong.html
[!] Didn't Find any Links in http://guimp.com//home.html/pong.html!
[+] Crawling http://guimp.com//home.html/blog.html
[!] Didn't Find any Links in http://guimp.com//home.html/blog.html!
Explanation:
- Line 3: We create a global array – $crawledLinks – which contains all the links we have captured in this session. We’ll use it for lookups, to see whether or not a link is already in the database. Note that looking up a plain array with in_array() takes time proportional to the number of stored links; since PHP arrays are really hash tables, storing the URLs as keys and testing with isset() would scale better for large crawls, but for this small demo the plain array is good enough.
- Line 8: We tell the interpreter that we are using the global array $crawledLinks that we just created. In the next line we create a new array $crawling which simply contains all the links that we are currently crawling over.
- Line 31: We ignore all the links that do not point to an external page. A link could be an internal link, a deep link or a system link. This function doesn’t check every case (that would make it very long), only the two most common ones – when the link is an internal fragment and when it refers to JavaScript code. (A slightly more thorough version is sketched right after this list.)
- Line 33: We convert a relative link to an absolute link, and also do some other conversions (like //wikipedia.org to http://wikipedia.org or https://wikipedia.org, depending on the scheme of the original URL).
- Line 35: We check whether the $link we are iterating over is already in the database. If it is, we ignore it; if not, we add it to the database as well as to the $crawling array, so that we can follow the links on that page too.
- Line 43: Here the crawler recurses. It follows all the links that it has to follow (the links that were added to the $crawling array).
- Line 83: We call followLink("http://guimp.com/"); we use the URL http://guimp.com/ as the starting point just because it happens to be (or claims to be) the smallest website in the world.
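As promised in the Line 31 and Line 33 notes, here is a sketch of slightly more thorough versions of ignoreLink and convertLink. It is not part of the program above: the extra ignored prefixes (mailto:, tel:) and the rtrim()/ltrim() cleanup are my own additions, so treat this as a starting point rather than a complete URL resolver.
<?php
// Sketch: a slightly more thorough link filter and resolver.
function ignoreLink($url) {
    if ($url === '' or $url[0] == '#')            // empty links and fragments
        return true;
    foreach (array('javascript:', 'mailto:', 'tel:') as $prefix)
        if (stripos($url, $prefix) === 0)         // unwanted schemes
            return true;
    return false;
}

function convertLink($site, $path) {
    if (substr_compare($path, '//', 0, 2) == 0)   // protocol-relative URL
        return parse_url($site)['scheme'] . ':' . $path;
    if (preg_match('#^https?://#i', $path))       // already absolute
        return $path;
    // Otherwise resolve it relative to the site, avoiding duplicate slashes
    return rtrim($site, '/') . '/' . ltrim($path, '/');
}
?>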
More advanced spider-bot: In the previous iteration, we recursively followed all the links that we got on a page and added them to a database (which was just an array). But we added only the URL to the database; search-engines, however, store many fields for each page – the thumbnail, the author information, the date and time, and most importantly the title of the page and the keywords. Some even keep a cached copy of the page for faster search. For the sake of simplicity, however, we’ll only scrape out the title, description and keywords of the page.
Note: Which database you use – PostgreSQL, MariaDB, etc. – is left to you; we’ll only output “Inserting URL/Text…” messages, since working with external databases is out of this article’s scope. (A rough PDO-based sketch of what a real insertIntoDatabase could look like follows this note.)
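For reference, if you did want to store the records in a real database, insertIntoDatabase could look roughly like the sketch below. This is only an illustration: the DSN, the credentials and the crawled_pages table (with url, title, description, keywords and depth columns) are all made up for the example.
<?php
// Sketch: storing crawled records with PDO (MySQL/MariaDB flavour).
// The DSN, credentials and table/column names here are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8mb4',
               'crawler_user', 'secret',
               array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));

function insertIntoDatabase($link, $title, &$metaData, $depth) {
    global $pdo;
    $stmt = $pdo->prepare(
        'INSERT INTO crawled_pages (url, title, description, keywords, depth)
         VALUES (:url, :title, :description, :keywords, :depth)');
    $stmt->execute(array(
        ':url'         => $link,
        ':title'       => $title,
        ':description' => $metaData['description'],
        ':keywords'    => $metaData['keywords'],
        ':depth'       => $depth
    ));
}
?>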
The description and keywords are present in the meta tags. Some search-engines base their search (almost) entirely on this meta-data, while others don’t give it much relevance. Google largely ignores these tags: its ranking is based on the popularity and relevance of a page (using the PageRank algorithm, among other signals), and the keywords and description are generated rather than extracted from the meta tags. Google doesn’t penalize a website without a description or keywords, but it does penalize websites without titles. Our conceptual search engine (which will be built using this “advanced” spider-bot) will do the opposite: it will penalize websites without a description and keywords (it will still add them to the database, but give them a lower ranking), and it will not penalize websites without titles – it will simply use the URL of the website as the title.
- Program:
<?php
$crawledLinks = array();
const MAX_DEPTH = 5;

function followLink($url, $depth = 0) {
    global $crawledLinks;
    $crawling = array();
    if ($depth > MAX_DEPTH) {
        echo "<div style='color:red;'>The Crawler is giving up!</div>";
        return;
    }
    $options = array(
        'http' => array(
            'method'     => "GET",
            'user_agent' => "gfgBot/0.1"
        )
    );
    $context = stream_context_create($options);
    $doc = new DomDocument();
    @$doc->loadHTML(file_get_contents($url, false, $context));
    $links     = $doc->getElementsByTagName('a');
    $pageTitle = getDocTitle($doc, $url);
    $metaData  = getDocMetaData($doc);
    foreach ($links as $i) {
        $link = $i->getAttribute('href');
        if (ignoreLink($link)) continue;
        $link = convertLink($url, $link);
        if (!in_array($link, $crawledLinks)) {
            $crawledLinks[] = $link;
            $crawling[]     = $link;
            insertIntoDatabase($link, $pageTitle, $metaData, $depth);
        }
    }
    foreach ($crawling as $crawlURL)
        followLink($crawlURL, $depth + 1);
}

function convertLink($site, $path) {
    if (substr_compare($path, "//", 0, 2) == 0)
        return parse_url($site)['scheme'] . ':' . $path;
    else if (substr_compare($path, "http", 0, 4) == 0
          or substr_compare($path, "www.", 0, 4) == 0)
        return $path;
    else
        return $site . '/' . $path;
}

function ignoreLink($url) {
    return $url[0] == "#"
        or substr($url, 0, 11) == "javascript:";
}

function insertIntoDatabase($link, $title, &$metaData, $depth) {
    echo("Inserting new record {URL= $link"
        . ", Title = '$title'"
        . ", Description = '" . $metaData['description']
        . "', Keywords = ' " . $metaData['keywords']
        . "'}<br/><br/><br/>");
}

function getDocTitle(&$doc, $url) {
    $titleNodes = $doc->getElementsByTagName('title');
    if ($titleNodes->length == 0)
        return $url;
    // Collapse newlines in the title text
    $title = str_replace("\n", ' ', $titleNodes->item(0)->nodeValue);
    return (strlen($title) < 1) ? $url : $title;
}

function getDocMetaData(&$doc) {
    $metaData  = array();
    $metaNodes = $doc->getElementsByTagName('meta');
    foreach ($metaNodes as $node)
        $metaData[$node->getAttribute("name")] = $node->getAttribute("content");
    if (!isset($metaData['description']))
        $metaData['description'] = 'No Description Available';
    if (!isset($metaData['keywords']))
        $metaData['keywords'] = '';
    // Collapse newlines in the extracted text
    return array(
        'keywords'    => str_replace("\n", ' ', $metaData['keywords']),
        'description' => str_replace("\n", ' ', $metaData['description'])
    );
}

followLink("http://example.com/");
?>
- Output:
Inserting new record {URL= https://www.iana.org/domains/example, Title = 'Example Domain', Description = 'No Description Available', Keywords = ' '}
Explanation: There is no ground-breaking change here, but I’d like to explain a few things:
- Line 3: We create a new global constant MAX_DEPTH. Previously we simply used 5 as the maximum depth, but this time we use the MAX_DEPTH constant instead.
- Line 22 & Line 23: We get the title of the page into $pageTitle, and the description and keywords into the $metaData variable (an associative array). You can refer to lines 64 and 72 to see how that information is extracted. (A sketch after this list shows how a description could also be generated from the page body when no meta tag is present.)
- Line 31: We pass some extra parameters to the insertIntoDatabase function.
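As mentioned earlier, real search engines often generate a description from the page content when no meta description exists. A very naive version of that idea is sketched below; the helper name generateDescription() and the 160-character limit are my own choices, not something used in the program above.
<?php
// Sketch: derive a crude fallback description from the page body
// when no <meta name="description"> tag is present.
function generateDescription(&$doc, $maxLength = 160) {
    $body = $doc->getElementsByTagName('body');
    if ($body->length == 0)
        return 'No Description Available';
    // Collapse all whitespace in the visible text and trim it
    $text = trim(preg_replace('/\s+/', ' ', $body->item(0)->textContent));
    if ($text === '')
        return 'No Description Available';
    return (strlen($text) > $maxLength) ? substr($text, 0, $maxLength) . '...' : $text;
}

// Possible use inside getDocMetaData():
// if (!isset($metaData['description']))
//     $metaData['description'] = generateDescription($doc);
?>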
Issues with our web-crawler: We have created this web-crawler only for learning. Deploying it in production code (for example, making a search-engine out of it) can create some serious problems. The following are some issues with our web-crawler:
- It isn’t scalable. Our web-crawler cannot crawl billions of web-pages the way GoogleBot does.
- It doesn’t quite obey the standard for crawler communication with websites. It doesn’t follow the robots.txt of a site and will crawl a site even if the site administrator requests it not to. (A minimal robots.txt check is sketched after this list.)
- It is not automatic. Sure, it will “automatically” get all the URLs of the current page and crawl each one of them, but it’s not exactly automatic. It doesn’t have any concept of crawl frequency.
- It is not distributed. If two spider-bots are running, there is currently no way for them to communicate with each other (for example, to check that the other spider-bot is not crawling the same page).
- Parsing is way too simple. Our spider-bot will not handle encoded markup (or even encoded URLs, for that matter).
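To address the robots.txt issue in a very rough way, the crawler could fetch and parse the Disallow rules before visiting a page. The sketch below is deliberately naive – it only looks at the “User-agent: *” group and does simple prefix matching, ignoring wildcards, Allow rules and crawl-delay – so treat it as a starting point, not a compliant implementation.
<?php
// Sketch: a naive robots.txt check (prefix matching on "User-agent: *" rules only).
function isAllowedByRobots($url) {
    $parts    = parse_url($url);
    $robots   = $parts['scheme'] . '://' . $parts['host'] . '/robots.txt';
    $contents = @file_get_contents($robots);
    if ($contents === false)
        return true;                        // no robots.txt: assume allowed

    $path        = isset($parts['path']) ? $parts['path'] : '/';
    $appliesToUs = false;
    foreach (preg_split('/\r?\n/', $contents) as $line) {
        $line = trim($line);
        if (stripos($line, 'User-agent:') === 0)
            $appliesToUs = (trim(substr($line, 11)) === '*');
        else if ($appliesToUs and stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            if ($rule !== '' and strpos($path, $rule) === 0)
                return false;               // path matches a disallowed prefix
        }
    }
    return true;
}

// Inside followLink() one could then do:
// if (!isAllowedByRobots($url)) return;
?>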