Web scraping is a technique for fetching data from websites. Many websites don't let users save their data for personal use, and manually copy-pasting it is both tedious and time-consuming. Web scraping automates this data-extraction process. In this article, we will discuss how to download all images from a web page using Python.
Modules Needed
- bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python.
- requests: Requests allows you to send HTTP/1.1 requests extremely easily. This module also does not come built-in with Python.
- os: The OS module in Python provides functions for interacting with the operating system. It is part of Python's standard utility modules and provides a portable way of using operating-system-dependent functionality.
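If bs4 and requests are not already installed, they can usually be added with pip:
pip install bs4 requests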
Approach
- Import the modules.
- Get the HTML code of the page.
- Get the list of img tags from the HTML code using the findAll method in Beautiful Soup.
images = soup.findAll('img')
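Taken together, the first three steps might look like the following minimal sketch (assuming url holds the address of the page to scrape):

import requests
from bs4 import BeautifulSoup

# fetch the HTML code of the page
r = requests.get(url)

# parse it and collect every img tag on the page
soup = BeautifulSoup(r.text, 'html.parser')
images = soup.findAll('img')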
- Create a separate folder for downloading images using the mkdir method in os.
os.mkdir(folder_name)
- Iterate through all the images and get the source URL of each image.
- After getting the source URL, the last step is to download the image.
- Fetch the content of the image.
# image_link holds the source URL found in the previous step
r = requests.get(image_link).content
- Download the image using file handling.
# Enter the file name with an extension like jpg, png, etc.
with open("File Name", "wb+") as f:
    f.write(r)
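Putting the remaining steps together, a simplified sketch of the download loop could look like this; it assumes folder_name names an existing folder and skips the error handling that the full program below adds, while using the same attribute fallback order (data-srcset, data-src, data-fallback-src, src):

for i, image in enumerate(images):
    # prefer the lazy-loading attributes, then fall back to plain src
    image_link = (image.get('data-srcset') or image.get('data-src')
                  or image.get('data-fallback-src') or image.get('src'))
    if image_link is None:
        continue

    # fetch the image content and write it to a numbered file
    r = requests.get(image_link).content
    with open(f"{folder_name}/images{i+1}.jpg", "wb+") as f:
        f.write(r)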
Program:
Python3
from bs4 import BeautifulSoup
import requests
import os


# CREATE FOLDER
def folder_create(images):
    try:
        folder_name = input("Enter Folder Name:- ")
        # folder creation
        os.mkdir(folder_name)
    # if a folder already exists with that name, ask for another name
    except:
        print("Folder exists with that name!")
        folder_create(images)
        return

    # image downloading starts
    download_images(images, folder_name)


# DOWNLOAD ALL IMAGES FROM THAT URL
def download_images(images, folder_name):
    # initial count is zero
    count = 0

    # print the total number of images found on the page
    print(f"Total {len(images)} Images Found!")

    # checking if the number of images is not zero
    if len(images) != 0:
        for i, image in enumerate(images):
            # From the image tag, fetch the image source URL,
            # trying these attributes in order:
            # 1. data-srcset
            # 2. data-src
            # 3. data-fallback-src
            # 4. src
            # Here we will use exception handling:
            # first we search for "data-srcset" in the img tag
            try:
                image_link = image["data-srcset"]
            # then we search for "data-src" in the img tag, and so on
            except:
                try:
                    image_link = image["data-src"]
                except:
                    try:
                        image_link = image["data-fallback-src"]
                    except:
                        try:
                            image_link = image["src"]
                        # if no source URL is found, skip this tag
                        except:
                            continue

            # After getting the image source URL,
            # we try to get the content of the image
            try:
                r = requests.get(image_link).content
                try:
                    # if the content decodes as text, it is
                    # probably not a binary image, so skip it
                    r = str(r, 'utf-8')
                except UnicodeDecodeError:
                    # the content is binary, so the image download starts
                    with open(f"{folder_name}/images{i+1}.jpg", "wb+") as f:
                        f.write(r)

                    # counting the number of images downloaded
                    count += 1
            except:
                pass

        # it is possible that not all images were downloaded;
        # if all images were downloaded
        if count == len(images):
            print("All Images Downloaded!")
        # if not all images were downloaded
        else:
            print(f"Total {count} Images Downloaded Out of {len(images)}")


# MAIN FUNCTION START
def main(url):
    # content of the URL
    r = requests.get(url)

    # parse the HTML code
    soup = BeautifulSoup(r.text, 'html.parser')

    # find all images on the page
    images = soup.findAll('img')

    # call the folder create function
    folder_create(images)


# take the URL as input
url = input("Enter URL:- ")

# CALL MAIN FUNCTION
main(url)
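When run, the script first asks for a URL and then for a folder name; every image it manages to fetch is saved into that folder as numbered .jpg files (images1.jpg, images2.jpg, and so on).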
Output: