Do you want to get the HTML source code of a webpage with Python and Selenium? In this article you will learn how to do that.
Selenium is a Python module for browser automation. You can use it to grab the code that webpages are made of: HyperText Markup Language (HTML).
What is HTML source? This is the code that is used to construct a web page. It is a markup language.
To get it, you first need to have selenium and the web driver installed. You can then let Python start the web browser, open the web page URL and grab the HTML source.
Install Selenium
To start, install the selenium module for Python.
pip install selenium
For Windows users, do this instead:
pip.exe install selenium
It’s recommended that you do that in a virtual environment using virtualenv.
If you use the PyCharm IDE, you can install the module from inside the IDE.
Make sure you have the web driver installed, or it will not work.
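If you are not sure whether everything is in place, a quick check like the one below can help. It is only a minimal sketch: the function names are made up for this example, and it assumes you use Firefox, whose driver executable is called geckodriver. Recent Selenium releases may be able to fetch a matching driver for you, so a driver that is missing from PATH is not necessarily a problem.

```python
import importlib.util
import shutil
import sys


def selenium_installed():
    """Return True if the selenium package can be found by Python."""
    return importlib.util.find_spec("selenium") is not None


def in_virtualenv():
    """Return True when Python runs inside a virtual environment."""
    # Inside a virtual environment, sys.prefix differs from the base install.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def find_driver(name="geckodriver"):
    """Return the full path of a driver executable on PATH, or None."""
    return shutil.which(name)


def report():
    """Print a short summary of the setup (hypothetical helper)."""
    print("selenium installed:", selenium_installed())
    print("virtual environment:", in_virtualenv())
    print("geckodriver on PATH:", find_driver() or "not found")
```

Calling report() prints the three checks, so you can see at a glance which part of the setup is missing.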
Selenium get HTML
You can retrieve the HTML source of a URL with the code shown below.
It starts the Firefox web browser, opens the webpage with the get() method and finally stores the webpage HTML with browser.page_source, so that you can output it.
# -*- coding: utf-8 -*-
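Put together, the whole flow could look roughly like this. It is a sketch under a few assumptions, not the only way to do it: the names get_html, start_browser and settle_seconds are invented for this example, the pause happens before the HTML is read, and quit() is used so that the browser is shut down even when loading the page fails.

```python
import time


def default_firefox():
    """Start Firefox through Selenium (needs selenium and geckodriver)."""
    # Imported here so the file can be loaded without selenium installed.
    from selenium import webdriver
    return webdriver.Firefox()


def get_html(url, start_browser=default_firefox, settle_seconds=2):
    """Open url in a browser and return the HTML source as a string."""
    browser = start_browser()
    try:
        # Load the page in the browser.
        browser.get(url)
        # Give the page a moment to finish loading (a crude, fixed pause).
        time.sleep(settle_seconds)
        # page_source holds the HTML of the current page.
        return browser.page_source
    finally:
        # Always shut the browser down, even if something went wrong.
        browser.quit()
```

With selenium and the driver installed, print(get_html("https://example.com")) would open the page and print its HTML. The URL here is just a placeholder, use any page you like.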
This is done in a few steps, starting with importing selenium and the time module.
from selenium import webdriver
import time
It starts the web browser with a single line of code. In this example we use Firefox, but any of the supported browsers will do (Chrome, Edge, or PhantomJS on older Selenium versions that still support it).
# start web browser
browser = webdriver.Firefox()
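If you want to switch browsers without touching the rest of your code, you could pick the browser by name. This is a small sketch: make_browser and BROWSER_CLASSES are names invented for this example, and each browser still needs its own driver to be installed.

```python
# Maps a browser name to the class of the same purpose in selenium.webdriver.
BROWSER_CLASSES = {
    "firefox": "Firefox",
    "chrome": "Chrome",
    "edge": "Edge",
}


def make_browser(name="firefox"):
    """Start the browser with the given name and return the driver object."""
    key = name.strip().lower()
    if key not in BROWSER_CLASSES:
        raise ValueError("unsupported browser: %r" % name)
    # Imported here so the name check works without selenium installed.
    from selenium import webdriver
    return getattr(webdriver, BROWSER_CLASSES[key])()
```

For example, make_browser("chrome") would start Chrome instead of Firefox, and an unknown name raises a ValueError before any browser is started.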
Next, the URL you want is opened with the get() method. This simply loads the page in the browser.
# get source code
Then you can use the attribute .page_source to get the HTML code.
html = browser.page_source
You can then optionally output the HTML source (or do something else with it).
time.sleep(2)
print(html)
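As an example of doing something else with it, the sketch below saves the HTML to a file and pulls out the page title. It only uses the Python standard library. The names save_html and extract_title are made up for this example, and html is assumed to be the string you got from browser.page_source.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title, or None if the HTML has no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.parts).strip()
    return title or None


def save_html(html, path):
    """Write the HTML to a file as UTF-8 and return the path."""
    path = Path(path)
    path.write_text(html, encoding="utf-8")
    return path
```

For instance, save_html(html, "page.html") stores the page on disk and extract_title(html) gives you the text of its title tag.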
Don’t forget to close the web browser.
# close web browser
browser.close()
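It is easy to forget this step, or to skip it by accident when an error occurs halfway. One way around that is a small context manager, sketched below. The name open_browser is made up for this example. Note that it calls quit() instead of close(): close() closes the current window, while quit() ends the whole browser session.

```python
from contextlib import contextmanager


@contextmanager
def open_browser(start_browser):
    """Start a browser and make sure it is shut down afterwards.

    start_browser is any callable that returns a driver object,
    for example webdriver.Firefox.
    """
    browser = start_browser()
    try:
        yield browser
    finally:
        # Runs on normal exit and when an exception is raised.
        browser.quit()
```

You would use it as "with open_browser(webdriver.Firefox) as browser:" and put the get() and page_source lines inside the with block. The browser is then shut down as soon as the block ends.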