How to test a Selenium scraper with Python
This week I've been writing tests for a project that uses Selenium as a scraper.
As you may know, Selenium is a testing framework: it's intended for writing tests, not for use as a web crawler/scraper.
But you can use it that way. Why? Because it runs a real browser, so the JavaScript gets executed, and we are happy. There are other solutions, like Spynner or writing the scraper in pure JavaScript, but I felt comfortable using Selenium this way.
The problem¶
How do we test this scraper? I want it to have tests, mama!
Solution¶
Save the static content while running the scraper, then serve it with a very small HTTP server while testing. Yes, it's a bit tedious, but it delivers.
When should you save?
Whenever you need to. This is a good example:
from selenium import webdriver

url = 'https://betterexplained.com/articles/why-do-we-multiply-combinations/'
driver = webdriver.Chrome()
driver.get(url)
save_current_state(driver.current_url, driver.page_source)
Can I see the code of save_current_state? Here you go:
import os
from urllib.parse import quote

SAVE_SOURCE = True  ## disable in production
TEST_LOCATION = 'tests/static'

def save_current_state(url, source, location=None):
    if not SAVE_SOURCE:
        return None
    if location is None:
        location = TEST_LOCATION
    if not isinstance(source, str):
        raise TypeError('source must be a string')
    filename = quote(url.strip('/') or '/', safe='') + '.html'
    filepath = os.path.join(location, filename)
    with open(filepath, 'w') as f:
        f.write(source)
    return filepath
Notice how the URL is used as the file name (safely percent-encoded). This helps a lot when matching URLs. Of course, you may hit a case where something custom is required, so you can tune the server to fit any case.
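To make the naming scheme concrete, here is what the quoting produces for a sample URL (example.com is a stand-in, not a URL from the project):

```python
from urllib.parse import quote

url = 'https://example.com/articles/foo'
## Same expression save_current_state uses to build the file name
filename = quote(url.strip('/') or '/', safe='') + '.html'
print(filename)
## https%3A%2F%2Fexample.com%2Farticles%2Ffoo.html
```

Since every character except letters, digits and `_.-~` gets encoded, the whole URL collapses into one flat, filesystem-safe name, which is what makes the later lookup by `self.path` trivial.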
And the Python server? A minor modification of the one from here:
import os
import socket
from threading import Thread
from urllib.parse import quote
from http.server import BaseHTTPRequestHandler, HTTPServer

TEST_LOCATION = 'tests/static'

class MockServerRequestHandler(BaseHTTPRequestHandler):

    def _set_headers(self, status=200):
        self.send_response(status)
        self.send_header('Set-Cookie', 'exampleid=c295IGVsIHVzdWFyaW8gZW5pdG8K')
        self.send_header('Content-type', 'text/html')  ## change at will
        self.end_headers()

    @property
    def _filepath(self):
        filename = quote(self.path.strip('/'), safe='')
        return os.path.join(TEST_LOCATION, '{0}.html'.format(filename))

    def _read_from_file_or_404(self):
        ## Send the status and headers only once, with the right code
        try:
            f = open(self._filepath, 'rb')
        except FileNotFoundError:
            self._set_headers(404)
            self.wfile.write(b'<html><body>404 Not Found!</body></html>')
        else:
            with f:
                self._set_headers(200)
                self.wfile.write(f.read())

    def do_GET(self):
        self._read_from_file_or_404()

    def do_POST(self):
        self._read_from_file_or_404()

    def log_message(self, format, *args):
        """Do not write log messages to stderr. Remove to see the requests."""
        return

def get_free_port(hostname):
    s = socket.socket(socket.AF_INET, type=socket.SOCK_STREAM)
    s.bind((hostname, 0))
    address, port = s.getsockname()
    s.close()
    return port

def start_mock_server(hostname='localhost', port=None):
    if port is None:
        port = get_free_port(hostname)
    mock_server = HTTPServer((hostname, port), MockServerRequestHandler)
    mock_server_thread = Thread(target=mock_server.serve_forever)
    mock_server_thread.daemon = True
    mock_server_thread.start()
    return '{hostname}:{port}'.format(hostname=hostname, port=port)
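To see the round trip in action, here is a self-contained sketch of the same idea: save a page, serve it, fetch it back. The directory, page content and `Handler` below are stand-ins for this demo, not the project code:

```python
import os
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Thread
from urllib.parse import quote
from urllib.request import urlopen

## Throwaway directory standing in for tests/static
location = tempfile.mkdtemp()

## Save a fake "scraped" page, the way save_current_state would
filename = quote('some-article', safe='') + '.html'
with open(os.path.join(location, filename), 'w') as f:
    f.write('<html><body>hello</body></html>')

class Handler(BaseHTTPRequestHandler):
    ## Minimal version of MockServerRequestHandler for this demo
    def do_GET(self):
        path = os.path.join(location, quote(self.path.strip('/'), safe='') + '.html')
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()
        with open(path, 'rb') as f:
            self.wfile.write(f.read())

    def log_message(self, fmt, *args):
        return

server = HTTPServer(('localhost', 0), Handler)  ## port 0 picks a free port
Thread(target=server.serve_forever, daemon=True).start()

body = urlopen('http://localhost:%d/some-article' % server.server_port).read()
print(body)
## b'<html><body>hello</body></html>'
server.shutdown()
```

The scraper never notices the difference: it requests a path, and the server answers with whatever snapshot was saved under that name.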
And finally, the base test from which you will inherit whenever you need to test the scraper.
import unittest

from my_project import const
from tests import start_mock_server  ## or wherever you saved it

class BaseScraperTest(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        url = start_mock_server()
        const.HOSTNAME = url
Take a look at that last line: your project must have a central point where the HOSTNAME is set. Before testing, you need to tell your application to hit your local server.
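The point of that central constant is that the scraper never hardcodes the host, so reassigning it redirects every request. A minimal sketch of the idea; `const`, `HOSTNAME` and `article_url` are illustrative names, not the project's actual API:

```python
class const:
    ## In the real project this would live in my_project/const.py
    HOSTNAME = 'https://betterexplained.com'

def article_url(slug):
    ## Every scraper function builds its URLs through const.HOSTNAME,
    ## so pointing HOSTNAME at the mock server redirects all requests.
    return '{0}/articles/{1}/'.format(const.HOSTNAME, slug)

print(article_url('why-do-we-multiply-combinations'))
## https://betterexplained.com/articles/why-do-we-multiply-combinations/

## What setUpClass effectively does before the tests run:
const.HOSTNAME = 'http://localhost:8000'
print(article_url('why-do-we-multiply-combinations'))
## http://localhost:8000/articles/why-do-we-multiply-combinations/
```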
Final notes¶
If you find it hard to test some scraper function, try dividing it into smaller functions and testing them individually.
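For instance, a function that both drives the browser and parses the page is hard to test, but splitting the parsing out gives you a pure function you can feed a plain HTML string, saved or hand-written. A quick sketch with illustrative names (not from the project):

```python
def extract_title(page_source):
    ## Pure function: no Selenium, no network, trivial to test.
    ## Naive parsing on purpose; a real project might use an HTML parser.
    start = page_source.find('<title>') + len('<title>')
    end = page_source.find('</title>')
    return page_source[start:end].strip()

print(extract_title('<html><head><title> Hello </title></head></html>'))
## Hello
```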
If a scraper function does not include any condition, it's okay to
return True at the end and assert that boolean. If something goes
wrong in the scraper, we'll be notified by an exception and the
test will throw an error. Also, if you want a
fail instead of an error,
which is more pythonic, you can do something like this in your test:
try:
    users_from_github()
except ExceptionType:
    self.fail("users_from_github() raised ExceptionType unexpectedly!")
Try to isolate the scraper as much as possible from the rest of your project. Wherever you use Selenium, avoid including business logic as well; mixing the two makes testing difficult and the code quite confusing.
Regards!