How to test Selenium scrapper with Python
This week I've been writing tests for a project which is using Selenium as a scrapper.
As you may know, Selenium is a testing framework, it's intended to be used while writing tests, not as a web crawler/scrapper.
But you can. Why? Because it runs a browser, and the browser is the real sh*t, so the Javascript gets executed, and we are happy. There are other solutions like Spynner or writing the scrapper in pure Javascript, but I felt comfortable using Selenium this way.
The problem
How do we test this scrapper? I want it to have tests, damn!
Solution
Save the static content while running the scrapper, then, serve it with a very small http server while testing. Yes, it's a bit tedious, but it delivers.
When should you save?
Whenever you need to. This is a good example:
url = 'https://betterexplained.com/articles/why-do-we-multiply-combinations/' driver = webdriver.Chrome() driver.get(url) save_current_state(driver.current_url, driver.page_source)
Can I see the code of save_current_state
? Here you go:
import os from urllib.parse import quote SAVE_SOURCE = True # disable in production TEST_LOCATION = 'tests/static' def save_current_state(url, source, location=None): if not SAVE_SOURCE: return None if location is None: location = TEST_LOCATION if not isinstance(source, str): raise TypeError('source must be a string') filename = quote(url.strip('/') or '/', safe='') + '.html' filepath = os.url.join(location, filename) with open(filepath, 'w') as f: f.write(source) return filepath
Notice how the url is used as the file name (safely parsed). This helps a lot to match urls. But of course, a case where something custom is required may happen, so you can tune the server to fit any case.
And the python server? A minor modification from here
import os import socket from threading import Thread from urllib.parse import quote from http.server import BaseHTTPRequestHandler, HTTPServer TEST_LOCATION = 'tests/static' class MockServerRequestHandler(BaseHTTPRequestHandler): def _set_headers(self): self.send_response(200) self.send_header('Set-Cookie', 'exampleid=c295IGVsIHVzdWFyaW8gZW5pdG8K') self.send_header('Content-type', 'text/html') # change at will self.end_headers() @property def _filepath(self): filename = quote(self.path.strip('/'), safe='') return os.path.join(TEST_LOCATION, '{0}.html'.format(filename)) def _read_from_file_or_404(self): try: f = open(self._filepath, 'rb') except FileNotFoundError: self.send_response(404) self.wfile.write(b'\n<html><body>404 Not Found!</body></html>') else: self.send_response(200) # needs an extra new line self.wfile.write(b'\n' + f.read()) f.close() def do_GET(self): self._set_headers() self._read_from_file_or_404() def do_POST(self): self._set_headers() self._read_from_file_or_404() def log_message(self, format, *args): """Do not write log messages to std. Disable to see the requests.""" return def get_free_port(hostname): s = socket.socket(socket.AF_INET, type=socket.SOCK_STREAM) s.bind((hostname, 0)) address, port = s.getsockname() s.close() return port def start_mock_server(hostname='localhost', port=None): if port is None: port = get_free_port(hostname) mock_server = HTTPServer((hostname, port), MockServerRequestHandler) mock_server_thread = Thread(target=mock_server.serve_forever) mock_server_thread.setDaemon(True) mock_server_thread.start() return '{hostname}:{port}'.format(hostname=hostname, port=port)
And finally the base test from which you will inherit, whenever you need to test the scrapper.
import unittest from my_project import const from tests import start_mock_server # or where you saved it class BaseScrapperTest(unittest.TestCase): @classmethod def setUpClass(cls): url = start_mock_server() const.HOSTNAME = url
Take a look at that last line, your project must have a central point where the HOSTNAME is set. Before testing, you need to tell to your application to hit your localserver.
Final notes
If you find hard to test some scrapper function, try dividing it into smaller functions, and testing them individually.
If a scrapper function does not include any condition, it's okay to return True
at the end,
and assert that boolean. If something goes wrong in the scrapper, we'll get noticied with an exception
and the test will throw an error. Also, if you wanted to receive a fail
instead of an error, which is more pythonic, you should do something like this in your test:
try: users_from_github() except ExceptionType: self.fail("users_from_github() raised ExceptionType unexpectedly!")
Try isolating the scrapper as much as possible from the rest of your project, whenever you need to use selenium, avoid including bussiness logic in it as well, this difficults testing and makes the code quite confusing.
Regards!