How to test a Selenium scraper with Python
This week I've been writing tests for a project that uses Selenium as a scraper.
As you may know, Selenium is a testing framework: it's intended for writing tests, not for use as a web crawler/scraper.
But you can use it that way. Why? Because it runs a real browser, so the JavaScript gets executed, and we are happy. There are other solutions, like Spynner or writing the scraper in pure JavaScript, but I felt comfortable using Selenium this way.
The problem
How do we test this scraper? I want it to have tests, mama!
Solution
Save the static content while running the scraper, then serve it with a very small HTTP server while testing. Yes, it's a bit tedious, but it delivers.
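To make the idea concrete before getting to the real code, here is a minimal end-to-end sketch that leaves Selenium out entirely: it writes a page to a temporary folder, serves it from a throwaway local server, and fetches it back. The names `make_handler`, `SavedPageHandler` and `fetch_saved_page` are made up for this illustration and are not part of the project.

```python
import http.client
import os
import tempfile
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import quote


def make_handler(static_dir):
    """Build a handler class that serves saved pages from static_dir."""

    class SavedPageHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            name = quote(self.path.strip('/'), safe='') + '.html'
            try:
                with open(os.path.join(static_dir, name), 'rb') as f:
                    status, body = 200, f.read()
            except FileNotFoundError:
                status, body = 404, b'not found'
            self.send_response(status)
            self.send_header('Content-Type', 'text/html')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the test output quiet

    return SavedPageHandler


def fetch_saved_page(path, html):
    """Save html under path, serve it locally, and fetch it back."""
    with tempfile.TemporaryDirectory() as static_dir:
        # Step 1: what the scraper run would do (page source -> file).
        name = quote(path.strip('/'), safe='') + '.html'
        with open(os.path.join(static_dir, name), 'w', encoding='utf-8') as f:
            f.write(html)

        # Step 2: what the test would do (serve the file, then request it).
        server = HTTPServer(('127.0.0.1', 0), make_handler(static_dir))
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            port = server.server_address[1]
            conn = http.client.HTTPConnection('127.0.0.1', port, timeout=5)
            conn.request('GET', path)
            resp = conn.getresponse()
            result = resp.status, resp.read().decode('utf-8')
            conn.close()
            return result
        finally:
            server.shutdown()
            server.server_close()


status, body = fetch_saved_page('/articles/demo/', '<html><body>hi</body></html>')
```

In a real test the request in step 2 would come from the Selenium driver instead of `http.client`; the rest of the flow stays the same.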
When should you save?
Whenever you need to. This is a good example:
from selenium import webdriver

url = 'https://betterexplained.com/articles/why-do-we-multiply-combinations/'
driver = webdriver.Chrome()
driver.get(url)
# Store the rendered page so the tests can replay it later.
save_current_state(driver.current_url, driver.page_source)
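If the scraper visits many pages, one possible way to apply "whenever you need to" is to wrap navigation in a tiny helper so every page load is saved. This is only a sketch: `get_and_save` and `FakeDriver` are hypothetical names, and the fake driver exposes just the three attributes the helper touches so the example runs without a browser.

```python
def get_and_save(driver, url, save):
    """Load url in the driver, then hand the rendered page to save()."""
    driver.get(url)
    # current_url may differ from url after a redirect, so trust the driver.
    return save(driver.current_url, driver.page_source)


class FakeDriver:
    """Stand-in for a WebDriver (not a Selenium class), for illustration."""

    def __init__(self, pages):
        self.pages = pages
        self.current_url = None
        self.page_source = ''

    def get(self, url):
        self.current_url = url
        self.page_source = self.pages[url]


saved = {}
driver = FakeDriver({'http://example.test/a': '<html>a</html>'})
get_and_save(driver, 'http://example.test/a', saved.__setitem__)
```

With a real driver, `save` would be `save_current_state` and the dictionary would be replaced by files on disk.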
Can I see the code of save_current_state? Here you go:
import os
from urllib.parse import quote

SAVE_SOURCE = True  # disable in production
TEST_LOCATION = 'tests/static'


def save_current_state(url, source, location=None):
    """Write the page source to a file named after the quoted URL."""
    if not SAVE_SOURCE:
        return None
    if location is None:
        location = TEST_LOCATION
    if not isinstance(source, str):
        raise TypeError('source must be a string')
    # Quote every reserved character so the URL is a valid file name.
    filename = quote(url.strip('/') or '/', safe='') + '.html'
    filepath = os.path.join(location, filename)
    # Create the target folder on first use.
    os.makedirs(location, exist_ok=True)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(source)
    return filepath
Notice how the URL is used as the file name (safely quoted). This makes it much easier to match a request to its saved file. Of course, a case that needs something custom may come up, and you can tune the server to fit it.
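To see what that naming rule produces, here is a small sketch that applies the same `quote` call to the example URL and reverses it with `unquote`. The helper names `url_to_filename` and `filename_to_url` are invented for this illustration.

```python
from urllib.parse import quote, unquote


def url_to_filename(url):
    """Mirror the naming rule used by save_current_state."""
    return quote(url.strip('/') or '/', safe='') + '.html'


def filename_to_url(filename):
    """Recover the slash-stripped URL from a saved file name."""
    return unquote(filename[:-len('.html')])


url = 'https://betterexplained.com/articles/why-do-we-multiply-combinations/'
name = url_to_filename(url)
print(name)
# https%3A%2F%2Fbetterexplained.com%2Farticles%2Fwhy-do-we-multiply-combinations.html
print(filename_to_url(name))
# https://betterexplained.com/articles/why-do-we-multiply-combinations
```

Because the trailing slash is stripped before quoting, `/page` and `/page/` end up in the same file, which is worth keeping in mind if a site treats them differently.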
And the Python server? A minor modification from here:
import os
import socket
from threading import Thread
from urllib.parse import quote
from http.server import BaseHTTPRequestHandler, HTTPServer
TEST_LOCATION = 'tests/static'
class MockServerRequestHandler(BaseHTTPRequestHandler):
    """Serve the pages previously saved under TEST_LOCATION."""

    def _set_headers(self, status=200):
        # The status line must be sent once, before any header or body.
        self.send_response(status)
        self.send_header('Set-Cookie', 'exampleid=c295IGVsIHVzdWFyaW8gZW5pdG8K')
        self.send_header('Content-type', 'text/html')  # change at will
        self.end_headers()

    @property
    def _filepath(self):
        # Same quoting rule as save_current_state, applied to the request path.
        filename = quote(self.path.strip('/'), safe='')
        return os.path.join(TEST_LOCATION, '{0}.html'.format(filename))

    def _read_from_file_or_404(self):
        try:
            with open(self._filepath, 'rb') as f:
                body = f.read()
        except FileNotFoundError:
            self._set_headers(404)
            self.wfile.write(b'<html><body>404 Not Found!</body></html>')
        else:
            self._set_headers(200)
            self.wfile.write(body)

    def do_GET(self):
        self._read_from_file_or_404()

    def do_POST(self):
        self._read_from_file_or_404()

    def log_message(self, format, *args):
        """Do not write log messages to std. Disable to see the requests."""
        return
def get_free_port(hostname):
    s = socket.socket(socket.AF_INET, type=socket.SOCK_STREAM)
    s.bind((hostname, 0))