{"id":6395,"date":"2020-06-21T06:24:42","date_gmt":"2020-06-21T06:24:42","guid":{"rendered":"https:\/\/www.ardorsys.com\/?p=6395"},"modified":"2021-07-13T12:51:36","modified_gmt":"2021-07-13T12:51:36","slug":"web-scraping-python-common-roadblocks-solutions","status":"publish","type":"post","link":"https:\/\/www.ardorsys.com\/blog\/web-scraping-python-common-roadblocks-solutions\/","title":{"rendered":"Web scraping with Python: common roadblocks and solutions"},"content":{"rendered":"<p>Web scraping has been used to extract data from websites almost from the time the World Wide Web was born. In the early days, scraping was mainly done on static pages \u2013 those with known elements, tags, and data.<\/p>\n<p>More recently, however, advanced technologies in <a href=\"https:\/\/www.ardorsys.com\/website-development-services\/\">web development<\/a> have made the task a bit more difficult. In this article, we\u2019ll explore how we might go about scraping data when new technology and other factors prevent standard scraping.<\/p>\n<h5>Traditional data scraping<\/h5>\n<p>As most websites produce pages meant for human readability rather than automated reading, web scraping mainly consisted of programmatically digesting a web page\u2019s mark-up data (think right-click, View Source), then detecting static patterns in that data that would allow the program to \u201cread\u201d various pieces of information and save it to a file or a database.<br \/>\n<img  loading=\"lazy\"  decoding=\"async\"  class=\"alignnone pk-lazyload\"  src=\"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABAQMAAAAl21bKAAAAA1BMVEUAAP+KeNJXAAAAAXRSTlMAQObYZgAAAAlwSFlzAAAOxAAADsQBlSsOGwAAAApJREFUCNdjYAAAAAIAAeIhvDMAAAAASUVORK5CYII=\"  alt=\"Data Scraping\"  width=\"1560\"  height=\"678\"  data-pk-sizes=\"auto\"  data-pk-src=\"https:\/\/bs-uploads.toptal.io\/blackfish-uploads\/uploaded_file\/file\/253814\/image-1589553330104-3887f4e1986e94fea6b7b2fbf7a2fbcb.png\" ><br \/>\nIf report data 
was needed, it could often be retrieved by passing either form variables or parameters with the URL. For example:<\/p>\n<pre><code>https:\/\/www.myreportdata.com?month=12&amp;year=2004&amp;clientid=24823<\/code><\/pre>\n<p><a href=\"https:\/\/www.ardorsys.com\/python-development\/\">Python<\/a> has become one of the most popular web scraping languages due in part to the various web libraries that have been created for it. One popular library,\u00a0<a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Beautiful Soup<\/a>, is designed to pull data out of HTML and XML files by allowing searching, navigating, and modifying tags (i.e., the parse tree).<\/p>\n<h5>Browser-based scraping<\/h5>\n<p>Recently, I had a scraping project that seemed pretty straightforward and I was fully prepared to use traditional scraping to handle it. But as I got further into it, I found obstacles that could not be overcome with traditional methods.<\/p>\n<p>Three main issues prevented me from using my standard scraping methods:<\/p>\n<ol>\n<li><strong>Certificate.<\/strong>\u00a0A certificate had to be installed to access the portion of the website where the data was. When I accessed the initial page, a prompt appeared asking me to select the proper certificate from those installed on my computer and click OK.<\/li>\n<li><strong>Iframes.<\/strong>\u00a0The site used iframes, which messed up my normal scraping. Yes, I could try to find all iframe URLs, then build a sitemap, but that seemed like it could get unwieldy.<\/li>\n<li><strong>JavaScript.<\/strong>\u00a0The data was accessed after filling in a form with parameters (e.g., customer ID and date range). Normally, I would bypass the form and simply pass the form variables (via URL or as hidden form variables) to the result page and see the results. 
But in this case, the form contained JavaScript, which didn\u2019t allow me to access the form variables in a normal fashion.<\/li>\n<\/ol>\n<p>So, I decided to abandon my traditional methods and look at a possible tool for browser-based scraping. This would work differently than normal \u2013 instead of going directly to a page, downloading the parse tree, and pulling out data elements, I would instead \u201cact like a human\u201d and use a browser to get to the page I needed, then scrape the data \u2013 thus, bypassing the need to deal with the barriers mentioned.<\/p>\n<h5>Selenium<\/h5>\n<p>In general,\u00a0<a href=\"https:\/\/www.selenium.dev\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Selenium<\/a>\u00a0is well-known as an open-source testing framework for <a href=\"https:\/\/www.ardorsys.com\/web-application-development\/\">web applications<\/a> \u2013 enabling\u00a0<a href=\"https:\/\/www.toptal.com\/qa\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">QA specialists<\/a>\u00a0to perform automated tests, execute playbacks, and implement remote control functionality (allowing many browser instances for load testing and multiple browser types). In my case, this seemed like it could be useful.<\/p>\n<p>My go-to language for web scraping is Python, as it has well-integrated libraries that can generally handle all of the functionality required. And sure enough, a\u00a0<a href=\"https:\/\/selenium-python.readthedocs.io\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Selenium library<\/a>\u00a0exists for Python. This would allow me to instantiate a \u201cbrowser\u201d \u2013 Chrome, Firefox, IE, etc. \u2013 then pretend I was using the browser myself to gain access to the data I was looking for. 
And if I didn\u2019t want the browser to actually appear, I could create the browser in \u201cheadless\u201d mode, making it invisible to any user.<\/p>\n<h5>Project setup<\/h5>\n<p>To start experimenting, I needed to set up my project and get everything I needed. I used a Windows 10 machine and made sure I had a reasonably recent Python version (3.7.3). I created a blank Python script, then loaded the libraries I thought might be required, using pip (the package installer for Python) to install any library I didn\u2019t already have. These are the main libraries I started with:<\/p>\n<ol>\n<li><a href=\"https:\/\/realpython.com\/python-requests\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Requests<\/a>\u00a0(for making HTTP requests)<\/li>\n<li><a href=\"https:\/\/docs.python.org\/3\/library\/urllib.request.html#module-urllib.request\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">URLLib3<\/a>\u00a0(URL handling)<\/li>\n<li><a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Beautiful Soup<\/a>\u00a0(in case Selenium couldn\u2019t handle everything)<\/li>\n<li><a href=\"https:\/\/selenium-python.readthedocs.io\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Selenium<\/a>\u00a0(for browser-based navigation)<\/li>\n<\/ol>\n<p>I also added some calling parameters to the script (using the argparse library) so that I could play around with various datasets, calling the script from the command line with different options. Those included customer ID, from-month\/year, and to-month\/year.<\/p>\n<h5>Problem 1 \u2013 the certificate<\/h5>\n<p>The first choice I needed to make was which browser I was going to tell Selenium to use. 
As I generally use Chrome, and it\u2019s built on the open-source Chromium project (also used by Edge, Opera, and Amazon Silk browsers), I figured I would try that first.<\/p>\n<p>I was able to start up Chrome in the script by adding the library components I needed, then issuing a couple of simple commands:<\/p>\n<pre><code># Load selenium components\r\nfrom selenium import webdriver\r\nfrom selenium.webdriver.common.by import By\r\nfrom selenium.webdriver.support.ui import WebDriverWait, Select\r\nfrom selenium.webdriver.support import expected_conditions as EC\r\nfrom selenium.common.exceptions import TimeoutException\r\n\r\n# Establish chrome driver and go to report site URL\r\nurl = \"https:\/\/reportdata.mytestsite.com\/transactionSearch.jsp\"\r\ndriver = webdriver.Chrome()\r\ndriver.get(url)\r\n<\/code><\/pre>\n<p>&nbsp;<\/p>\n<p>Since I didn\u2019t launch the browser in headless mode, the browser actually appeared and I could see what it was doing. It immediately asked me to select a certificate (which I had installed earlier).<\/p>\n<p>The first problem to tackle was the certificate. How to select the proper one and accept it in order to get into the website? In my first test of the script, I got this prompt:<br \/>\n<img  decoding=\"async\"  src=\"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABAQMAAAAl21bKAAAAA1BMVEUAAP+KeNJXAAAAAXRSTlMAQObYZgAAAAlwSFlzAAAOxAAADsQBlSsOGwAAAApJREFUCNdjYAAAAAIAAeIhvDMAAAAASUVORK5CYII=\"  alt=\"Data Scraping\"  class=\" pk-lazyload\"  data-pk-sizes=\"auto\"  data-pk-src=\"https:\/\/bs-uploads.toptal.io\/blackfish-uploads\/uploaded_file\/file\/253815\/image-1589553429285-5bad2e1ce1ed8e6f9589d88e5de079bd.png\" ><br \/>\nThis wasn\u2019t good. I did not want to manually click the OK button each time I ran my script.<\/p>\n<p>As it turns out, I was able to find a workaround for this \u2013 without programming. While I had hoped that Chrome had the ability to pass a certificate name on startup, that feature did not exist. 
However, Chrome does have the ability to autoselect a certificate if a certain entry exists in your Windows registry. You can set it to select the first certificate it sees, or else be more specific. Since I only had one certificate loaded, I used the generic format.<br \/>\n<img  decoding=\"async\"  src=\"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABAQMAAAAl21bKAAAAA1BMVEUAAP+KeNJXAAAAAXRSTlMAQObYZgAAAAlwSFlzAAAOxAAADsQBlSsOGwAAAApJREFUCNdjYAAAAAIAAeIhvDMAAAAASUVORK5CYII=\"  alt=\"Data Scraping\"  class=\" pk-lazyload\"  data-pk-sizes=\"auto\"  data-pk-src=\"https:\/\/bs-uploads.toptal.io\/blackfish-uploads\/uploaded_file\/file\/253816\/image-1589553502241-2efe1757ee581a4a1a048cc3635aa86b.png\" ><br \/>\nThus, with that set, when I told Selenium to launch Chrome and a certificate prompt came up, Chrome would \u201cAutoSelect\u201d the certificate and continue on.<\/p>\n<h5>Problem 2 \u2013 Iframes<\/h5>\n<p>Okay, so now I was in the site and a form appeared, prompting me to type in the customer ID and the date range of the report.<br \/>\n<img  decoding=\"async\"  src=\"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABAQMAAAAl21bKAAAAA1BMVEUAAP+KeNJXAAAAAXRSTlMAQObYZgAAAAlwSFlzAAAOxAAADsQBlSsOGwAAAApJREFUCNdjYAAAAAIAAeIhvDMAAAAASUVORK5CYII=\"  alt=\"Data Scraping\"  class=\" pk-lazyload\"  data-pk-sizes=\"auto\"  data-pk-src=\"https:\/\/bs-uploads.toptal.io\/blackfish-uploads\/uploaded_file\/file\/253817\/image-1589553553467-9dc1f78c01442759ef54b1d05a8aba7f.png\" ><br \/>\nBy examining the form in developer tools (F12), I noticed that the form was presented within an iframe. So, before I could start filling in the form, I needed to \u201cswitch\u201d to the proper iframe where the form existed. 
To do this, I invoked Selenium\u2019s switch-to feature, like so:<\/p>\n<pre><code># Switch to the iframe that contains the form\r\n# (find_elements with By also works on Selenium 4, where the find_*_by_* helpers were removed)\r\nframe_ref = driver.find_elements(By.TAG_NAME, \"iframe\")[0]\r\ndriver.switch_to.frame(frame_ref)  # returns None, so there is nothing to assign\r\n<\/code><\/pre>\n<p>Now in the right frame, I was able to determine the components, populate the customer ID field, and select the date drop-downs:<\/p>\n<pre><code># Find the Customer ID field and populate it\r\nelement = driver.find_element(By.NAME, \"custId\")\r\nelement.send_keys(custId)  # send a test id\r\n\r\n# Find and select the date drop-downs\r\nselect = Select(driver.find_element(By.NAME, \"fromMonth\"))\r\nselect.select_by_visible_text(from_month)\r\nselect = Select(driver.find_element(By.NAME, \"fromYear\"))\r\nselect.select_by_visible_text(from_year)\r\nselect = Select(driver.find_element(By.NAME, \"toMonth\"))\r\nselect.select_by_visible_text(to_month)\r\nselect = Select(driver.find_element(By.NAME, \"toYear\"))\r\nselect.select_by_visible_text(to_year)\r\n<\/code><\/pre>\n<h5>Problem 3 \u2013 JavaScript<\/h5>\n<p>The only thing left on the form was to \u201cclick\u201d the Find button, so it would begin the search. This was a little tricky as the Find button seemed to be controlled by JavaScript and wasn\u2019t a normal \u201cSubmit\u201d type button. 
Inspecting it in developer tools, I found the button image and was able to copy its XPath by right-clicking.<br \/>\n<img  decoding=\"async\"  src=\"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABAQMAAAAl21bKAAAAA1BMVEUAAP+KeNJXAAAAAXRSTlMAQObYZgAAAAlwSFlzAAAOxAAADsQBlSsOGwAAAApJREFUCNdjYAAAAAIAAeIhvDMAAAAASUVORK5CYII=\"  alt=\"Data Scraping\"  class=\" pk-lazyload\"  data-pk-sizes=\"auto\"  data-pk-src=\"https:\/\/bs-uploads.toptal.io\/blackfish-uploads\/uploaded_file\/file\/253818\/image-1589553607615-8e42826c0bf6b907f928945a9288e124.png\" ><br \/>\nThen, armed with this information, I located the element on the page and clicked it.<\/p>\n<pre><code># Find the 'Find' button by its XPath, then click it\r\ndriver.find_element(By.XPATH, \"\/html\/body\/table\/tbody\/tr[2]\/td[1]\/table[3]\/tbody\/tr[2]\/td[2]\/input\").click()\r\n<\/code><\/pre>\n<p>And voil\u00e0, the form was submitted and the data appeared! Now, I could just scrape all of the data on the result page and save it as required. Or could I?<\/p>\n<h5>Getting the data<\/h5>\n<p>First, I had to handle the case where the search found nothing. That was pretty straightforward. It would display a message on the search form without leaving it, something like\u00a0<em>\u201cNo records found.\u201d<\/em>\u00a0I simply searched for that string and stopped right there if I found it.<\/p>\n<p>But if results did come, the data was presented in divs with a plus sign (+) to open a transaction and show all of its detail. An opened transaction showed a minus sign (-) which when clicked would close the div. 
Clicking a plus sign would call a URL to open its div and close any open one.<br \/>\n<img  decoding=\"async\"  src=\"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABAQMAAAAl21bKAAAAA1BMVEUAAP+KeNJXAAAAAXRSTlMAQObYZgAAAAlwSFlzAAAOxAAADsQBlSsOGwAAAApJREFUCNdjYAAAAAIAAeIhvDMAAAAASUVORK5CYII=\"  alt=\"Data Scraping\"  class=\" pk-lazyload\"  data-pk-sizes=\"auto\"  data-pk-src=\"https:\/\/bs-uploads.toptal.io\/blackfish-uploads\/uploaded_file\/file\/253819\/image-1589553661163-565a16847922ef0ee91e150eae7e1c2d.png\" ><br \/>\nThus, it was necessary to find any plus signs on the page, gather the URL next to each one, then loop through each to get all data for every transaction.<\/p>\n<pre><code># Loop through transactions and count\r\nlinks = driver.find_elements(By.TAG_NAME, 'a')\r\nlink_urls = [link.get_attribute('href') for link in links]\r\nthisCount = 0\r\nisFirst = 1\r\nfor url in link_urls:\r\n    # Skip anchors without an href; match only URLs that link to transactions\r\n    if url and \"GetXas.do?processId\" in url:\r\n        if isFirst == 1:  # already expanded +\r\n            isFirst = 0\r\n        else:\r\n            driver.get(url)  # collapsed +, so expand\r\n        # Find closest element to URL element with correct class to get tran type\r\n        tran_type = driver.find_element(By.XPATH, \"\/\/*[contains(@href,'\/retail\/transaction\/results\/GetXas.do?processId=-1')]\/following::td[@class='txt_75b_lmnw_T1R10B1']\").text\r\n        # Get transaction status\r\n        status = driver.find_element(By.CLASS_NAME, 'txt_70b_lmnw_t1r10b1').text\r\n        # Add to count if transaction found\r\n        if tran_type in ['Move In', 'Move Out', 'Switch'] and status == \"Complete\":\r\n            thisCount += 1\r\n<\/code><\/pre>\n<p>In the above code, I retrieved the transaction type and the status of each transaction, then added to a count to determine how many transactions fit the specified rules. 
However, I could have retrieved other fields within the transaction detail, like date and time, subtype, etc.<\/p>\n<p>For this project, the count was returned to a calling application. However, it and other scraped data could have been stored in a flat file or a database as well.<\/p>\n<h5>Additional possible roadblocks and solutions<\/h5>\n<p>Numerous other obstacles might be presented while scraping modern websites with your own browser instance, but most can be resolved. Here are a few:<\/p>\n<ul>\n<li><strong>Trying to find something before it appears <\/strong>While browsing yourself, how often do you find that you are waiting for a page to come up, sometimes for many seconds? Well, the same can occur while navigating programmatically. You look for a class or other element \u2013 and it\u2019s not there! Luckily, Selenium has the ability to wait until it sees a certain element, and can time out if the element doesn\u2019t appear, like so:<\/li>\n<\/ul>\n<pre><code># Wait up to 10 seconds for the element; raises TimeoutException if it never appears\r\nelement = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, \"theFirstLabel\")))\r\n<\/code><\/pre>\n<ul>\n<li><strong>Getting through a Captcha <\/strong>Some sites employ Captcha or similar to prevent unwanted robots (which they might consider you). This can put a damper on web scraping and slow it way down.<\/li>\n<\/ul>\n<p>Simple prompts (like \u201cwhat\u2019s 2 + 3?\u201d) can generally be read and figured out easily. However, for more advanced barriers, there are libraries that can help crack them. 
Some examples are\u00a0<a href=\"https:\/\/2captcha.com\/software\/2captcha-python-api\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">2Captcha<\/a>,\u00a0<a href=\"https:\/\/deathbycaptcha.com\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Death by Captcha<\/a>, and\u00a0<a href=\"http:\/\/bypasscaptcha.com\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Bypass Captcha<\/a>.<\/p>\n<ul>\n<li><strong>Website structural changes<\/strong> Websites are meant to change \u2013 and they often do. Keep this in mind when writing a scraping script. Think about which methods to use to find the data, and which to avoid. Consider partial matching techniques, rather than trying to match a whole phrase. For example, a website might change a message from \u201cNo records found\u201d to \u201cNo records located\u201d \u2013 but if your match is on \u201cNo records,\u201d you should be okay. Also, consider whether to match on XPath, ID, name, link text, tag or class name, or CSS selector \u2013 and which is least likely to change.<\/li>\n<\/ul>\n<h5>Summary: Python and Selenium<\/h5>\n<p>This was a brief demonstration to show that almost any website can be scraped, no matter what technologies are used and what complexities are involved. Basically, if you can browse the site yourself, it generally can be scraped.<\/p>\n<p>Now, as a caveat, this does not mean that every website\u00a0<em>should<\/em>\u00a0be scraped. Some have legitimate restrictions in place, and there have been numerous\u00a0<a href=\"https:\/\/jaxenter.com\/data-scraping-cases-165385.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">court cases<\/a>\u00a0deciding the legality of scraping certain sites. 
On the other hand, some sites welcome and encourage data to be retrieved from their website and in some cases provide an API to make things easier.<\/p>\n<p>Either way, it\u2019s best to check with the terms and conditions before starting any project. But if you do go ahead, be assured that you can get the job done.<\/p>\n<p><em>The\u00a0<a href=\"https:\/\/www.toptal.com\/developers\/blog\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Toptal Engineering Blog<\/a>\u00a0is a hub for in-depth development tutorials and new technology announcements created by professional software engineers in the Toptal network. You can read the original piece written by\u00a0<a href=\"https:\/\/www.toptal.com\/resume\/neal-barnett\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Neal Barnett<\/a>\u00a0<a href=\"https:\/\/www.toptal.com\/python\/web-scraping-with-python\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">here<\/a>. Follow the Toptal Design Blog on\u00a0<a href=\"http:\/\/toptaldevs\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Twitter<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.linkedin.com\/showcase\/toptaldevelopers\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">LinkedIn<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"Web scraping has been used to extract data from websites almost from the time the World Wide Web was born. 
In the early days, scraping was mainly done on&#8230;\n","protected":false},"author":2,"featured_media":6451,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[44],"tags":[133],"class_list":{"0":"post-6395","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-web-development","8":"tag-python"},"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.ardorsys.com\/blog\/wp-json\/wp\/v2\/posts\/6395","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ardorsys.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ardorsys.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ardorsys.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ardorsys.com\/blog\/wp-json\/wp\/v2\/comments?post=6395"}],"version-history":[{"count":0,"href":"https:\/\/www.ardorsys.com\/blog\/wp-json\/wp\/v2\/posts\/6395\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.ardorsys.com\/blog\/wp-json\/wp\/v2\/media\/6451"}],"wp:attachment":[{"href":"https:\/\/www.ardorsys.com\/blog\/wp-json\/wp\/v2\/media?parent=6395"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ardorsys.com\/blog\/wp-json\/wp\/v2\/categories?post=6395"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ardorsys.com\/blog\/wp-json\/wp\/v2\/tags?post=6395"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}