Logging Into Websites Using Request
Solution 1:
This is going to be quite tricky, and you may want to use a more sophisticated tool like Selenium, which can emulate a full browser.
Otherwise, you will need to investigate what cookies or other kind of authentication the site requires before it will log you in. Note all the cookies being passed behind the scenes -- it is rarely as simple as just submitting the username and password. You can see exactly what information the browser sends by watching the Network tab in your browser's developer tools.
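As a minimal sketch of that approach: a requests.Session keeps cookies between requests, so you can GET the login page first, then POST the same fields the browser sends. The URL, the field names, and the extract_csrf helper below are all assumptions for illustration -- copy the real ones from the Network tab of your target site.

```python
# Sketch of a form login with requests.Session.  LOGIN_URL, the field
# names, and extract_csrf() are hypothetical -- replace them with what
# the Network tab shows your site actually sends.
import requests

LOGIN_URL = "https://example.com/login"  # hypothetical endpoint


def extract_csrf(html):
    # Placeholder: on a real site you would parse the hidden CSRF token
    # out of the login form's HTML here.
    return ""


def build_payload(username, password, csrf_token):
    """Mirror every field the browser's login POST sends, not just the
    username and password -- hidden inputs and tokens matter too."""
    return {"username": username, "password": password, "csrf": csrf_token}


def login(username, password):
    session = requests.Session()  # keeps cookies across requests
    # GET the login page first so the session collects its cookies and
    # you can pull any CSRF token out of the page body.
    page = session.get(LOGIN_URL)
    csrf_token = extract_csrf(page.text)
    session.post(LOGIN_URL, data=build_payload(username, password, csrf_token))
    return session  # now (hopefully) authenticated; reuse it for further requests
```

If the POST succeeds, further requests made through the returned session carry the authentication cookies automatically.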
Finally, if you are worried that Selenium might be sluggish (it is -- after all, it is doing the same thing a user would do when opening a browser and clicking around), you can try something like CasperJS, though its learning curve is quite a bit steeper than Selenium's -- you might want to try Selenium first.
Solution 2:
Scraping sites can be hard.
Some sites send you well-formed HTML, and all you need to do is search within it to find the data, links, or whatever else you need for scraping.
Some sites send you poorly-formed HTML. Browsers have, over the years, become pretty accepting of "bad" HTML and do their best to interpret what the markup is trying to do. The downside is that a strict parser may fail to decipher such HTML: you need something that can cope with fuzzy input, or you can brute-force it with regex. Your use of xpath only works if the resulting HTML forms a well-formed XML document.
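To illustrate the difference, Python's standard-library html.parser is lenient the way a browser is: it will happily pull links out of markup (unclosed tags, unquoted attributes) that a strict XML/XPath parser would reject. A minimal sketch:

```python
# A lenient parser extracting hrefs from deliberately "bad" HTML that
# a strict XML parser would choke on (unclosed tags, unquoted attrs).
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href of every <a> tag we encounter.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


bad_html = "<p>unclosed<a href=/catalog?id=42>Graphs of Wrath<br>"
parser = LinkExtractor()
parser.feed(bad_html)
print(parser.links)  # ['/catalog?id=42']
```

The same input fed to a strict XML parser raises a parse error, which is why xpath-based scraping only works on well-formed documents.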
Some sites (more and more these days) send a bit of HTML, plus JavaScript, and perhaps JSON, XML, or whatever else to the browser. The browser then constructs the final HTML (the DOM) and displays it to the user. That's what you have here.
You want to scrape the final DOM, but that's not what the site sends you. So you either scrape what they do send -- for example, you work out that the link you want can be built from the JSON they ship, so {"books": [{"title": "Graphs of Wrath", "code": "a88kyyedkgH"}]} becomes example.com/catalog?id=a88kyyedkgH.
Or you scrape through a browser (e.g. using Selenium), letting the browser make all the requests and build up the DOM, and then you scrape the result. It's slower, but it works.
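The first route can be sketched in a few lines: parse the JSON payload directly and build the catalog links yourself. The payload shape and URL pattern below mirror the hypothetical example above, not any real site.

```python
# Sketch: when the site ships data as JSON instead of finished HTML,
# skip the DOM entirely -- parse the JSON and build the links yourself.
import json

# Hypothetical payload, in the shape of the example above.
payload = '{"books": [{"title": "Graphs of Wrath", "code": "a88kyyedkgH"}]}'
data = json.loads(payload)

# Turn each book code into the catalog URL the page would have rendered.
links = ["https://example.com/catalog?id=" + book["code"] for book in data["books"]]
print(links)  # ['https://example.com/catalog?id=a88kyyedkgH']
```

In practice you would fetch the JSON with the same HTTP client you use for the rest of the scrape, but the link-building step looks the same.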
When it gets hard, consider:
- The site probably doesn't want you to be doing this, and we webmasters have just as many tools to make your life harder and harder.
- Alternatively, there may be a published API designed for you to get most of the information (Amazon is a great example). (My guess is Amazon knows it can't beat all the webscrapers, so it's better for them to offer a way which doesn't consume so many resources on their main servers.)