mirror of
https://github.com/len0rd/personal-website.git
synced 2025-03-01 12:02:14 -05:00
The 'decaptcha' system is an appendage to the [frontend](frontend-structure.md). It prepares Craigslist ads extracted by the frontend for processing by [stager](stager-structure.md). For stager to process a listing, the listing needs some form of contact information. Craigslist hides contact information for listings behind [CAPTCHAs](https://en.wikipedia.org/wiki/CAPTCHA). Decaptcha automates a method to circumvent this and scrape the required contact info.

<iframe width="560" height="315" src="https://www.youtube.com/embed/gXANnPw2onA?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

The above *private* video shows, in GUI form, how the current decaptcha process works. At present, Craigslist does not require captcha solutions from the user, but still hides the contact information behind an [invisible reCaptcha](https://developers.google.com/recaptcha/docs/invisible). This means that allowing the reCaptcha JavaScript to run in an emulated browser is enough to circumvent the captcha.

## Process

While Craigslist may change its captcha solution from time to time, the decaptcha process is generally the same:

1. The frontend scrape flags ads for which contact information could not be loaded. This flag typically takes the form of a custom 'stage' column value (e.g., for Craigslist, stage is set to 'CL Lookup unassigned'). Some other information about the ad is also recorded.

2. The decaptcha system queries the database and finds ads to be processed (if any). Ads are prioritized by the likelihood of them having good contact info. For Craigslist, this means that ads with a 'Show Contact Info' button are run first, since there is always good contact info behind this button.

3. The decaptcha system loads the main details page of each ad. The idea is to emulate how a user would interact with the page as closely as possible, so the contact info URL is not loaded directly.

4. If a reCaptcha is encountered (the "I'm not a robot" checkbox), the relevant information is collected from the page, and a solution request is sent to [2Captcha](https://2captcha.com/). A valid solution request to the (current) 2Captcha API is shown in the Historical Note below.

5. Decaptcha then waits for 2Captcha to solve the request, and inputs the solution. If an invisible reCaptcha was encountered, decaptcha waits for it to resolve itself.

6. On completion, the relevant information is collected, and the ad's database entry is updated (including a stage value saying it's good to go).

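Step 2's prioritization can be sketched as a filter followed by a sort. This is a minimal illustration, not the actual ByOwner code: the shape of the ad objects, the `hasShowContactButton` field, and the `selectAdsToProcess` helper are all hypothetical names. Only the 'CL Lookup unassigned' stage value comes from the steps above.

```javascript
// Stage value the frontend assigns to Craigslist ads awaiting decaptcha (step 1).
var PENDING_STAGE = 'CL Lookup unassigned';

// Hypothetical helper: pick the ads to run, most promising first.
// Assumes each ad looks like { id, stage, hasShowContactButton }.
function selectAdsToProcess(ads) {
    return ads
        .filter(function (ad) { return ad.stage === PENDING_STAGE; })
        .sort(function (a, b) {
            // Ads with a 'Show Contact Info' button sort ahead of the rest.
            return Number(b.hasShowContactButton) - Number(a.hasShowContactButton);
        });
}

var queue = selectAdsToProcess([
    { id: 1, stage: PENDING_STAGE, hasShowContactButton: false },
    { id: 2, stage: 'some other stage', hasShowContactButton: true },
    { id: 3, stage: PENDING_STAGE, hasShowContactButton: true }
]);
console.log(queue.map(function (ad) { return ad.id; })); // ad 3 first, then ad 1
```

In a real deployment the filter and ordering would more likely live in the database query itself; the in-memory version here only shows the ordering rule.
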
## Historical Note

Historically, this entire process was handled by [CasperJS](http://casperjs.org/) (a navigation scripting framework built atop [PhantomJS](http://phantomjs.org/)) and [2Captcha](https://2captcha.com/). Requesting a captcha solution from 2Captcha was as easy as the HTTP request shown below:

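```javascript
// Placeholder values: substitute real ones before use.
var USER_API_KEY = 'USER_API_KEY';
var CRAIGSLIST_SITE_KEY = 'CRAIGSLIST_SITE_KEY';
var REQUEST_URL = 'REPLY_BTN_or_SHOW_CONTACT_BTN_URL'; // reply or 'Show Contact Info' URL

var params = {
    method: 'userrecaptcha',
    key: USER_API_KEY,
    googlekey: CRAIGSLIST_SITE_KEY,
    pageurl: REQUEST_URL,
    json: '1'
};
// Serialize params into a URL-encoded query string.
var query = Object.keys(params).map(function (k) {
    return k + '=' + encodeURIComponent(params[k]);
}).join('&');
var url = 'http://2captcha.com/in.php?' + query;
```
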
However, CasperJS's browser emulation proved troublesome with the latest invisible reCaptcha on Craigslist. In addition, 2Captcha was always seen as a necessary evil. Solutions from 2Captcha were generated by human workers and could take upwards of 60 seconds for a single listing. This made decaptcha the slowest process in the entire ByOwner pipeline. While 2Captcha was a reliable and workable solution, avoiding it was preferable.

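The wait on human workers shows up in code as a polling loop: after submitting a request, the caller repeatedly asks 2Captcha whether the solution is ready. The sketch below covers only the bookkeeping around that loop, under my reading of 2Captcha's API (a `res.php` endpoint queried with `action=get`, answering `CAPCHA_NOT_READY` until solved). Treat those details as assumptions to check against the current 2Captcha documentation. `buildPollUrl` and `interpretPollResponse` are hypothetical helper names, and the HTTP calls themselves are left out.

```javascript
// Build the URL used to ask 2Captcha whether a submitted captcha is solved.
// requestId is the "request" value returned by the in.php submission.
function buildPollUrl(apiKey, requestId) {
    return 'http://2captcha.com/res.php?key=' + encodeURIComponent(apiKey) +
        '&action=get&id=' + encodeURIComponent(requestId) + '&json=1';
}

// Interpret a JSON poll response body.
// Assumed shapes: {"status":1,"request":"<token>"} once solved,
// {"status":0,"request":"CAPCHA_NOT_READY"} while a worker is still on it.
function interpretPollResponse(body) {
    var res = JSON.parse(body);
    if (res.status === 1) {
        return { done: true, token: res.request };
    }
    if (res.request === 'CAPCHA_NOT_READY') {
        return { done: false, retry: true };
    }
    // Anything else is treated as an error code; retrying will not help.
    return { done: true, error: res.request };
}

console.log(buildPollUrl('USER_API_KEY', '12345'));
console.log(interpretPollResponse('{"status":0,"request":"CAPCHA_NOT_READY"}'));
```

A caller would sleep between polls and give up after a timeout. With solutions taking upwards of 60 seconds, that timeout has to be generous.
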
These issues presented an opportunity to move the entire decaptcha system into the Java stack already built for ByOwner. The latest solution uses [Selenium](http://www.seleniumhq.org/): a multi-language API that plugs into various browser drivers (Chrome, Firefox) and runs them programmatically. This solution allows very user-like scraping, with similar or better resource utilization compared to CasperJS. Given that this solution also avoids 2Captcha, the process is 2-4x faster.