Just a small window into my IT freelance world. Here is the bid I just submitted for a job published on Upwork that I was invited to. I receive 2-5 such invitations every day, for projects ranging from white-hat hacking to Microsoft Office add-ins.
Hello,
thank you for the invitation!
I have developed a couple of scraping systems, both built on headless browser instances. The main reason is that anyone who wants to thwart scraping of their pages will implement something that works in a browser but not in a cURL-based request. A typical example is page content built dynamically with JavaScript, as opposed to the static HTML still found on many websites (though its share is diminishing). The core idea is that if your scraper drives actual scripted browser instances instead of a bot that issues bare HTTP requests, you have much less to worry about.
So I created a Java-based manager that launches up to X concurrent headless instances of Firefox and drives them to load, dynamically and concurrently, the pages stored in a MySQL database via PHP scripts (the user interface). A special scraping JavaScript is injected into every page by a custom Firefox extension. The script is customizable and can look for a number of things (elements/data) in each loaded page, depending on the domain and a host of other customizable/programmable factors. It then makes XHR calls to a local PHP script that stores the extracted information in whatever format is needed (JSON, database tables, Excel, plain text, etc.). It all runs on Linux.
I would be happy to modify this system to your spec. I think it will work well and will be a better long-term investment than a typical PHP/cURL-based scraper. I can do it for a fixed price of 3K USD, split into 3 well-defined milestones, each with demonstrable, working deliverables. Delivery of the complete system, working to your spec, would take about 3 weeks.
Looking forward to it!
Alex
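For readers curious what the "up to X concurrent instances" part of such a manager might look like, here is a minimal sketch in Java. It is not the code of the system described in the bid: the class name ScrapeManager, the run method, and the loader function are all hypothetical, and the loader is only a stand-in for "drive one headless browser to a URL and return what was extracted". The sketch shows just the concurrency cap, using a fixed-size thread pool from the standard library.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the real system.
final class ScrapeManager {
    private final int maxInstances;                 // the "X" from the bid
    private final Function<String, String> loader;  // stand-in for one browser page load

    ScrapeManager(int maxInstances, Function<String, String> loader) {
        if (maxInstances < 1) {
            throw new IllegalArgumentException("maxInstances must be >= 1");
        }
        this.maxInstances = maxInstances;
        this.loader = loader;
    }

    // Processes every URL with at most maxInstances loads in flight.
    // Returns a map of URL -> extracted data (or an error marker).
    Map<String, String> run(List<String> urls) {
        Map<String, String> results = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(maxInstances);
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    results.put(url, loader.apply(url));
                } catch (RuntimeException e) {
                    // One failing page should not stop the others.
                    results.put(url, "ERROR: " + e.getMessage());
                }
            });
        }
        pool.shutdown(); // accept no new work, let queued loads finish
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return results;
    }

    public static void main(String[] args) {
        // Assumption: a real loader would talk to a browser; this one fakes it.
        ScrapeManager manager = new ScrapeManager(3, url -> "data-from:" + url);
        Map<String, String> out =
                manager.run(List.of("https://example.com/a", "https://example.com/b"));
        out.forEach((url, data) -> System.out.println(url + " -> " + data));
    }
}
```

In a real setup the URL list would come from the database and the results would be handed to whatever storage step is needed. The thread pool only bounds how many page loads run at once, which is the role the X limit plays in the bid.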