modellingman wrote:
Although RandomAmbler's scraping tool (see several posts earlier on this thread) is certainly quick, no one is presently sure whether it is up to this rate of activity, or indeed whether the server it is running from or the Fool server it is targeting will tolerate such a rate over sustained periods. In any event, as I'm sure its author would acknowledge, it would need quite a bit of refinement to get it to the point where it could scrape to produce output suitable for loading into a DB (either directly or via intermediate text files).
You're not wrong, modellingman. I've spent some time today converting the tool to export posts into a PDF archive rather than saving them individually, and adding some useful information for sorting, such as each post's number on the board. On the whole this works fine, but I have noticed the odd post being overlooked when extracting large numbers in one go (1000+) - probably because the TMF site hasn't responded promptly or a missing-page response came back. Dealing with these niggling but important issues is the tricky bit, as it tends to require much more edge-case code than the happy path.
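For anyone wondering what that edge-case code might look like, here is a rough Python sketch of one way to handle it: retry a slow or empty response a few times with a growing pause, and keep a list of post numbers that never came back so they can be re-run later. This is not how the tool itself is written. The names `fetch_with_retry`, `scrape_range` and the `fetch` callback are made up for illustration, and I'm assuming a fetch function that returns the page text, returns None for a missing page, or raises on a timeout.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical fetcher type: takes a post number, returns the page text,
# or None if the site answered with a missing-page response.
Fetch = Callable[[int], Optional[str]]


def fetch_with_retry(fetch: Fetch, post_number: int, attempts: int = 3,
                     delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Try one post up to `attempts` times, pausing longer after each failure."""
    for attempt in range(attempts):
        try:
            page = fetch(post_number)
        except (TimeoutError, ConnectionError):
            page = None  # treat a slow or dropped response like a missing page
        if page is not None:
            return page
        if attempt < attempts - 1:
            sleep(delay * 2 ** attempt)  # 1x, 2x, 4x ... the base delay
    return None


def scrape_range(fetch: Fetch, first: int, last: int, attempts: int = 3,
                 delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep
                 ) -> Tuple[Dict[int, str], List[int]]:
    """Fetch posts first..last inclusive; return (posts, missing post numbers)."""
    posts: Dict[int, str] = {}
    missing: List[int] = []
    for number in range(first, last + 1):
        page = fetch_with_retry(fetch, number, attempts, delay, sleep)
        if page is None:
            missing.append(number)  # record the gap instead of skipping silently
        else:
            posts[number] = page
    return posts, missing
```

The point of returning the missing list is that a run of 1000+ posts can report exactly which numbers were overlooked, rather than leaving gaps that only show up when someone reads the archive.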
Anyway, if stooz has managed to scrape the entire set of discussion boards then that's really something - as I now know only too well!
Damian
PS FWIW, the tool is still here:
https://damiancannon.github.io/MotleyFoolDownloader/