Update on TMF data repository

Meatyfool · #1953

All,

I PMed TMFTarantula after his last update on the old site. I didn't ask his permission to post online so I'll keep things limited.

When he referred to the data as a bit raw, he meant that it is stored in a database. No surprises there then.

He did tell me that an example post of mine 280 chars long resulted in 40000 bytes of HTML, so scraping the whole site would equate to roughly 6 GB. He doesn't recommend it.

The obvious benefit to scraping is it is something we can do without involvement from techies at TMF Towers. Would still need licensing.

I'd imagine LOOTP and similar boards would be unscraped. If the scrape is designed to run on a board by board basis, we can spread the hit over the whole three months.

Now, the 64k question.

Has anyone made a formal approach to TMF to take on the existing boards after the end of the three months? If we are talking tech involvement their end, we need to be quick. I would not be at all surprised if the short timescales are down to the fact that they needed to make a decision about renewing annual software licenses.

Meatyfool..

Breelander · #1962

Meatyfool wrote: I would not be at all surprised if the short timescales are down to the fact that they needed to make a decision about renewing annual software licenses.

I would be very surprised. I can't see that any licences would be involved. The boards were said to be an in-house 'TMF Techie' design - cutting edge at the time when nothing like it had been seen before. Trouble was the 'time' was 1998 and the design has been stuck in a time-warp ever since.

modellingman · #2164

Meatyfool wrote:All,

I PMed TMFTarantula after his last update on the old site. I didn't ask his permission to post online so I'll keep things limited.

When he referred to the data as a bit raw, he meant that it is stored in a database. No surprises there then.

He did tell me that an example post of mine 280 chars long resulted in 40000 bytes of HTML, so scraping the whole site would equate to roughly 6 GB. He doesn't recommend it.

The obvious benefit to scraping is it is something we can do without involvement from techies at TMF Towers. Would still need licensing.

I'd imagine LOOTP and similar boards would be unscraped. If the scrape is designed to run on a board by board basis, we can spread the hit over the whole three months.

I guess that this is TMFTarantula's post that you are referring to: http://boards.fool.co.uk/hi-everyone-ra ... 58233.aspx
and the relevant bit of that post is

Would it be possible to export Board content to another system? Or licence it for 3rd party use?

Yes, either of those may be possible, but there is a very large amount of data and it's in a relatively 'raw' format, so it may be difficult to import it gracefully into another system. However, we could potentially pursue this option if a viable solution can be found.

I think this is a very positive response.

To my way of thinking, this signifies that the content of the old boards can potentially be preserved without the need to resort to some sort of scraping mechanism. There is an implication in TMFTarantula's response that the database content could be "dumped". I would hazard a guess that, given its age and the use of aspx pages for displaying its content, the underlying database is a SQL Server database and, if this is the case, it shouldn't be at all technically challenging to export the post content to a similar database for onward use elsewhere.

The basic requirement would be what in logical terms amounts to a "Posts table" (containing the individual posts and their content), a "Threads table" (allowing posts to be grouped into threads) and a "Boards table" (allowing threads to be grouped into boards). I suspect that the "Users table" wouldn't be made available because of data protection considerations. (It would, for example, expose users' email addresses.)

If the content is preserved in this way then the question of an ultimate repository does become slightly secondary. For example, this (lemonfool) is running on fairly standard phpBB software. Whilst it will no doubt be a little more complicated than outlined above in my simple logical schema of Posts, Threads, Boards and Users tables, there will be an underlying relational database sitting at the heart of any scalable bulletin board software and the real work will be in transforming the data (ie post content) from whatever database scheme is made available from TMF to whatever system is used as the ultimate repository. This work need not involve TMF at all and the reliance on TMF assistance would be limited to making their content available and documenting the format (such as table definitions) of that content.

In terms of your comments about 280 characters becoming 40,000 bytes of HTML, I'm not surprised.

I think this refers to a standard TMF webpage for viewing a post e.g. http://boards.fool.co.uk/if-tmf-history ... 57328.aspx. HTML is very verbose and a great deal of the content is oriented towards providing an attractive looking layout. This particular post is not long (just 1205 characters) but the webpage's representation in HTML comprises nearly 43,000 characters of text. In essence, the overhead on each webpage is around 42,000 characters. Even if a scraping approach were to be employed, it would not make sense to preserve this page overhead for every post. Instead, as each post is scraped, the overhead part would be discarded and only the "meat" of the post would be preserved. This is contained in a <blockquote> element with a class attribute of "pbmsg", so it is not difficult to identify, particularly if the HTML is transformed to a DOM representation.
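A minimal sketch of that extraction, using only Python's standard library. The only detail taken from the TMF pages is the <blockquote> element with class "pbmsg"; the sample page, the class name PostBodyExtractor and the function name extract_post_body are made up for illustration. This version keeps the text only and drops inline markup such as <b>, so a real scraper that wanted to keep those tags would need a little more work.

```python
from html.parser import HTMLParser


class PostBodyExtractor(HTMLParser):
    """Collects the text inside <blockquote class="pbmsg"> and ignores the rest of the page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # blockquote nesting depth once inside the target element
        self.chunks = []    # pieces of text found inside the target element

    def handle_starttag(self, tag, attrs):
        if tag != "blockquote":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # a quote nested inside the post body
        elif "pbmsg" in classes:
            self.depth = 1           # entering the post body

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_post_body(page_html):
    """Return the post text from one page, or an empty string if no post body is found."""
    parser = PostBodyExtractor()
    parser.feed(page_html)
    parser.close()
    return "".join(parser.chunks).strip()


# Invented stand-in for a TMF page: layout overhead around one post body.
sample_page = (
    "<html><body><div class='nav'>Boards | Help</div>"
    "<blockquote class=\"pbmsg\">Hello <b>Fools</b>, this is the post.</blockquote>"
    "<div class='footer'>Terms</div></body></html>"
)
print(extract_post_body(sample_page))  # Hello Fools, this is the post.
```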

If a scraping approach were to be employed, and I'm in agreement with TMFTarantula in not recommending this approach, then there would be some considerable challenges.

First, sheer volume. My example post above was not chosen casually. It identifies the number of posts on the non-company TMF boards as at the end of last week. There are well over 6 million of them. It is unlikely that scraping on this scale could be achieved, even in a three month timescale, without collaborative effort, so a virtual team of volunteers would probably be needed to do the scraping. Managing such a team to ensure that the output they produced was usable, consistent and complete would present its own challenges. If the problems of scale were to be reduced by deciding not to download every post, then how would those decisions be made? Another set of challenges there, no doubt.

Second, scraping posts by themselves is OK but overlooks their organisation into threads and boards. Here, the variety of display modes (unthreaded, threaded, expanded/collapsed, etc) on TMF for listing a series of posts presents traps, as only one of these is really suited to the task of identifying which thread and board each post belongs to. Getting everyone in the virtual team to follow a set of instructions exactly is not impossible but will require very carefully explained instructions, based on a thoroughly tested procedure, if the end results are to be usable.

Third, even if the challenges above can be overcome, it would then be necessary to pull together the output created by perhaps dozens of individuals into some sort of coherent whole (ie a database) which could then form the basis for making it publicly available. It is the coherence which is the problem here. There will undoubtedly be duplication, missing items, orphans and a whole host of other problems to be dealt with at this stage. If the boards are pulled at the end of 3 months, there may be no opportunity to go back and resolve such problems.
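To make the coherence problem concrete, here is a rough sketch of the kind of check that would have to be run over the merged output. It assumes each scraped post has been reduced to a record with a post_id and a parent_id (the post it replies to, or None for the first post in a thread). Those field names, the function name check_merged_posts and the sample data are all hypothetical.

```python
from collections import Counter


def check_merged_posts(posts, expected_ids):
    """Report duplicate, missing and orphaned posts in a merged scrape.

    posts: list of dicts with 'post_id' and 'parent_id' (None for a thread's first post).
    expected_ids: the post ids the scrape was supposed to cover.
    """
    counts = Counter(p["post_id"] for p in posts)
    seen = set(counts)
    duplicates = sorted(pid for pid, n in counts.items() if n > 1)
    missing = sorted(set(expected_ids) - seen)
    # An orphan is a reply whose parent post is not in the merged output.
    orphans = sorted({
        p["post_id"] for p in posts
        if p["parent_id"] is not None and p["parent_id"] not in seen
    })
    return {"duplicates": duplicates, "missing": missing, "orphans": orphans}


merged = [
    {"post_id": 101, "parent_id": None},
    {"post_id": 102, "parent_id": 101},
    {"post_id": 102, "parent_id": 101},   # same post scraped by two volunteers
    {"post_id": 105, "parent_id": 104},   # reply whose parent was never scraped
]
report = check_merged_posts(merged, expected_ids=range(101, 106))
print(report)
# {'duplicates': [102], 'missing': [103, 104], 'orphans': [105]}
```

Finding the problems is the easy part. Fixing missing items and orphans needs the source pages to still be online, which is why the three month deadline matters.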

Much better, IMHO, to pursue the option of persuading TMF to go down the route of making the post content available in bulk (as TMFTarantula has said may be possible) and, once the content has been secured, to identify a suitable mechanism to make it viewable on the web.

Unfortunately, I'm not really in a position to do much technical work over the next few months. I'm happily sitting in the sun on a Spanish rock in the Atlantic with only a very underpowered (Intel Atom) notebook and a not overly reliable internet connection. However, I am happy to contribute whatever expertise and ideas I do have and would suggest that, as a starting point, a single thread is identified for managing the task of acquiring the historic data from boards.fool.co.uk, as at the moment it appears to be spread over several threads and boards both here on lemonfool.co.uk and on boards.fool.co.uk.

modellingman (who seems to have become his Mum on becoming a lemonFool - fat finger syndrome!)

RandomAmbler · #2215

modellingmam wrote:If a scraping approach were to be employed, and I'm in agreement with TMFTarantula in not recommending this approach, then there would be some considerable challenges.

...

Much better, IMHO, to pursue the option of persuading TMF to go down the route of making the post content available in bulk (as TMFTarantula has said may be possible) and, once the content has been secured, to identify a suitable mechanism to make it viewable on the web.

modellingman (who seems to have become his Mum on becoming a lemonFool - fat finger syndrome!)

Good post modellingman. I agree that getting a dump of the database would be the best way to go - or at least a sub-set of related tables.

That said I've hacked together a download tool which extracts all of the posts for a user from a specific board and have made it available here: https://damiancannon.github.io/MotleyFoolDownloader/

The code isn't overly complicated (or particularly well organised at the moment) but I'm happy to allow others to take it further if useful and/or necessary.

Still let's hope that we can get an extract from the database which can be transformed and loaded somewhere else.

Damian

Gengulphus · #2219

modellingmam wrote:This work need not involve TMF at all and the reliance on TMF assistance would be limited to making their content available and documenting the format (such as table definitions) of that content.

I don't particularly want to be a wet blanket, but it's pretty clear that the TMF board system is old, near-impossible-to-change code. In my experience, such code is usually badly documented - and the exceptions are when it is undocumented! And bad documentation of data structures is usually a major culprit - basically, if the data structures were well-documented, you could use their definitions in a divide-and-conquer attack on the main code. Typically, the problems are that some sort of added feature or other upgrade has made changes to the data structures and the code using them, but hasn't updated the documentation of the data structures to match. Then another upgrade comes along and runs into some sort of problem due to the data structures not being as described, but manages to find some sort of work-around - and that work-around involves further changes to the data structures, which also don't get documented properly because of time pressure, inadequate understanding of the previous changes to the data structures, or both. Repeat ten or twenty times and you have a real mess on your hands - which by all accounts is what TMF has!

The point of which is that I'd be rather doubtful that TMF can supply the documentation you want. An out-of-date and flawed approximation to it, maybe, but I would definitely budget in some effort to investigate how well the actual data structures match the documentation, and where the answer is 'not well', to puzzle out what the documentation really ought to say...

Gengulphus

Breelander · #2254

Gengulphus wrote: ...I would definitely budget in some effort to investigate how well the actual data structures match the documentation, and where the answer is 'not well', to puzzle out what the documentation really ought to say...

Have a virtual Rec, Gengulphus.

I've done that migration between two well documented databases - Joomla to BBPress. There's also a problem of incompatible data structures (Joomla had a different structure for an opening post of a thread, while BBPress had the same structure for all posts with a flag to mark the OP). This can lead to a lot of manual corrections of a bulk 'find and replace' nature before importing to the new database. In my case it was worse, I had to manually add the title to each individual OP. Even on a small club site like mine the task was prohibitive - I ended up only importing the most significant forums.

modellingman · #2400

Gengulphus wrote:
The point of which is that I'd be rather doubtful that TMF can supply the documentation you want. An out-of-date and flawed approximation to it, maybe, but I would definitely budget in some effort to investigate how well the actual data structures match the documentation, and where the answer is 'not well', to puzzle out what the documentation really ought to say...

Gengulphus

I suppose what I had in mind was running some queries against the existing database to spew out the logical equivalents of Posts, Threads and Boards tables. That broadly would be the TMF work. The queries themselves would define the output results so, from this perspective, the documentation of the handed over content would be minimal. Some work would be necessary to specify the output requirement but that strikes me as being fairly straightforward and I've listed my starter for 10 below.

Of course, this is all based on assumptions that a) the technical ability to extract the data to meet these requirements can be found and b) the current database has not become so convoluted that nobody in Fool Towers understands it anymore. This would be where the limitations on the documentation of data structures would most likely manifest themselves. (As an aside, I suspect that if there are real holes in the documentation, these are much more related to the aspx scripts that query the underlying database and then generate the webpages served to the likes of you and me. The functionality which remembers which threads and posts a user has read appears fairly bespoke.)

But until somebody specs some data requirements out and asks TMF whether it is technically feasible to produce the data to meet these requirements, we'll never know. Then the only routes available to preserving the content will be either screen-scraping or some third-party mechanism like the Wayback Machine. Given the volume of content and the uncertainties on continued availability of the webpages, neither of these alternatives strikes me as being particularly attractive or feasible.

modellingman

Suggested requirements spec (initial thoughts):

Posts Table
- Post id
- Post text including the embedded <b>, <i>, <pre> and <a> tags
- Date/time of creation (submission)
- Username of post creator
- Subject of post
- No of recs. received (though this is very much a nice-to-have)
Keyed on Post id

Threads Table
- Thread id
- Thread title
Keyed on Thread id

Boards Table
- Board id
- Board title
Keyed on Board id

Threads/Posts relation
- Thread id
- Post id
Keyed jointly on Thread id and Post id

Boards/Threads relation
- Board id
- Thread id
Keyed jointly on Board id and Thread id
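
As a rough illustration only, the spec above could be expressed as a relational schema along these lines. SQLite is used here purely because it ships with Python, not because it is what TMF or any target forum runs. The table and column names, the column types and the sample rows are all assumptions made for the sketch.

```python
import sqlite3

SCHEMA = """
CREATE TABLE boards  (board_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE threads (thread_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE posts (
    post_id    INTEGER PRIMARY KEY,
    body_html  TEXT NOT NULL,   -- post text including embedded <b>, <i>, <pre>, <a> tags
    created_at TEXT NOT NULL,   -- date/time of submission
    username   TEXT NOT NULL,
    subject    TEXT NOT NULL,
    recs       INTEGER          -- nice-to-have, so allowed to be NULL
);
CREATE TABLE thread_posts (
    thread_id INTEGER NOT NULL REFERENCES threads(thread_id),
    post_id   INTEGER NOT NULL REFERENCES posts(post_id),
    PRIMARY KEY (thread_id, post_id)
);
CREATE TABLE board_threads (
    board_id  INTEGER NOT NULL REFERENCES boards(board_id),
    thread_id INTEGER NOT NULL REFERENCES threads(thread_id),
    PRIMARY KEY (board_id, thread_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented sample rows, just to show the tables joining up.
conn.execute("INSERT INTO boards VALUES (1, 'Example board')")
conn.execute("INSERT INTO threads VALUES (10, 'Greetings, Fool!')")
conn.execute(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?)",
    (100, "Hello <b>all</b>", "2016-11-01T09:00:00", "exampleuser", "Greetings, Fool!", None),
)
conn.execute("INSERT INTO thread_posts VALUES (10, 100)")
conn.execute("INSERT INTO board_threads VALUES (1, 10)")

rows = conn.execute("""
    SELECT b.title, t.title, p.username
    FROM posts p
    JOIN thread_posts  tp ON tp.post_id   = p.post_id
    JOIN threads       t  ON t.thread_id  = tp.thread_id
    JOIN board_threads bt ON bt.thread_id = t.thread_id
    JOIN boards        b  ON b.board_id   = bt.board_id
""").fetchall()
print(rows)  # [('Example board', 'Greetings, Fool!', 'exampleuser')]
```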

With the exception of the Thread id, all these data are currently exposed through the pages we know and have loved. In a scraping approach, it would be necessary to scrape posts on a thread-by-thread basis and to manage the creation and allocation of a Thread id as the scraping proceeded. Whilst it might be tempting to rely solely on Thread title as being the unique identifier of threads, this would be a poor choice. Whilst most posters generally follow a convention of not re-using old thread titles when "posting new", this is not guaranteed and there are examples of different threads having the same title. For example, many boards have as their initial thread a thread with the title "Greetings, Fool!"

Gromley · #2549

modellingmam wrote:...

With the exception of the Thread id, all these data are currently exposed through the pages we know and have loved. In a scraping approach, it would be necessary to scrape posts on a thread-by-thread basis and to manage the creation and allocation of a Thread id as the scraping proceeded. Whilst it might be tempting to rely solely on Thread title as being the unique identifier of threads, this would be a poor choice. Whilst most posters generally follow a convention of not re-using old thread titles when "posting new", this is not guaranteed and there are examples of different threads having the same title. For example, many boards have as their initial thread a thread with the title "Greetings, Fool!"

Broadly agree with your field list modellingman and I can confirm it is possible to scrape all of this - I wrote and executed a scraper a few years back when I got totally frustrated with the search function.

I don't agree with the above paragraph though.

> It is totally possible to scrape by post id, starting with 1 (in fact from memory I don't think they started from 1, but it is perfectly easy to work that out).
> In a development in recent years the URL of a post was changed to include part of the post title, but it still includes the post id. In addition, for any reply, the Subject field after the "Re" returns post ids as well.
> You can also use this "re" link to re-create threads - even differentiating between threads of the same name or you can use the "whole thread" link.

To scrape the data, the post URL in the format boards.fool.co.uk/13456940.aspx is all you need; the page it returns will give you everything required.
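
A small sketch of working with post ids and URLs in this way. The short URL format is the one quoted above. For the longer title-bearing URLs, the sketch assumes the post id is the run of digits immediately before ".aspx", and the example title URL is invented rather than a real post. The function names are hypothetical.

```python
import re

# Assumption: the post id is the run of digits immediately before ".aspx".
POST_ID_RE = re.compile(r"(\d+)\.aspx$")


def post_id_from_url(url):
    """Return the post id embedded in a post URL, or None if there isn't one."""
    match = POST_ID_RE.search(url.strip())
    return int(match.group(1)) if match else None


def post_url(post_id):
    """Build the short-form URL for a post id."""
    return f"http://boards.fool.co.uk/{post_id}.aspx"


print(post_id_from_url("boards.fool.co.uk/13456940.aspx"))                         # 13456940
print(post_id_from_url("http://boards.fool.co.uk/some-post-title-13456940.aspx"))  # 13456940
print(post_url(13456940))  # http://boards.fool.co.uk/13456940.aspx
```

A scraper working by post id would then simply step through a range of ids, calling post_url on each and skipping any id that returns no post.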

From memory though, scraping the boards is quite a slow process (or that may have been my internet connection) - so if that option has to be the one to pursue[*] it would be best to start early and/or to do it from within Fool Towers.

* Even having been granted access to the back end databases I have in a couple of instances in the past found it easier to scrape data from the "front end" where it is clear what each field means rather than extract from the back end tables which may have some prosaic field names.

With the apparent goodwill of TMF, this task is definitely achievable; it just needs someONE to go and do it. It looks like there is a longish list of potential volunteers - to which I'd add myself - but it would be good to have stooz confirm if this is in hand or if a volunteer is needed.

I've now become as guilty as anyone on speculating on the method, but the truth is IT CAN BE DONE, so we need someone to JFDI.

Cheers,

Gromley

Gromley · #2556

Sorry just to avoid offense to people that I REALLY do not wish to offend - I probably should have said stooz/clariman to confirm, but I'd kind of assumed that the former was leading on this bit. Anyway if it is not obvious I should clarify BOTH of you guys rock!

Clariman · #2562

Gromley wrote:Sorry just to avoid offense to people that I REALLY do not wish to offend - I probably should have said stooz/clariman to confirm, but I'd kind of assumed that the former was leading on this bit. Anyway if it is not obvious I should clarify BOTH of you guys rock!

No problem. My personal view on it is that importing legacy data into Lemonfool would bring problems of data volumes, copyright and technical issues. Linking to a repository or archive would be fine. I'd prefer to focus on keeping and growing this community and storing the new wealth of knowledge that it will create.

That said, there are others looking at this who have asked to speak to Stooz and me. I'm open minded and I think Stooz is going to speak to them.

C

gbjbaanb · #2570

Gromley wrote:Sorry just to avoid offense to people that I REALLY do not wish to offend - I probably should have said stooz/clariman to confirm, but I'd kind of assumed that the former was leading on this bit. Anyway if it is not obvious I should clarify BOTH of you guys rock!

I agree - I thought they'd be doing this. I'd love to see a copy of the TMF data. Whether it's a DB or a load of 'cached' files (I know one forum that doesn't use a DB for posts, but builds them into pre-rendered files for speed) doesn't matter to me; I'd see about migrating from whatever there was in place. I've done worse before!

For migrating into a forum, I doubt it'd be easy to add the posts to LemonFool - mainly because the LF has gained a lot of users and posts already, and partly because the forums do not match all the ones in the old TMF boards. Depending how the raw data is organised this could be easy to ignore certain boards, or difficult, but I expect the easiest way to migrate the TMF data is to pop it all into a brand new forum and then see about migrating some of the LF posts (if any) into it.

We can't all ask TMFTarantula for the data, so I'm going to leave that negotiation to Stooz and Clariman. Let us know if you want to delegate that task guys, but otherwise the ball's in your court.

modellingman · #2582

Gromley wrote:
modellingmam wrote:...

With the exception of the Thread id, all these data are currently exposed through the pages we know and have loved. In a scraping approach, it would be necessary to scrape posts on a thread-by-thread basis and to manage the creation and allocation of a Thread id as the scraping proceeded. Whilst it might be tempting to rely solely on Thread title as being the unique identifier of threads, this would be a poor choice. Whilst most posters generally follow a convention of not re-using old thread titles when "posting new", this is not guaranteed and there are examples of different threads having the same title. For example, many boards have as their initial thread a thread with the title "Greetings, Fool!"

Broadly agree with your field list modellingman and I can confirm it is possible to scrape all of this - I wrote and executed a scraper a few years back when I got totally frustrated with the search function.

I don't agree with the above paragraph though.

> It is totally possible to scrape by post id, starting with 1 (in fact from memory I don't think they started from 1, but it is perfectly easy to work that out).
> In a development in recent years the URL of a post was changed to include part of the post title, but it still includes the post id. In addition, for any reply, the Subject field after the "Re" returns post ids as well.
> You can also use this "re" link to re-create threads - even differentiating between threads of the same name or you can use the "whole thread" link.

To scrape the data, the post URL in the format boards.fool.co.uk/13456940.aspx is all you need; the page it returns will give you everything required.

OK, I can see where your disagreement is coming from. Effectively, you are suggesting that the "post id" of the first submitted post in each thread could be used as the "thread id" of that thread. That would work.
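
A sketch of how that could work once the "Re" links have been scraped: follow each post's reply link back until a post with no parent is reached, and use that post's id as the thread id. The reply_to mapping, the function name and the sample ids are invented for illustration, and the code assumes the reply links never form a loop.

```python
def assign_thread_ids(reply_to):
    """Map every post id to the id of the first post in its thread.

    reply_to maps post_id -> id of the post it replies to (None for a thread's first post).
    A reply whose parent was never scraped is attached to the earliest ancestor that is known.
    """
    thread_of = {}

    def root(post_id):
        path = []
        current = post_id
        while current not in thread_of:
            parent = reply_to.get(current)
            if parent is None:
                thread_of[current] = current   # first post: its own id is the thread id
                break
            path.append(current)
            current = parent
        for pid in path:                       # remember the answer for every post on the way
            thread_of[pid] = thread_of[current]
        return thread_of[post_id]

    return {pid: root(pid) for pid in reply_to}


# Two separate threads, which could both be titled "Greetings, Fool!" without clashing.
reply_to = {1: None, 2: 1, 3: 2, 7: None, 8: 7}
print(assign_thread_ids(reply_to))
# {1: 1, 2: 1, 3: 1, 7: 7, 8: 7}
```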

Gromley wrote:From memory though, scraping the boards is quite a slow process (or that may have been my internet connection) - so if that option has to be the one to pursue[*] it would be best to start early and/or to do it from within Fool Towers.

I agree wholeheartedly with this. As I have noted elsewhere ( http://boards.fool.co.uk/if-tmf-history ... 57328.aspx ) there are over 6.6m posts, excluding those on the company specific boards. There are possibly fewer than 100 days left before the boards potentially disappear off the web for good. So scraping would need to proceed at the rate of 66,000 posts per day, and much higher than this if the company boards are to be included.

Although RandomAmbler's scraping tool (see several posts earlier on this thread) is certainly quick, no one is presently sure whether it is up to this rate of activity or indeed whether the server it is running from or the Fool server it is targeting will tolerate such a rate over sustained periods. In any event, as I'm sure its author would acknowledge, it would need quite a bit of refinement to get it to the point where it could scrape to produce output suitable for loading into a DB (either directly or via intermediate text files).
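
The arithmetic behind that rate, worked through as a quick sketch. The 6.6m posts and 100 days are the figures above; the function name is made up, and the calculation assumes scraping runs around the clock with no downtime.

```python
def required_rate(total_posts, days_left):
    """Return (posts per day, posts per second, seconds available per post)."""
    per_day = total_posts / days_left
    per_second = per_day / (24 * 60 * 60)
    return per_day, per_second, 1 / per_second


per_day, per_second, seconds_each = required_rate(6_600_000, 100)
print(f"{per_day:,.0f} posts per day")            # 66,000 posts per day
print(f"{per_second:.2f} posts per second")       # 0.76 posts per second
print(f"{seconds_each:.2f} seconds per post")     # 1.31 seconds per post
```

So a single scraper would have to fetch and process a page roughly every 1.3 seconds, day and night, for the whole period.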

gbjbaanb wrote:We can't all ask TMFTarantula for the data, so I'm going to leave that negotiation to Stooz and Clariman. Let us know if you want to delegate that task guys, but otherwise the ball's in your court.

Given the effort they've put in to getting lemonfool up and running, I suspect they will want to keep their focus and energies on developing this and giving it a sustainable future. Whilst it's been a fantastic start, it is still early days. Given that they have taken the lead so far, I think it is right to view the ball as being in their court and I hope that they will give an early indication of whether they want to keep it that way or hand the historic data aspect on to others.

In addition, I think there is slightly more to do than just asking for the data. Specs of what's required and, doubtless, quite a bit of technical stuff which is beyond my ken will be needed for a successful data dump exercise. Beyond that there is of course the task of making it all available over the web.

Meatyfool · #2596

One thing that has stood out to me on the lemon boards is that a surprising number of people have said how they liked the TNT boards despite their age and lack of attention.

This is one driver for me personally to go with a scrape rather than try to rebuild a database.

Whenever you want to look at an old post you get the tmf "experience". And that I believe will be important to many.

Gromley has eliminated one concern I had, if he scraped in order to have better search experience then that is fantastic.

What we need is a scraper that can be directed to run on a per board basis. WinHTTrack can do this. I have been tinkering with it but unfortunately I'm lacking the time to make any real progress.

In regard to quantity of posts, lootp is 750k posts and lost is 600k. Do they really need to be scraped?

Meatyfool..

Gengulphus · #2687

Clariman wrote:My personal view on it is that importing legacy data into Lemonfool would bring problems of data volumes, copyright and technical issues.

I think the copyright problems should be reasonably straightforward, as long as TMF is willing to co-operate suitably. Basically, TMF's terms & conditions say that each poster has given TMF an irrevocable license to use the content the poster has supplied, in pretty much any way they wish, and that TMF can sub-license that license. So TMF grants Lemon Fool a sublicense on similar irrevocable, sublicensable terms, and Lemon Fool is entitled to use all the user-supplied content.

That does leave the question of the TMF-supplied content - but at least that's a matter of negotiating with one copyright holder, not with thousands of them, many of whom are probably untraceable! And most of it is a 'nice to have' rather than essential for understanding the user-supplied content.

Gengulphus

Clariman · #2694

Hi All

I will discuss this all with Stooz. There seems to be quite a demand. Getting Moderation in place is my next priority though, having re-jigged the board structure and added more places for Fools (Lemons?) to congregate.

Clariman

NomoneyNohoney · #2733

I think we should remain collectively as Fools, both as homage to our origins and for continuity with the new site.

melonfool · #2757

Meatyfool wrote:In regard to quantity of posts, lootp is 750k posts and lost is 600k. Do they really need to be scraped?

Meatyfool..

I would say no, but then I don't post on them.

I was quite looking forward to some of my older posts disappearing into the ether!

Mel

Gromley · #2979

Clariman wrote:No problem. My personal view on it is that importing legacy data into Lemonfool would bring problems of data volumes, copyright and technical issues. Linking to a repository or archive would be fine. I'd prefer to focus on keeping and growing this community and storing the new wealth of knowledge that it will create.

That said, there are others looking at this who have asked to speak to Stooz and me. I'm open minded and I think Stooz is going to speak to them.

C

Quite understand, and agree with, the desire to focus on growing lemonfool as the future. And in fact there is no reason at all that the creation of a lasting archive of all the TMF posts needs to be directly linked to the development of lemon fool. Perhaps someone else could lead on this aspect and host it on a different site?

I'd be happy to volunteer to take that on, although to be perfectly honest I may not be the best qualified to do that. I haven't looked at my old Fool-scraper-tool for probably about 8 years, and from memory I learnt most of what I did from others on TMF (I think it was Itsallaguess, and I recall a useful conversation with Gengulphus on the subject).

So if we want to make "the archive" a separate project, I'd be happy to either lead or support. The key thing though is that time is of the essence, so we need to agree a co-ordinated response pronto.

Grateful for any views.

Regards,

Gromley

stooz · #3030

Whilst you may want it, my experience is that only a very small percentage will actually be accessed, and even then the depth in age will be very shallow.

Having said that I've already downloaded/scraped the lot.
I just don't have permission to use it.

modellingman · #3097

stooz wrote:Having said that I've already downloaded/scraped the lot.

Respect, man! Respect!
