Depending on who you're talking to and how much they like to debate semantics, data mining and machine learning are essentially the same thing. The important point is that incredible amounts of data are needed in order to mine golden nuggets of useful information. The Netflix dataset used for the Netflix Prize, for instance, is well over 4GB; to make matters worse, unless one writes a lot of skillfully efficient code, loading it into a data structure takes quite a bit more space. On top of that, raw data isn't very useful unless you also have room to fit whatever models you're trying to construct. Despite my relatively short presence and shorter sentience on this Earth, I remember a time when 4GB was a huge capacity for a hard disk. Of course, these days 4GB will fit on the increasingly outdated DVD... 4GB can even fit on a plastic sliver of flash memory less substantial than a humble dime. In a time of terabyte hard drives costing less than a trip to the grocery store, 4GB seems laughably diminutive.
However, there's a significant issue here! Most people know that a computer has at least two types of memory: RAM and a hard drive (there are more, to be covered momentarily). Why two types of memory? Why not just use a hard drive? The answer is simple: getting data from a hard drive takes roughly 100,000 times longer than getting it from RAM! While a specially built computer could run with only a hard drive, it would be so unbelievably slow that nobody in their right mind would ever use it. For a good number of tasks, like listening to music and looking at pictures, a hard drive works just fine. The reason is that this data is only read: once it has been read and used (say, the sound it represents has been sent to the speakers), it can be thrown away.
However, the more important, invisible bits of data that allow a computer to run are most often handled very differently: once read and processed, the results are stored so that they can be used later. Imagine, for instance, that you have a counter (counters are extremely common in computers and programming) that counts the number of mouse clicks. If the processor were to take the stored counter, add one, and throw the result away, then the next time it read the counter it would get the number the counter started at. If you have a program that displays some message when you click 10 times, the message would never be displayed. Obviously, in order for your program to work, the CPU must be able to remember how many clicks have happened, so it must both read and write. For this simple example, the time it takes to read from and write to a hard drive is OK--even the fastest human clicker is inconceivably slow compared to the inner workings of a computer. However, if this count is read and written millions of times a second, the time it takes to access the hard drive becomes an incredible bottleneck. In fact, this kind of situation is extremely common (hard drives are the biggest bottleneck in a computer), which is why we have RAM: computers simply need a place that can be accessed very quickly in order to work fast enough for us not to prefer watching grass grow. For a bit of extra credit, let me point out that as far as the CPU is concerned, even RAM is dreadfully slow. See why after the jump.
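
For the curious, the click counter described above can be sketched in a few lines of Python. This is only an illustration: the dictionary `storage` stands in for wherever the counter lives (RAM or disk), and the function names are made up for this example.

```python
# Hypothetical sketch of the click counter. `storage` stands in for
# wherever the counter is kept; all names here are illustrative only.

def click_read_only(storage):
    """Read the counter, add one, and throw the result away."""
    value = storage["clicks"]      # read
    _ = value + 1                  # add one, but never store the result
    return storage["clicks"]       # still the starting value


def click_read_write(storage):
    """Read the counter, add one, and write the result back."""
    value = storage["clicks"]      # read
    storage["clicks"] = value + 1  # write
    return storage["clicks"]


def message_shown(click, presses=10, threshold=10):
    """Return True if the 'you clicked 10 times' message would appear."""
    storage = {"clicks": 0}
    for _ in range(presses):
        if click(storage) >= threshold:
            return True
    return False


print(message_shown(click_read_only))   # False: the count never advances
print(message_shown(click_read_write))  # True: the count reaches 10
```

The only difference between the two versions is the write, which is exactly the part that becomes expensive when the storage is slow.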
Let's consider a 3 gigahertz processor. 3 GHz means that the processor runs through 3 billion cycles per second. One cycle doesn't represent one calculation; in fact the relationship is so complicated that even I'm not very interested in the details, so let's assume that it takes 3 cycles to complete one calculation (technically I mean instruction, but for our purposes calculation will suffice). This means that the processor can only do a billion calculations every second. Pretty slow, eh? Now talking about a billionth of a second can get tiring, but fortunately there's a shortcut: a billionth of a second is also known as a nanosecond (ns), so our CPU can do one calculation per nanosecond. Accessing RAM takes about 60 ns, but what does this mean? Let's suppose our computer is busy for one whole second with a simple task: read a value from RAM (60 ns), add one to it (1 ns), write the result back to RAM (60 ns), and repeat. This loop will run about 8,264,462 times (1 s / 121 ns), and for about 99% of that second the CPU will be twiddling its digits, doing nothing, waiting for the RAM. Clearly having a computer waiting 99% of the time is a tremendous waste! This is why there is another, even faster type of memory built right onto the CPU chip, called cache (actually several levels of it, known as L1, L2, and sometimes L3), amongst a number of other technical and generally uninteresting efficiency boosts. We might be inclined to think it obvious that all our RAM should simply be L1, which would be a great idea if it weren't for the fact that L1 is at least 10,000 times more expensive than hard drive space.
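
If you'd like to check the arithmetic, here is a small Python sketch of the read, add, write loop. The numbers are the ones assumed above (3 GHz, 3 cycles per calculation, 60 ns per RAM access); the variable names are just for illustration.

```python
# Back-of-the-envelope check of the read/add/write loop, using the
# figures assumed in the text. Everything is kept in whole nanoseconds.

CLOCK_HZ = 3_000_000_000      # 3 GHz processor
CYCLES_PER_CALC = 3           # simplifying assumption from the text
RAM_ACCESS_NS = 60            # rough RAM access time from the text
NS_PER_SECOND = 1_000_000_000

calcs_per_second = CLOCK_HZ // CYCLES_PER_CALC   # 1,000,000,000
calc_ns = NS_PER_SECOND // calcs_per_second      # 1 ns per calculation

# One pass of the loop: read from RAM, add one, write back to RAM.
loop_ns = RAM_ACCESS_NS + calc_ns + RAM_ACCESS_NS
iterations = NS_PER_SECOND // loop_ns
waiting_fraction = (2 * RAM_ACCESS_NS) / loop_ns

print(f"one pass of the loop: {loop_ns} ns")             # 121 ns
print(f"passes per second:    {iterations:,}")           # 8,264,462
print(f"time spent waiting:   {waiting_fraction:.1%}")   # 99.2%
```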
Moving on, my very long-winded and educational prelude is meant to suggest that shuttling the Netflix data back and forth between a hard disk and the CPU is not an option--even very efficient algorithms could take decades to complete! Recently my personal adventures in machine learning have become bounded by my "limited" amount of RAM (my academic adventures take place on a box that is adequately equipped... for now; for some things there will never be enough RAM), and as such I've been looking into some beefier hardware.
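
To put a rough number on why working from disk is not an option, here is a deliberately pessimistic Python sketch. It reuses the 60 ns RAM figure and the 100,000x disk penalty from earlier, and it adds two assumptions of its own that are not from the discussion above: that the training set holds roughly 100 million ratings, and that every rating costs one independent access (real code reading the file sequentially would do far better).

```python
# Worst-case estimate of one pass over the ratings, assuming one
# independent memory access per rating. Sequential disk reads are much
# faster than this, so treat the disk figure as an upper bound.

RAM_ACCESS_NS = 60           # from the text
DISK_SLOWDOWN = 100_000      # "100,000 times longer" from the text
NUM_RATINGS = 100_000_000    # assumption: roughly 100 million ratings

disk_access_ns = RAM_ACCESS_NS * DISK_SLOWDOWN   # 6,000,000 ns = 6 ms


def seconds_per_pass(access_ns, accesses=NUM_RATINGS):
    """Time for one pass if each access costs `access_ns` nanoseconds."""
    return access_ns * accesses / 1e9


ram_pass_s = seconds_per_pass(RAM_ACCESS_NS)
disk_pass_s = seconds_per_pass(disk_access_ns)

print(f"one pass from RAM:  {ram_pass_s:.0f} seconds")
print(f"one pass from disk: {disk_pass_s / 86_400:.1f} days")
```

Under these assumptions a single pass takes seconds from RAM and about a week from disk, and most learning algorithms want many passes.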
I've spent some time pricing out various configurations and comparing options (fortunately, looking is free). Not surprisingly, building my own saves substantially and gives a lot more flexibility. Currently the most sensible route is buying a server board, because that will allow me to upgrade in the future; the board I have in mind can hold two processors but only needs one to run, which is key. By buying only one CPU, I can maximize the RAM for that chip, leaving the option to double the power of the computer down the road when/if prices drop.
Interestingly, as part of my research I checked how the prices of the parts in my current computer have changed (something I've done every so often since the purchase), and the results hold some surprises. For one, almost everything I bought has been discontinued, except for (unsurprisingly) the case, keyboard, fans, and (surprisingly) the RAM. The price of my RAM has fluctuated over time, sometimes more expensive and sometimes less; currently it's 13% cheaper, indicating that I managed to purchase it near its cheapest, but also that there is a lower limit for the price of some things. This has an interesting implication, which is that the price of the proposed components might not drop, but it's a moot point: barring the very low risk of a substantial increase, the price can only stay the same or drop, and either way waiting is a good strategy. I mention this because my second most attractive option (especially in the Tim "The Tool Man" Taylor way, 16 cores... *grunt grunt*) is a 4-processor board using last-generation tech, which required fairly close analysis for comparison. Its advantage is that it has more power and a higher maximum capacity for RAM, at a presently lower cost--the CPUs cost half as much and the RAM is slightly cheaper. 
However, the disadvantages are that the prices most likely won't drop much, the window during which upgrade components will be available is probably rather short, and, practically speaking, more CPUs means less RAM per CPU, which I gather would have an undesirable performance impact (task distribution amongst different CPUs, even between different cores on the same CPU, is presently one of the most pressing challenges in computer science).
As it stands, the system I've imagined has an Intel Xeon E5520 (Nehalem) quad core and 36GB of triple-channel DDR3-1066 RAM (9x4GB DIMMs) on a dual-socket Tyan motherboard with a maximum of 144GB of RAM, for the glorious day when 8GB DIMMs cost a fraction of their current >$1000 price. The price I've projected includes a ballpark figure (I spent enough time comparing the important bits to not want to exhaustively examine the boring bits) for the cost of a 1000+ watt non-redundant power supply, E-ATX case, and single hard drive. And the total is....... $2500, which, if you ask me, is surprisingly agreeable for the power and upgradeability of the included hardware. The cost of RAM is about 43% of this total! For comparison, a similarly equipped Dell PowerEdge T710 workstation with only 32GB of RAM costs about $3000, but comes with support (pssh, tech support is for noobs) and a warranty (just like individually purchased components), which comes with restrictions (you mean I can't crack open my box? Yeah right, get lost) and the capability of Dell to void it, leaving you high and dry (Boooo, Hissss!). Alternatively, a new, student-discounted Mac Pro with 2 CPUs (required to reach the maximum amount of RAM, 32GB, purchased aftermarket for savings) costs $4230, with no further RAM capacity upgrades possible.
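
For anyone who wants to play with the comparison, here is a small Python sketch using only the prices and RAM sizes quoted above. Dividing total price by gigabytes of RAM is my own crude yardstick for this example; it ignores every other difference between the machines.

```python
# Crude comparison of the three options using the figures quoted above.
# "Dollars per GB of RAM" here is total system price divided by RAM
# size, a rough yardstick only, not the price of the RAM itself.

BUILD_PRICE = 2500
RAM_SHARE_PERCENT = 43   # RAM is about 43% of the home-built total

systems = {
    "home-built": {"price": BUILD_PRICE, "ram_gb": 36},
    "Dell PowerEdge T710": {"price": 3000, "ram_gb": 32},
    "Mac Pro": {"price": 4230, "ram_gb": 32},
}

ram_cost = round(BUILD_PRICE * RAM_SHARE_PERCENT / 100)
print(f"RAM in the home-built system: about ${ram_cost}")

dollars_per_gb = {
    name: spec["price"] / spec["ram_gb"] for name, spec in systems.items()
}
for name, cost in sorted(dollars_per_gb.items(), key=lambda kv: kv[1]):
    print(f"{name:20s} ${cost:7.2f} per GB of RAM")

cheapest = min(dollars_per_gb, key=dollars_per_gb.get)
print(f"cheapest per GB: {cheapest}")
```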