Every once in a while, when I need to develop a database for a client and I know the dataset is going to be massive, I don’t even bother with Microsoft SQL Server or MySQL; I jump directly to Amazon’s DynamoDB. It’s by far the fastest database I’ve ever used, in part because you specify (and pay for) how many reads and writes per second you need, and Amazon spreads your data across however many servers it takes to deliver that throughput.
One of the drawbacks of DynamoDB, however, is that it’s completely non-relational: you can’t perform operations such as joins or sorts (except on indexes), and you can only query on indexes. To perform more advanced operations, you have to implement them yourself within your application.
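To make that concrete, a DynamoDB Query request has to match against the table’s key (or an index); anything else means a full Scan or filtering in your own code. Here’s a minimal sketch of the request shape — the `events` table and `user_id` key are hypothetical, just for illustration:

```python
def query_params(table_name, key_name, key_value):
    """Build the arguments for a DynamoDB Query call.

    Query can only match against the table's keys or an index;
    filtering on any other attribute means a Scan (or doing the
    work in your application).
    """
    return {
        "TableName": table_name,
        "KeyConditionExpression": f"{key_name} = :v",
        # DynamoDB's low-level API types every value; "S" marks a string.
        "ExpressionAttributeValues": {":v": {"S": key_value}},
    }

# With a boto3 client, the call would look something like:
#   client = boto3.client("dynamodb")
#   items = client.query(**query_params("events", "user_id", "1234"))["Items"]
```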
The benefit of this is that you can throw as much processing power as you can afford at your specific dataset and crunch massive amounts of data quickly using techniques such as MapReduce clusters.
Since I mostly only need to analyze the data nightly, weekly, or monthly, my typical strategy is to set the write throughput just high enough for the tables to collect data as it arrives, and keep the read throughput zeroed out until it’s time to run the analysis. At that point, I programmatically increase the read throughput to whatever speed I need, spin up my MapReduce cluster (again programmatically), process the data, release the cluster, and zero out the read throughput again. Since Amazon charges for resources by the hour, this is a great way to keep costs down; after all, the only time you need the processing power is when you’re actually analyzing the data.
As computing power gets cheaper by the day, I really think NoSQL databases like DynamoDB will only get more and more popular.