
The main selling point of the various NoSQL products out there today isn't the schemaless storage; it's the ability to grow beyond a single server that's compelling.

228MB of data is nothing, it fits in RAM of any machine. What would the examples in this blog post look like if it was 228GB of data spread across 10 servers instead? How would you grow/shrink such a cluster? How would you perform a query that cuts across servers and aggregates data from some of them? How would you replicate data across multiple servers?



> 228MB of data is nothing, it fits in RAM of any machine. What would the examples in this blog post look like if it was 228GB

You bring up an interesting point. At what data size do the companies using the various NoSQL DBs feel they have to move beyond a traditional RDBMS? I work with some traditional RDBMS stores now that are >500GB in size with tables that add 20M-30M rows/month and querying still isn't an issue. Admittedly it takes expertise to optimize the system and make it work efficiently, but that's going to be the case with any datastore.


I can back this claim up. I work with a system implemented almost entirely in Oracle PL/SQL. Some tables in the system are nearing 800-900 columns, and their size often exceeds 600 GB per table (though there aren't many tables that large). Querying isn't a problem at all. Large schema changes are also mostly painless. The only point at which one has to be really careful is when a schema change requires actual calculations over historic data with an additional write on each record.


A lot of criticisms I read about the immutable limits of RDBMSes turn out, upon closer inspection, to be criticisms of MySQL. Oracle sort of sails past these limitations like a superliner: expensively and gracefully.

Not that I particularly like Oracle as a programmer. I have to check my calendar every time I hit the 32-character limit for names or once again have to write

    some_field_expressing_truth  varchar2(1),
    constraint "y_or_n" check (some_field_expressing_truth in ('Y','N'))
But relational databases already scale to petabytes and millions of transactions per minute. Just not in the opensource world ... not yet, anyhow.


Well, PostgreSQL does scale to millions of simple read-only transactions per minute, at least in some benchmarks. And PostgreSQL 9.2 will scale much better: in a synthetic benchmark it reached about 13 million per minute at high concurrency (compared to about 2.5 million for 9.1).

http://rhaas.blogspot.com/2011/09/scalability-in-graphical-f...


Hence "not yet".

It's been amazing watching the performance surge in postgres these past few years. I wonder if Red Hat or similar will try sponsoring a tilt at the TPC-C crown in future.


Back before Sun bought MySQL they started doing a lot of performance work on Postgres. They didn't go after TPC, but they did show it was only(?) 12% slower than Oracle for SpecjAppServer: http://www.informationweek.com/news/201001901


If you want something to play with that could scale to that size (although maybe not ready for production), see Postgres-XC (http://postgres-xc.sourceforge.net/). This looks promising.


The Wisconsin Courts system has servers running PostgreSQL that they say scale up to millions of transactions per day on 16 cores w/ 128MB RAM.


The trick is, you wouldn't necessarily. If your workload can be handled with one beefed-up lots-of-RAM-and-solid-state-drives server, you could spend your money on two of these instead of having 10 smaller servers, and be perfectly happy with it.

I don't care about what sells NoSQL products to enterprise users - there are lots of workloads where the data fits in RAM (for some reasonable quantity of RAM that money can buy) and you still prefer the durability and consistency that comes with standard SQL databases, even if you denormalize and/or use schemaless (XML or JSON) storage.


Any database worth its salt will use as much memory as you allow it in order to avoid physical IO.

Your point is valid, of course. I just wanted to point this out for people who think every action in an RDBMS results in a costly IO operation.


> The main selling point of the various NoSQL products [...] it's the ability to grow beyond a single server that's compelling.

You make it sound as if these products scale "magically". This is definitely not the case. There's quite a bit of massaging needed to make riak, hbase et al scale beyond certain thresholds - and you better pay close attention to their respective peculiarities.

> How would you perform a query that cuts across servers and aggregates data from some of them?

That's actually not very hard to implement once you understand how it needs to be done. And you do need this understanding with any k/v-store, otherwise you'll be very sad when the magic stops working (as documented time after time for MongoDB).
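The shape of that scatter/gather implementation can be sketched in a few lines. A rough Python sketch with hypothetical shard contents: each shard answers its part of the query independently, and a coordinator merges the partial results.

```python
# Scatter/gather aggregation over key/value shards (hypothetical data layout).
# Each shard is queried separately; partial results are merged at the
# coordinator. This is the pattern you end up hand-rolling on a plain k/v-store.
from collections import Counter

shards = [
    {"user:1": {"country": "US"}, "user:2": {"country": "DE"}},
    {"user:3": {"country": "US"}, "user:4": {"country": "FR"}},
]

def users_per_country(shards):
    total = Counter()
    for shard in shards:                      # scatter: one query per shard
        partial = Counter(v["country"] for v in shard.values())
        total.update(partial)                 # gather: merge partial counts
    return dict(total)

print(users_per_country(shards))              # {'US': 2, 'DE': 1, 'FR': 1}
```

The merge step only works cleanly because counts are associative; aggregates like medians need more care, which is exactly the understanding referred to above.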

Starting out with a known-good, mature store such as PostgreSQL can make a lot of sense when the trade-offs match your usage pattern.


You can buy a 2U server with 256GB of RAM. Think about this.


> The main selling point of the various NoSQL products out there today isn't the schemaless storage, instead it's the ability to grow beyond a single server that's compelling.

That's just not true. Some NoSQL products (HDFS, Cassandra etc) sell on the ability to easily scale out. Others (CouchDB, MongoDB etc) focus on different strengths. CouchDB (for example) doesn't have a real scale-out story at all (beyond "manually shard your data", or try http://tilgovi.github.com/couchdb-lounge/), but that isn't really a problem because its other features sell themselves.


There's now Couchbase which does the sharding for you.


I think for schema-less storage systems we know two major competitors in the market: MongoDB and CouchDB.

- CouchDB by default has no ability to scale out beyond master-master replication. The solution? Sharding for distribution and replication for reliability, or BigCouch with prayers that it won't trash your data.
- MongoDB is known to stand on its sharding server, mongos, and you have to issue sharding commands whenever you scale out and rebalance. So it's again sharding for distribution and replication for reliability!
- Postgres is more reliable in storage than both, if you do the same sharding and replication yourself :).

I am in no way suggesting that 228MB is enough data! I do hate MongoDB for being RAM hungry and turning the same 115k tweets into gigabytes of memory (256MB is just a starter in a 100-course meal for MongoDB). Again, Facebook ultimately prefers its data to go to MySQL, and they run their largest deployment at 600 shards for a reason!
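The "issue sharding commands whenever you scale out and rebalance" pain is easy to demonstrate. A hypothetical Python sketch of naive hash(key) mod N placement (keys and counts invented), showing why adding one shard forces most keys to move:

```python
# Naive manual sharding: place each key by hash(key) mod N.
# The catch: changing N remaps most keys, which is what makes
# growing or shrinking such a cluster painful without extra machinery.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"tweet:{i}" for i in range(1000)]
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}   # grow the cluster by one shard
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/1000 keys moved")            # roughly 4 out of 5 keys move
```

Consistent hashing (what Riak and Cassandra use internally) exists precisely to shrink that movement to roughly 1/N of the keys instead.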


> I think for schema-less storage systems we know 2 major competitors in market. MongoDB and CouchDB.

Um, what about Riak, Cassandra, Voldemort, and HBase? (I'm sure there are a bunch more I'm forgetting)


Cassandra, HBase -> column-oriented, not schemaless! Riak -> actually a key/value store with link walking; you can write map/reduce for that document-oriented feel, but I won't bother writing a map-reduce job just to fetch documents with particular values. Voldemort -> distributed key/value.

Again, you are missing the point of maturity and a proven user base; it's comparing apples with bananas! Try putting the same joins and relations into the NoSQL stores you are bragging about and see how quickly they stop scaling! Want an example? Neo4j!


> Actually a key/value store with link walking, you can just write map reduce for that document oriented feel! Again I won't bother writing a map-reduce job just to fetch out document with particular values

I thought CouchDB required this as well?


Schemaless doesn't necessarily mean "document-oriented." Sparse-column / ColumnFamily databases like Cassandra are a lot closer to "schemaless" than they are to "traditional rdbms schema."
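A rough sketch of that sparse-row idea, with made-up row keys and column names: each row carries its own map of columns, so no fixed schema is shared across rows in the same family.

```python
# Sparse ColumnFamily-style rows (illustrative, not a Cassandra client):
# each row is its own map of column name -> value, and two rows in the
# same family need not share any column names at all.
rows = {
    "user:alice": {"email": "a@example.com", "last_login": "2011-10-01"},
    "user:bob":   {"phone": "555-0100"},   # no columns in common with alice
}

def columns(row_key):
    """Column names present for one row, in sorted order."""
    return sorted(rows.get(row_key, {}))

print(columns("user:alice"))   # ['email', 'last_login']
print(columns("user:bob"))     # ['phone']
```

Absent columns simply don't exist for a row, rather than being NULLs in a fixed-width schema, which is why "schemaless" fits better than "traditional rdbms schema."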


What about HandlerSocket for MySQL? You can have the joins and relations there if you need them.


> Riak -> Actually a key/value store with link walking, you can just write map reduce for that document oriented feel! Again I won't bother writing a map-reduce job just to fetch out document with particular values.

http://howfuckedismydatabase.com/nosql/

Quite apt.


Personally I find writing MapReduce jobs (in JavaScript no less) to be unbelievably clean and easy when your stack is Riak + Node.

Of course if you've been using SQL for years then this probably sounds difficult in comparison. Except what if you're a JS guy with zero SQL experience?

"Ok, it's a database. How do I query it?"

"You learn this completely new language and dynamically compile your questions down to it, but you have to be really careful because the process is notorious for being a first-class attack vector."

"Did you just tell me to go fuck myself?"

I'm not trying to say anything about the merits of SQL. I'm just pointing out that it's a matter of perspective.
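For readers who haven't seen one, the shape of such a map/reduce job can be sketched in Python (Riak's phases are actually JavaScript, and the documents and field names here are invented): map runs once per stored object and emits values, reduce folds the emitted values into a result.

```python
# The shape of a Riak-style map/reduce query, sketched in Python.
docs = [
    {"author": "alice", "likes": 3},
    {"author": "bob", "likes": 5},
    {"author": "alice", "likes": 2},
]

def map_phase(doc):
    # runs once per object; emits (key, value) pairs
    return [(doc["author"], doc["likes"])]

def reduce_phase(emitted):
    # folds all emitted pairs into per-author totals
    totals = {}
    for author, likes in emitted:
        totals[author] = totals.get(author, 0) + likes
    return totals

emitted = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(emitted))   # {'alice': 5, 'bob': 5}
```

To someone fluent in JS, that reads as "just code"; to someone fluent in SQL, `SELECT author, SUM(likes) ... GROUP BY author` reads the same way, which is the perspective point being made.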


I'm not a big fan of NoSQL, but sometimes I want to write a query and sometimes I just want to write some code, and I could see the appeal of doing it in JS, and especially some of the languages that target JS.

The thing about SQL as an attack vector is frustrating because it (usually) doesn't need to be: use prepared statements and let the driver handle value substitution for you. It's quicker and easier than escaping everything.
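A minimal sketch of that advice using Python's built-in sqlite3 driver (the table and input are invented): the placeholder keeps a hostile input as plain data instead of letting it become executable SQL.

```python
# Parameterized statements: the driver performs value substitution itself,
# so injection-shaped input is stored literally rather than executed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

hostile = "x'); DROP TABLE users; --"
conn.execute("INSERT INTO users (name) VALUES (?)", (hostile,))  # safe

stored = conn.execute("SELECT name FROM users").fetchone()[0]
print(stored == hostile)   # True: stored as a literal string, table intact
```

String-concatenating `hostile` into the INSERT is what creates the attack vector; the `?` placeholder removes it and is less work than escaping by hand.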



