New year; renewed attempt to kickstart blogging. 2014 has been quite awesome for me. I’ve gone deeper and deeper into data – data engineering, data science, data analytics, data munging, R, Spark, Cassandra, machine learning, statistical inference… yum yum… yum yum yum. I’ve been meaning to write about these lovely adventures, but ran into perfectionist syndrome. To hell with it… I’ll put out all the imperfections coz why not? First up, Cassandra. Here’s a 100-foot overview of Cassandra. Yes, 100 feet… not 10,000 feet.

Cassandra is a partitioned row store. Yes, you read that right… it’s not a column DB like [insert random NoSQL names here that are column stores]. It’s a partitioned row store. What does that mean? It means Cassandra stores data in “rows”, and these “rows” are stored in “partitions” distributed across your cluster. Ooh… “cluster”… yes, Cassandra can run on a single node, but realistically you’ll be looking at running multiple nodes in a cluster. At least five (although you can get away with four) to benefit from its fault tolerance. More on that later. Or now.

Cassandra has fault tolerance and distribution built in. You get it for free. Well, sort of. If you listen to what it wants. Cassandra’s fault tolerance is different to, say, Mongo’s, or that of any other master-slave replication system. Its fault tolerance comes from its Dynamo roots. Oh, and since I’ve not mentioned this: Cassandra is essentially a bastard mutt derived by crossbreeding Dynamo with BigTable, and it has gotten quite awesome in its own right. So where was I… yes, fault tolerance. Unlike Mongo, Cassandra clusters are masterless. Each node is an equal member of the cluster. There are no masters here.

Remember I said Cassandra rows are stored in partitions? Well, all rows in a partition are stored together on one node (and its replicas). Replicas? Didn’t I say there were no masters? Well, there aren’t. But for fault tolerance in a distributed system, data does need replication. Cassandra manages this for you as well, and it does so based on your keyspace definition. When creating a keyspace (the container your tables live in), you give it a replication factor (RF). This is the number of times each partition (and as such, each row in said partition) is going to be replicated across your cluster. RF = 1 means each partition will be on one and only one node. Note, this doesn’t mean all rows are going to go to the same node. Your rows are hashed based on a “partition key” and scattered across your cluster. RF = 2 means each partition will be on two nodes. RF = 3 means they’ll be on three nodes. Since the data is replicated automatically, you can survive nodes going down. We’ll go into detail about consistency, and read and write consistency levels, in the future; for now, I’ll just say it’s possible to get very good – near perfect – consistency for most needs. Even when dealing with failures… to a certain extent.

A server going down for SQL Server – even if replicated – is a headless-chicken-mode phenomenon. Not to mention the fact that it’s a headache to set up SQL replication in the first place. Enough about SQL… it has its uses. I’m not writing about SQL right now. A server going down for Cassandra (usually) isn’t an immediate concern. If it goes down at 3am, depending on your cluster, you may wait till the next day or the day after to fix it. Or blow it away and add a new server to the cluster. Cassandra will self-heal. It’s nice that way. (Riak is similar in that respect.)
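
To make that concrete, here’s a minimal CQL sketch (the keyspace name is made up):

-- Every table created in this (hypothetical) keyspace inherits RF = 3,
-- i.e. each partition lives on three nodes
CREATE KEYSPACE shop
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- For multiple datacentres you'd use NetworkTopologyStrategy instead, e.g.
-- {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2}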

Cassandra has insane ingestion performance. It works off a commit log: a write is appended to the commit log, updates an in-memory structure (the memtable), and returns immediately. This leads to some crazy numbers for writes, while maintaining durability. You know… not losing writes, and the other party tricks some datastores pull to improve benchmark scores (I hear WiredTiger is pretty good, btw). Read performance is also very, very good… as long as your schema is right for your queries. Usually, if your schema isn’t right, you won’t have slow queries – you just won’t be able to query what you need at all. You’ll either get errors or timeouts. Cassandra simply won’t allow you to do queries that don’t play nice in a distributed environment. Think about it… try joining two 500 GB tables across a five-node cluster and expecting very fast performance. Maybe if the stars align, and you have an index which uses very little disk space per entry… yeah… right.
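
If it helps, here’s a toy Scala sketch of that write path. Every name here is made up; the real implementation is Java and vastly more involved:

import scala.collection.mutable

// Illustrative only: a toy log-structured write path
case class Mutation(partitionKey: String, clusteringKey: String, value: String)

object WritePathSketch {
  val commitLog = mutable.ArrayBuffer.empty[Mutation]             // stands in for an append-only disk log
  val memtable  = mutable.TreeMap.empty[(String, String), String] // stands in for the sorted in-memory memtable

  def write(m: Mutation): Unit = {
    commitLog += m                                        // 1. sequential append: cheap, and durable on disk
    memtable((m.partitionKey, m.clusteringKey)) = m.value // 2. update the in-memory sorted structure
    // 3. acknowledge the client; memtables get flushed to immutable SSTables later
  }
}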

Schema? SCHEMA? Did I say SCHEMA? Yes, yes and yes. Cassandra “requires” you to define schemas for your tables. This is actually a good thing. Your data always has a schema, whether you declare it or not. Having a schema allows various optimisations, sanity checks, tooling, etc. etc. Think static vs. dynamic languages. Oh… so you like Clojure? Or maybe Erlang? Well, feel free to implement a compiler by writing an incomplete set of tests. And I hear Dialyzer is quite important these days :P Cassandra does require schemas, and tables are, at a glance, similar to SQL tables. However, Cassandra rows can have various types of columns, including collection types like lists, sets and maps, and collections can contain user-defined types (UDTs). So, you can have a Person table with an Addresses column which is a map<text, address>, with entries for “home”, “work” and “billing”. This gives you a lot of the flexibility of “schemaless” NoSQL while maintaining schema support.
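
Here’s roughly what that Person example looks like in CQL (the field names and data are mine; note a UDT inside a collection has to be frozen):

CREATE TYPE address (
  street   text,
  city     text,
  postcode text
);

CREATE TABLE person (
  id        int PRIMARY KEY,
  name      text,
  addresses map<text, frozen<address>>
);

INSERT INTO person (id, name, addresses)
VALUES (1234, 'Jane', {
  'home': {street: '1 High Street', city: 'London', postcode: 'SE10 0QL'},
  'work': {street: '2 Low Street', city: 'London', postcode: 'E14 5AB'}
});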

Cassandra comes with its own query language – CQL (almost narcissistic, that). It’s kind of like a subset of SQL, and appears similar. At first. And then not so much, but a bit. For example, you can write a query like:

SELECT * FROM Person WHERE id = 1234 AND postcode = 'SE10 0QL';

However, you can’t do any joins. Range queries are limited. Secondary indexes exist, but unless your query also hits a partition, they aren’t really that great (unless you’re using Spark). So, querying capabilities are a lot more restrictive than in SQL. However, if your schema meets your needs… oh boy, will it be fast! This leads to the concept of query-driven modelling. Unlike relational modelling, where you come up with an ERD, in Cassandra you usually go through three “levels” of modelling to ensure your queries are satisfied by the smallest number of tables necessary (but no fewer). I’ll go into query-driven modelling in the future.
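
As a quick illustration (table and names made up), the table is shaped around the query you need to serve:

-- Query to satisfy: "latest events for a given user"
CREATE TABLE events_by_user (
  user_id    uuid,
  event_time timestamp,
  event_type text,
  payload    text,
  PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Fast: hits exactly one partition
SELECT * FROM events_by_user
WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204
LIMIT 100;

-- Rejected: no partition key, so it would mean scanning the whole cluster
SELECT * FROM events_by_user WHERE event_type = 'login';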

Cassandra 3.0 will address some of these concerns… global indexes and UDFs are things to look forward to. But some things still won’t work. Like joins.

Cassandra can be used as an OLTP database. Fast writes, fast reads, fault tolerance, abstraction via a query language… why not? However, analytics workloads, ad hoc queries, etc. won’t really work well at all. Sometimes you can get to what you want by being clever, filtering on the client, and whatnot. But for joined-up data, data munging… OLAP… Cassandra on its own won’t really cut it. This is where Spark comes in. Apache Spark is an in-memory processing engine that runs standalone, on Mesos, or on YARN. And it has a kick-arse Cassandra connector. Yes, currently the Spark Cassandra connector requires Scala – but only basic Scala. And it’s not too bad. Even if it uses implicits (eugh!). Spark lets you do some pretty cool stuff with Cassandra data (and data from other sources – Hadoop, SQL, files, etc. etc.). You’ll likely have a large amount of data in Cassandra to meet your OLTP queries. For OLAP queries that Cassandra can’t fulfil, there’s Spark!
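
A little taste of the connector (a sketch, reusing the made-up events_by_user table from earlier; the host and app name are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds cassandraTable to the SparkContext

object OlapSketch extends App {
  val conf = new SparkConf()
    .setAppName("cassandra-olap-sketch")
    .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)

  // Full-table scan, distributed across the cluster: fine for OLAP,
  // and exactly the sort of thing you wouldn't do as an OLTP query
  val events = sc.cassandraTable("shop", "events_by_user")
  val logins = events.filter(_.getString("event_type") == "login").count()
  println(s"Total logins: $logins")
}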

Cassandra has drivers for various languages. I’m currently doing a lot of work in Scala and C# / F#. The Java and .NET drivers are quite good (the .NET one improved MASSIVELY last year). It’s a cinch to use. Really. Once you know what’s going on under the hood.
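
For instance, talking to Cassandra from Scala via the DataStax Java driver looks something like this (contact point, keyspace and query are placeholders carried over from the earlier made-up examples):

import com.datastax.driver.core.Cluster

object DriverSketch extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect("shop")   // hypothetical keyspace from earlier
  val row = session.execute("SELECT name FROM person WHERE id = 1234").one()
  println(row.getString("name"))
  cluster.close()
}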

Oh yeah… one nice way I’ve heard Cassandra explained is that it’s a map of sorted maps (or, in .NET terms, a Dictionary of Dictionaries where the inner dictionaries are sorted by key).
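
In purely illustrative type terms (this is a mental model, not a real API): the partition key picks the outer map entry, and clustering columns sort the inner one.

object MapOfSortedMaps {
  type PartitionKey  = String
  type ClusteringKey = String
  type Columns       = Map[String, Any]

  // Outer map distributed by hashing the partition key;
  // inner maps kept sorted by clustering key
  type Table = Map[PartitionKey, scala.collection.SortedMap[ClusteringKey, Columns]]
}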

Use cases are quite varied. It works very well as a key-value store, or a more advanced key-value store (map of sorted maps!), for time series data, for storing events, aggregates (DDD terminology), viewmodels, etc. etc. It fits well in a lot of scenarios, and that’s one of the reasons I like it so much. It doesn’t work very well, however, if you plan on storing data only once in one table and running arbitrary queries on it (like you can do in SQL). If you want relational data with relational constraints and consistency, use a relational database ;)


So, in a nutshell – a very fast partitioned row store. Has a schema and a query language. Has fault tolerance. Lets you do cool stuff. Can do even more cool stuff with Spark. Easily accessible on many platforms.

Anyway, that’s enough of an overview. In future posts, I’ll target specific areas, and answer common questions.