Friday, March 2, 2012

Understanding NoSQL and why it is important for middleware folks to know it!

Team, this week’s post is, in my humble opinion, a rather odd one: in a community made up largely of Application Infrastructure and middleware folks, we find ourselves discussing data! I am compelled to write it anyway because the topic is becoming relevant, all the more so as we are constantly challenged with new architectures and designs, and as middleware and “data-centric middleware” take centre stage.

With mobile and social networks there has been exponential growth in data. This is not new; we have known it for several years. The challenge we face now is what to do with this growth, and how. Over the past decade we have advised our clients to bring data closer to the application for performance and scalability. Our clients have listened and, either out of necessity or by design, have made a conscious decision to make that transition. And now we are at the crossroads where middleware meets data, so it is relevant!

So, broadly, we should use the following technical criteria to distinguish the solutions:

  1. Data access and query method – native API, MapReduce
  2. Language support – Java, C++, etc.
  3. Protocol support for access and integration – HTTP/JSON, REST, etc.
  4. *Important* - Non-functional requirements: security, scalability and performance.

In general:

  1. All of these solutions address, or claim to address, scalability; they are highly scalable distributed database systems.
  2. In-memory access is faster than disk access. This is just physics, so any solution providing in-memory access will have comparable performance outcomes.
  3. Where we should distinguish ourselves (on valid technology comparisons) is “elasticity”: the ability to dynamically handle failures and membership changes on the instances hosting data, along with the non-functional requirements of security, scalability and performance.
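
To make the elasticity point concrete, here is a minimal sketch of consistent hashing, one common way a distributed store limits data movement when an instance fails or joins. It is not how any particular product does it; the class name `HashRing`, the node names and the virtual-node count are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Stable hash so placement does not change between runs.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring (hypothetical; names and sizes are illustrative)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) virtual nodes
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node appears many times on the ring to even out load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def owner(self, key):
        # First virtual node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"session-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.remove("node-c")  # simulate a failed instance
after = {k: ring.owner(k) for k in keys}

# Only keys that lived on the failed node need a new owner.
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved")
```

The design choice worth noticing is that keys owned by the surviving nodes stay where they are, so a membership change only disturbs the failed node's share of the data.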

What is NoSQL?

This movement began around 2009. We have had ObjectGrid/WebSphere eXtreme Scale (OG/WXS) since 2005-2006, so I am sure the seeds were sown around the same time as the emergence of the in-memory data grid (IMDG). I would say NoSQL means next-generation databases, mostly associated with terms like:

  1. being non-relational,
  2. distributed,
  3. open-source and horizontally scalable, etc.

The original intention was to create highly scalable distributed database systems that can handle massive amounts of data without the usual challenges of traditional relational database systems (things like table scans and locks).

Often more characteristics apply, such as:

  1. schema-free,
  2. easy replication support,
  3. simple API,
  4. eventually consistent / BASE (not ACID),
  5. a huge amount of data, and more…
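
As a rough illustration of "eventually consistent", the sketch below keeps two replicas that accept writes independently and later reconcile with a last-write-wins rule. This is only one of several reconciliation strategies, and the `Replica` class, the key `cart:42` and the integer timestamps are assumptions made for the example.

```python
class Replica:
    """One copy of the data; each value carries a (timestamp, writer) version."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value, timestamp):
        self.data[key] = ((timestamp, self.name), value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def merge_from(self, other):
        # Anti-entropy step: keep whichever version is newer for each key.
        for key, (version, value) in other.data.items():
            if key not in self.data or self.data[key][0] < version:
                self.data[key] = (version, value)


a, b = Replica("a"), Replica("b")
a.write("cart:42", ["book"], timestamp=1)
b.write("cart:42", ["book", "pen"], timestamp=2)

# Before the replicas exchange state, a reader of replica "a" sees old data.
stale = a.read("cart:42")

# After exchanging state in both directions, the replicas agree.
a.merge_from(b)
b.merge_from(a)
print(stale, a.read("cart:42"), b.read("cart:42"))
```

The trade-off is that reads may briefly return stale data in exchange for replicas that never block on each other.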

Broadly, this space is classified into the following:

In-memory database: This is essentially relational data stored in memory. Examples: solidDB, TimesTen.

In-memory data grid: Provides a set of interconnected Java processes that hold the data in memory, thereby acting as shock absorbers for the back-end databases. This not only enables faster data access, since the data is read from memory, but also reduces the stress on the database. Examples: WXS, Coherence, Terracotta, etc.
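
The "shock absorber" effect can be sketched with a simple read-through cache in front of a stand-in database. This is a single-process toy, not a real data grid; `SlowDatabase`, `ReadThroughCache` and the sample record are hypothetical names used only for illustration.

```python
class SlowDatabase:
    """Stand-in for the back-end database; counts how often it is hit."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def load(self, key):
        self.reads += 1
        return self.rows.get(key)


class ReadThroughCache:
    """Serves reads from memory and only falls back to the database on a miss."""

    def __init__(self, backend):
        self.backend = backend
        self.store = {}

    def get(self, key):
        if key not in self.store:
            self.store[key] = self.backend.load(key)
        return self.store[key]


db = SlowDatabase({"customer:1": {"name": "Ada"}})
cache = ReadThroughCache(db)

# One hundred application reads result in a single database read.
for _ in range(100):
    customer = cache.get("customer:1")

print(customer, "database reads:", db.reads)
```

A real grid adds partitioning, replication and eviction on top of this idea, but the load reduction on the back end comes from the same pattern.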

Wide column store: As the name suggests, a DDMS that stores data by columns rather than by rows. The advantage of this model is processing or computing over large volumes of similar items. e.g.: Hadoop (HBase), Cassandra, Amazon SimpleDB
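
A small sketch of the row-versus-column idea, assuming a made-up set of purchase records: the same data is held once as rows and once as columns, and an aggregate over one attribute only has to touch that attribute's column.

```python
# Row layout: one record per entry, all attributes together.
rows = [
    {"user": "u1", "country": "IN", "amount": 120},
    {"user": "u2", "country": "US", "amount": 80},
    {"user": "u3", "country": "IN", "amount": 200},
]

# Column layout: one list per attribute, values of the same kind stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Summing one attribute reads a single column instead of every full record.
total = sum(columns["amount"])
print(columns["country"], total)
```

Real stores layer compression, column families and distribution on top of this, so treat it as a picture of the layout rather than of any product's storage engine.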

Document store: This is a DDMS designed for document-oriented or semi-structured data. It revolves around the notion of a document, which can be anything (the binary form of a PDF, a Word doc, etc.), encoded in an object notation such as JSON, BSON or even XML. e.g.: MongoDB, CouchDB
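
Here is a minimal sketch of what "schema-free documents" looks like in practice, using plain Python dictionaries and JSON. The `find` helper and the sample documents are hypothetical and are not the API of MongoDB or CouchDB.

```python
import json

# Documents in one collection need not share the same fields.
collection = [
    {"_id": 1, "type": "invoice", "customer": "Acme", "total": 250.0},
    {"_id": 2, "type": "invoice", "customer": "Globex", "total": 90.0, "tags": ["rush"]},
    {"_id": 3, "type": "note", "text": "Call back on Monday"},
]


def find(docs, **criteria):
    """Return documents whose fields match all the given criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


invoices = find(collection, type="invoice")

# Each document round-trips through JSON, the usual wire and storage encoding.
encoded = json.dumps(collection[1])
print([d["_id"] for d in invoices], encoded)
```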

Key-value or tuple: A general-purpose DDMS that, like WXS, can store data in a distributed memory system. As the name suggests, data is stored as key-value pairs or tuples. This is one of the most popular kinds of DDMS, used by many popular names like YouTube, Facebook, Zynga, Twitter, etc. e.g.: Dynamo, Azure Table Storage, memcached, Berkeley DB

Graph database: This is an interesting paradigm, as it provides a storage system for graphs, built on index-free adjacency: each element holds direct references to its neighbours. This approach is faster for associative data sets, for example storing a group and its members, so these systems typically do not require expensive joins. e.g.: InfoGrid, Bigdata
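
The group-and-members example can be sketched with nodes that hold direct references to their neighbours, which is the gist of index-free adjacency. The `Node` class and the names used here are assumptions for illustration, not a real graph database API.

```python
class Node:
    """A graph node that keeps direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []

    def link(self, other):
        # Undirected edge: each side can reach the other without a lookup.
        self.neighbours.append(other)
        other.neighbours.append(self)


group = Node("middleware-team")
alice, bob = Node("alice"), Node("bob")
group.link(alice)
group.link(bob)

# Members of the group: follow references, no join or index scan.
members = [n.name for n in group.neighbours]

# People who share a group with alice: two hops along the same references.
colleagues = sorted(
    {m.name for g in alice.neighbours for m in g.neighbours if m is not alice}
)
print(members, colleagues)
```

In a relational model the second query would typically be a self-join on a membership table; here it is a walk over pointers.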

Object database: This is not new and has been in existence since the late 1970s. These are also called object-oriented databases. They never really took off and, in our context, are not an immediate threat. e.g.: Objectivity, GemStone

The idea: By caching strategically at many tiers, we offload processing to those tiers and dedicate middleware processing ONLY to where it matters most. When the ‘window shoppers’ start to mean business, we dedicate our cycles to those clients and serve them better, with an enhanced experience.

Challenge: I discussed the design phase; the challenge there is to ensure an application design that is modular enough to enable these various tiers of caching and still present a unified front, where the end user is oblivious to the inner workings of an application whose content is derived from various layers. An intentional design will isolate content from business logic, thus enabling caching at various tiers.

References:

  1. http://nosql-database.org/
  2. http://cassandra.apache.org/
  3. http://hadoop.apache.org/
  4. http://www.mongodb.org/
  5. http://www.slideshare.net/harrikauhanen/nosql-3376398
  6. http://couchdb.apache.org/