Cloud and NoSQL: A use case
This article is inspired by the many questions I see in forums about what is the best language, NoSQL database, or cloud service to use today.
A frequent reply to these questions is “What is your exact goal and use case?”, which is something people often find difficult to answer. Writing a use case is indeed a tricky exercise. How much detail should you go into? What is meaningful and what is not? How do you make it fit inside a forum post? And how do you keep your startup's end goal private?
Another recurring and objective reply is “Experiment to find what works best for you”. Still, someone who is new to the Cloud and NoSQL space cannot experiment with everything. Wanting to narrow down the experimentation space is legitimate, which brings us back to the question of the use case.
I therefore thought that I would share my personal use case, and how I decided to experiment with Clojure, CouchDB, Heroku and PostgreSQL before anything else. My hope is that it will help you formulate your own use case and put you on the track of finding your first tool set, in particular by showing how a use case and a first technological bet connect.
A word of warning: objectivity is not the point of this article. Everything that follows is personal and subjective. It is based on my very own experience (or lack thereof), understanding (or lack thereof) and intuition (or lack thereof). This is intentional. Making your first bet should involve a huge amount of your own subjectivity and intuition. Experimentation will bring objective answers.
In the end, writing about this is a risky exercise. At best I will change my mind. At worst I got it all wrong... but I don't think so.
Interestingly, I found that my needs could be expressed in pretty generic terms, and without revealing the details of what I am up to; so if you want to ask for help on a forum, there is a lot you can say without selling your soul.
My use case
I have a few projects in the pipeline now, and here is my reality in a nutshell.
- 75% of my projects are data harvesting, aggregation and analysis/learning projects.
- The shape of the data is data source driven / supplier driven. The data are structured but heterogeneous, in the sense that the same data item may come in different forms from different sources (e.g. the same dataset from different national government offices); see the sketch after this list.
- Data can be stored as text.
- I value rapidity of development, robustness, data and process redundancy, ease of data harvesting.
- Data harvesting and pre-processing should happen in the cloud, research and prototyping on an in-house box, and production crunching and deployment back in the cloud.
- Research data on an in-house box should ideally mirror production data in the cloud.
- 25% of my projects are social web applications
- The shape of the data is application driven / consumer driven.
- I value rapidity of development, ease of deployment, robustness
- The user base will not be gigantic.
- The ideal software development environment should work for data exploration, prototyping and large scale deployment (think Matlab and .Net in one).
- Some projects may reach “big data” scale, so the platform should allow replicating and sharding data.
- Administration should be minimal, i.e. doable by one or two developers alongside development work.
- Managing several environments should be easy (dev, research, test, prod).
- Cost should be minimal on day 1, and I choose linear costs over logarithmic costs (i.e. I would rather pay in proportion to usage than pay a lot upfront for costs that flatten out later).
- The solution should be part of a recognised and standard ecosystem (like Java, .Net, AWS, Azure, Hadoop).
- I want to have a good understanding of every bit of the platform on day 1. My IT experience is in desktop, server, and web development with Java, .Net, MS SQL Server and a bit of Oracle, and data analysis with SQL Server and Matlab.
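To make the point about heterogeneous, source-driven data concrete, here is a small hypothetical sketch (in Clojure, since that is where I am heading): the same figure as delivered by two different offices, kept in the shape each source provides. Every name and value below is invented for the illustration.

```clojure
;; The same logical record harvested from two hypothetical sources.
;; Field names and values are invented; the point is that each document
;; keeps the shape its source gave it, rather than being normalised upfront.
(def record-from-office-a
  {:source    "office-a"
   :country   "FR"
   :period    "2011-Q3"
   :indicator "unemployment-rate"
   :value     "9.7"})            ; delivered as text

(def record-from-office-b
  {:source "office-b"
   :geo    {:iso "FR" :name "France"}
   :date   {:year 2011 :quarter 3}
   :series "UNEMP_RATE"
   :obs    "9.7"})
```

A document store lets me keep both shapes as they are and reconcile them later at analysis time, which is exactly the cut in harvesting work I am hoping for.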
If you are an online shop, a news publisher, an online RPG, or a S&P500 company, then bits of your equation should be different from this. You might have a supply of developers and administrators, you might be working with in-house data only, you might soon be snowed under petabytes of data, you might be streaming digital data.
A McKinsey analysis would be more structured and polished, but this use case is really good enough for the purpose of narrowing which technologies to start experimenting with.
What about the technologies I know most?
Technically, these are things that Azure, .Net, SQL Server and Matlab would do for me. But I think they would benefit projects that are a bit more mature than mine, if only because of the higher upfront costs that they entail. Also:
- I would prefer a single environment for research, development and deployment.
- My intuition is that deploying and administering data and processes redundantly across the cloud and a local environment will require more skill and time than I would like.
- I think my data will need a fair bit of work to normalise and store in SQL Server; some NoSQL document stores might allow me to cut down on data harvesting development.
- I won't use some of Microsoft's (and Matlab's) killer features: rich desktop/office applications, highly accessed tiered systems and transactional databases, corporate standards, corporate support.
“Plan A”
What follows is my first shot at an overall platform that could work for my projects, and reasons why. The choice is very much an overall one, and not just a choice of individual parts, and the process to arrive there was very iterative.
I will also describe technologies that I did put on the side for now. I may well end up using them as plan B if plan A doesn't work as well as I hope.
Software: Open Source
- reduced upfront cost
- freedom to experiment
Cloud infrastructure: Amazon Web Services
- hosts and connects a competitive ecosystem of services, many of which operate on a freemium model.
- vision of modular and elastic services.
Hosting: Heroku
- Runs on Amazon Web Services
- Clear deployment and scaling model: web and worker processes (see the sketch after this list)
- Hub for a wealth of add-on products that fit my needs, like databases, caching, messaging...
- Github based development and deployment workflow
- Minimal administration
- Supports Clojure applications
- Freemium model (same for many add-ons too)
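As a rough illustration of the web & workers model, here is a minimal sketch of a Clojure web process using Ring's Jetty adapter. The namespace and handler are invented; Heroku passes the port to bind through the PORT environment variable, and a Procfile line such as `web: lein run -m myapp.web` would declare it as the web process.

```clojure
(ns myapp.web
  (:require [ring.adapter.jetty :as jetty]))

(defn handler
  "Minimal Ring handler; a real application would dispatch on the request."
  [request]
  {:status  200
   :headers {"Content-Type" "text/plain"}
   :body    "hello from the web process"})

(defn -main [& args]
  ;; Heroku tells each dyno which port to bind through the PORT variable.
  (let [port (Integer/parseInt (or (System/getenv "PORT") "8080"))]
    (jetty/run-jetty handler {:port port})))
```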
Preferred to:
- EC2/RDS: too administration intensive for now. Will go there when I have clear needs for elasticity in the EC2 way.
- Rackspace and similar: too administration intensive for my needs.
SQL: PostgreSQL
- Heroku's choice of hosted database
- I feel at home there coming from Microsoft SQL Server: features, stored procedures (although in beta at Heroku at the time of writing)
Preferred to:
- MySQL: I am not entirely comfortable with the modular storage engine system. I think it will take me longer to make InnoDB sing, I don't really need the MyISAM storage engine, and the “Editions” packaging adds complexity to an already complex space. One also needs to know how to read between the lines of Oracle-style marketing, and my time and energy are in too short supply for that (acid test: I'll come back when Oracle's TCO is also compared to MySQL's).
- Microsoft SQL Server: too costly for now, and not straightforward to run in the AWS ecosystem.
- Oracle: too complex and costly for my needs.
- Amazon RDS: it comes down to MySQL or Oracle, covered above.
NoSQL: CouchDB
- Candidate for main data warehouse.
- I understand all of it, and see the benefit of all of it:
- Document database, JSON
- MVCC and Replication
- Map/Reduce and View indexing (see the sketch after this list)
- Web integration (REST, CouchApps)
- CouchDB's limitations are honestly and clearly explained; best in class in the NoSQL space in this respect.
- CouchDB's functionalities complement relational databases for what I need to do.
- Focus on large, atomic reads and non-competing writes.
- Simple out-of-the-box mirroring, subject to how well replication works over a slow connection (see the replication sketch after the “Preferred to” list below).
- If I got it right, a JSON document could contain the code to map itself, which would be a killer denormalisation feature for my heterogeneous data (will write about this when I make it work)
- CouchDB feels like a mini Hadoop over JSON documents, with master-master replication and map/reduce.
- CouchDB is scalable enough for my projects
- Administration seems low and tuning straightforward
- Creating a new test or development database from live data is trivial.
- Creating data harvesting processes that are redundant and partition-tolerant should be trivial.
- Couchapps are a clever deployment model for simple web applications.
- Hosted by Iris Couch and Cloudant (Heroku add-on), which both have a freemium model.
- CouchDB is an Apache project, and a recognised actor of the big data world
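To show what the document-plus-views model looks like in practice, here is a sketch of storing one harvested record and defining a map/reduce view, talking to CouchDB's REST API directly (with clj-http and cheshire rather than Clutch, just to keep the HTTP calls visible). The database name, document and view are invented for the illustration.

```clojure
(ns harvest.couch-sketch
  (:require [clj-http.client :as http]
            [cheshire.core :as json]))

(def db "http://127.0.0.1:5984/harvest")  ; local CouchDB, database name invented

;; Create the database (ignore the error response if it already exists).
(http/put db {:throw-exceptions false})

;; Store a harvested record exactly as it came from its source
;; (first run only; updating it later would need the document's _rev).
(http/put (str db "/office-a-FR-2011Q3")
          {:body         (json/generate-string {:source    "office-a"
                                                :country   "FR"
                                                :indicator "unemployment-rate"
                                                :value     "9.7"})
           :content-type :json})

;; A design document holding a map/reduce view: count documents per source.
;; The view code is plain JavaScript stored as strings, as CouchDB expects.
(http/put (str db "/_design/stats")
          {:body         (json/generate-string
                          {:views {:by-source
                                   {:map    "function(doc) { if (doc.source) emit(doc.source, 1); }"
                                    :reduce "_sum"}}})
           :content-type :json})

;; Query the view, grouped by source.
(-> (http/get (str db "/_design/stats/_view/by-source")
              {:query-params {"group" "true"} :as :json})
    :body
    :rows)
```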
Preferred to:
- SQL servers: I'd rather avoid normalising research data if I can get away with it, and replication may not be as easy. And there are indeed scenarios where document databases seem to offer better productivity and easier horizontal scaling (e.g. the typical blog post example).
- MongoDB: I don't need write speed for now, and I don't understand Mongo's risks as well as CouchDB's. Mongo's sharding logic is easy to understand but seems fairly elaborate to implement, and doesn't fit my current needs as naturally as CouchDB.
- Couchbase: a marriage of CouchDB and Membase. I guess it should offer the best of both products, but what the product does and doesn't do is a bit confusing for now. I will definitely look at it, though, when a hosted Couchbase service appears.
- Hadoop: too cumbersome and too much administration for now. Maybe when I hit CouchDB's limit.
- Elastic MapReduce: Don't know my precise elasticity need. Maybe when I hit CouchDB's limit.
- Column oriented databases: too infrastructure and administration intensive, although I like how they handle sparse data.
- Other products I'll be watching: Riak (for link walking and mapreduce), and Cassandra (for speed and for its way of doing CP, should I need this later)
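And here is the replication sketch referred to above: a one-shot pull of a hosted production database into a local research copy through CouchDB's _replicate endpoint. Host, credentials and database names are invented; setting :continuous to true would keep the local mirror in sync instead.

```clojure
(ns harvest.replication-sketch
  (:require [clj-http.client :as http]
            [cheshire.core :as json]))

;; Pull the hosted production database into a local research copy.
;; URLs, credentials and database names are invented for the example.
(http/post "http://127.0.0.1:5984/_replicate"
           {:body         (json/generate-string
                           {:source        "https://user:pass@example-account.cloudant.com/harvest"
                            :target        "harvest-research"
                            :create_target true
                            :continuous    false})
            :content-type :json})
```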
Functional programming: Clojure
- Best language I found to use all the way from research and prototyping to industrial deployment
- Compact, expressive and a delight to use
- Runs in interactive and compiled modes
- Wonderful devices for multi-threaded programming (see the sketch after this list)
- Deployability: runs on the Java, .Net and JavaScript engines.
- Java interoperability.
- Straightforward and elegant web frameworks.
- Also: Incanter for stats, Clutch and ClojureScript for CouchDB, Cascalog for Hadoop
- Compact ecosystem (for now)
- Hosted on Heroku
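As a small sketch of the research-to-deployment continuity I am after: the few lines below can be typed at the REPL while exploring, or compiled into a Heroku worker unchanged, and pmap gives parallel downloads for free. The source URLs are invented for the example.

```clojure
(ns harvest.fetch-sketch)

;; Hypothetical source list; real harvesting would target actual data feeds.
(def sources
  ["http://example.org/office-a/unemployment.json"
   "http://example.org/office-b/unemployment.json"])

(defn fetch
  "Download one source as text, so it can be stored as-is."
  [url]
  (try
    {:url url :status :ok :body (slurp url)}
    (catch Exception e                       ; plain Java interop
      {:url url :status :error :message (.getMessage e)})))

;; pmap fans the downloads out over a thread pool; doall forces the work now.
(def results (doall (pmap fetch sources)))
```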
Preferred to:
- Python: I think that functional programming will be a nicer fit for data analysis, web development and cloud deployment. Python has a huge choice of libraries, but its ecosystem feels a bit wide and fragmented, in particular with the ongoing transition from v2 to v3.
- Matlab: costly to deploy (although if I did a project for someone who already had Matlab then it would be the obvious choice).
- Scala and F#: for being multi-paradigm languages. For now, I prefer using a pure functional language (Clojure) and a pure object-oriented language (Java) that can interoperate; this will force me to do things the right way in each language, for best results.
Object programming: Java
- Multi-platform OO language
- No-cost development environment
- Libraries relevant to my projects (Weka, Gephi...)
- The language in which Hadoop is written.
- Huge hosting ecosystem. Hosted on Heroku.
Preferred to:
- .Net, as a result of all of the above
Summary
In a nutshell, Plan A is very much a bet on the trio Heroku, Clojure and CouchDB, with PostgreSQL as a safe SQL bet and Hadoop over AWS as a long-term cloud scaling environment.
What I need to assess now is whether Heroku and hosted CouchDB deliver good enough performance, and whether CouchDB meets all my high expectations (too high?).
To summarize how tools match the use case:
| Need | Plan A tools |
| --- | --- |
| Data analysis projects | |
| Heterogeneous and source-driven data | CouchDB |
| Data can be stored as text | CouchDB |
| Rapidity of development | CouchDB, Clojure |
| Robustness, data redundancy | CouchDB |
| Process redundancy | Heroku // in-house |
| Easy data harvesting | CouchDB, Clojure |
| Harvesting in the cloud, research in-house, production in the cloud | CouchDB, Clojure, Heroku |
| In-house research data mirror production data in the cloud | CouchDB |
| Web application projects | |
| Application-driven data | CouchDB, PostgreSQL |
| Rapidity of development, ease of deployment | Clojure, Heroku |
| Robustness | CouchDB, PostgreSQL |
| Single environment from data exploration to deployment | Clojure, CouchDB |
| Big data | CouchDB, AWS |
| Minimal administration | Heroku, CouchDB |
| Managing several environments | CouchDB |
| Minimal upfront cost, linear costs | Open source and freemium |
| Recognised ecosystem | AWS, Java |
| Good personal understanding | Java, Clojure, CouchDB, PostgreSQL, Heroku |
Your story?
I hope the above will help you formulate your own use case and find technologies that fulfil it. In particular, I hope it gives you an idea of how far you can take such an analysis based on the information found on the web, in forums and in books (plus a little experimentation on the side). I would love to hear your own stories, and to showcase use cases and solutions that are different from the above.
In the end, the one piece of advice I would give when approaching the space is: understand what you plan to use, and feel certain that you and your team can become intimate with it. For example: I don't think I can become intimate enough with EC2 in the short-term, so I am happily giving its flavour of elasticity up for now. I see myself becoming intimate with PostgreSQL more easily than with MySQL. I felt intimate with CouchDB soon after I started reading about it, and less so with other NoSQL solutions.
Now back to experimenting with all this. And I am looking forward (am I?) to telling you whether I change my mind or not!