We have a very large number of relatively small files (~5k avg, 41k max, 0k min) that we access a lot (20M times a day) for various computations. Currently, all the data is stored on a single server: a very ad-hoc solution that was OK until now but is no longer acceptable in terms of service level, redundancy, backup, and so forth. We add approximately 5K new files a day, and data is never deleted.
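To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes "5k" means 5 KB and that accesses are spread evenly over the day, neither of which is guaranteed, so real peaks may be considerably higher.

```python
# Figures quoted above. Treating "5k" as 5 KB is an assumption,
# and so is the even spread of accesses across the day.
ACCESSES_PER_DAY = 20_000_000
NEW_FILES_PER_DAY = 5_000
AVG_FILE_KB = 5
SECONDS_PER_DAY = 24 * 60 * 60

# Average read rate and the read bandwidth it implies.
avg_reads_per_second = ACCESSES_PER_DAY / SECONDS_PER_DAY
avg_read_mb_per_second = avg_reads_per_second * AVG_FILE_KB / 1024

# Storage growth, given that data is never deleted.
daily_growth_mb = NEW_FILES_PER_DAY * AVG_FILE_KB / 1024
yearly_growth_gb = daily_growth_mb * 365 / 1024

print(f"average reads per second: {avg_reads_per_second:.0f}")
print(f"average read MB/s:        {avg_read_mb_per_second:.2f}")
print(f"daily growth (MB):        {daily_growth_mb:.1f}")
print(f"yearly growth (GB):       {yearly_growth_gb:.1f}")
```

Under these assumptions the load averages out to a bit over 230 reads per second, and the data set grows by roughly 24 MB a day. The read rate, not the storage volume, is the demanding part of the workload.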
We want a system that supplies:
- High availability
- Easily scalable
- Built-in support for (or mechanisms of) backup and replication
- Load balancing
- Fail-over mechanism
- A small amount of downtime is acceptable (90% SLA, not 99%; we can make do without the system for short, infrequent periods of time)
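To make the SLA figures concrete, the small sketch below converts an availability target into the downtime it permits. The helper name `allowed_downtime` is ours, purely for illustration, and the calculation assumes the SLA is measured over wall-clock time.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def allowed_downtime(availability: float) -> tuple:
    """Return (hours per day, days per year) of downtime an availability target permits."""
    unavailable = 1.0 - availability
    return unavailable * HOURS_PER_DAY, unavailable * DAYS_PER_YEAR


for target in (0.90, 0.99):
    hours_per_day, days_per_year = allowed_downtime(target)
    print(f"{target:.0%} SLA: {hours_per_day:.2f} h/day, {days_per_year:.2f} days/year")
```

A 90% target leaves room for about 2.4 hours of downtime a day on average, ten times the budget of a 99% target.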
We reviewed several options, and the 3 finalists were:
- EHCacheServer
- Using TerraCotta and EHCache (+ an in-house hashing layer to map files to specific machines)
- Hadoop + HBase
EHCacheServer sounds like the perfect solution, but we found so little information about it that it spooked us a little. We know, use and like EHCache, but the lack of samples and documentation for the server is off-putting.
Using TerraCotta and EHCache (+ an in-house hashing layer to map files to specific machines): some of us still think this might be a better solution, but the general feeling is that bending the TC solution into what we want takes it too far from TC's main idea, and that is bound to affect other efforts.
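For illustration, the in-house hashing layer could be as simple as the sketch below. This is an assumption about how such a layer might look, not something we built: the machine names are hypothetical, and plain modulo hashing is just the simplest scheme that gives a stable file-to-machine mapping.

```python
import hashlib

# Hypothetical machine names, for illustration only.
NODES = ["cache-01", "cache-02", "cache-03"]


def node_for(file_name: str, nodes=NODES) -> str:
    """Map a file name to one machine using a stable hash of the name."""
    # hashlib is used instead of the built-in hash(), which is salted
    # per process for strings and would give a different mapping on each run.
    digest = hashlib.md5(file_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[bucket]


print(node_for("report-0001.dat"))
```

The obvious weakness of modulo hashing is that adding or removing a machine remaps most files. A consistent-hashing ring would limit that, at the cost of more in-house code to maintain.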
Hadoop + HBase seems like it was literally built for us: it has high availability, it's easily scalable, the Hadoop replication mechanism is a great backup and replication solution, and the replication also enables easy load balancing.
What did trouble us is the lack of a fail-over mechanism. However, since Hadoop and HBase do have partial mechanisms to support fail-over (the SecondaryNameNode, the FS image, and so forth), we think we can build our own fail-over mechanism using Linux Heartbeat.
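The idea behind heartbeat-based fail-over can be sketched in a few lines. Note that this is not Linux Heartbeat's API or configuration; it is a toy model of the concept, with hypothetical node names and a made-up `deadtime_s` threshold: each node reports in periodically, and a node that stays silent for longer than the threshold is treated as dead, at which point the standby takes over.

```python
class HeartbeatMonitor:
    """Toy model: tracks when each node was last heard from."""

    def __init__(self, deadtime_s: float):
        self.deadtime_s = deadtime_s  # silence longer than this means "dead"
        self.last_seen = {}

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat received from `node` at time `now` (seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> list:
        """Return the known nodes that have been silent for more than deadtime_s."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.deadtime_s
        )


def choose_active(primary: str, standby: str, monitor: HeartbeatMonitor, now: float) -> str:
    """Serve from the primary unless it has been declared dead."""
    return standby if primary in monitor.dead_nodes(now) else primary


# Hypothetical node names; both report at t=0, then only the standby keeps reporting.
monitor = HeartbeatMonitor(deadtime_s=10)
monitor.beat("namenode-primary", now=0)
monitor.beat("namenode-standby", now=0)
print(choose_active("namenode-primary", "namenode-standby", monitor, now=5))
monitor.beat("namenode-standby", now=12)
print(choose_active("namenode-primary", "namenode-standby", monitor, now=12))
```

Detection is the easy part. The real work in a fail-over is making sure the standby has up-to-date state before it takes over, which is where the FS image and the other partial mechanisms would come in.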
So we set off to benchmark the Hadoop & HBase solution. This is its story.