My friend Dekel Tankel sent me some information about the next Hadoop user group, which takes place at the Yahoo campus in Santa Clara on Wednesday, October 21, 2009. Sounds like an interesting lineup and agenda:
6-6:15 - Socializing and Beers
6:15-6:45 - Mumak - Using Simulation for Large-scale Distributed System Verification and Debugging
Hong Tang - Yahoo!
Large-scale distributed systems such as MapReduce are notoriously hard to verify & debug. An effective approach to address many of these challenges is through simulation. In this talk, I am going to present Mumak, a MapReduce simulator, including its design and implementation, early experience how it is useful, and point out some future work.
6:45-7:15 - Cloudera Desktop in Detail
Philip Zeyliger, Cloudera
7:15-7:45 - Karmasphere Studio: A graphical IDE for Hadoop
7:45-8:00 - Q&A and Open Discussion
Todd Hoff at HighScalability.com published another excellent article entitled Are Cloud-Based Memory Architectures the Next Big Thing? Chock full of analysis, data, links, examples and references. IMHO, it's a must-read piece for developers and architects.
As I noted in the comments to Todd's post, in addition to the many benefits of memory-based architectures that Todd lists, there are also cost benefits in the cloud, which I discussed in Cloud Pricing and Application Architecture.
An excerpt from Todd's piece:
RAM = High Bandwidth and Low Latency
Why are Memory Based Architectures so attractive? Compared to disk RAM is a high bandwidth and low latency storage medium. Depending on who you ask the bandwidth of RAM is 5 GB/s. The bandwidth of disk is about 100 MB/s. RAM bandwidth is many hundreds of times faster. RAM wins. Modern hard drives have latencies under 13 milliseconds. When many applications are queued for disk reads latencies can easily be in the many second range. Memory latency is in the 5 nanosecond range. Memory latency is 2,000 times faster. RAM wins again.
RAM is the New Disk
The superiority of RAM is at the heart of the RAM is the New Disk paradigm. As an architecture it combines the holy quadrinity of computing:
Read the whole thing.
I have been giving a lot of thought lately to cloud pricing. As an adviser to companies from both sides of the issue -- cloud (IaaS and PaaS) providers and cloud users (and potential users) -- I've had an interesting perspective on the issue, which I will discuss in this and several future posts.
Here, I want to focus on how cloud pricing models (might) affect application architecture design decisions.
Even in traditional data centers and hosting services, software architects and developers give some consideration to the cost of the required hardware and software licenses to run their application. But more often than not, this is an afterthought.
Last May, Michael Janke published a post on his blog which tells the story of how he calculated that a certain query for a widget installed in a web app -- an extremely popular widget -- cost the company $250,000, mainly in servers for the database and RDBMS licenses.
From my experience, companies rarely get down to the level of calculating the costs of specific features, such as a particular query.
So while this kind of metric-based and rational approach is always advisable, things get even more interesting in the cloud.
In other words, whether you were planning on it or not, your real-time bill from your cloud provider will scream at you if a certain aspect of your application is particularly expensive, and it will do so at a very granular level, such as database reads and writes. And any improvement you make will have a direct effect on the bill: if you reduce database reads and writes by 10%, those charges will go down by 10%.
This is, of course, very different from the currently prevailing model of pricing by discrete "server units" in a traditional data center or a dedicated hosting environment. There, if you reduce database operations by 10%, you still own or rent that server, so the change has no effect on cost. Sure, if you have a very large application that runs on 100 or 1,000 servers, then such an optimization can yield some savings, and very large-scale apps generally do receive a much more thorough cost analysis, but again, typically as an afterthought and not at such a granular level.
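To make the difference concrete, here is a minimal sketch in Python of the two pricing models. Every number in it (the price per million operations, the capacity of a server, the monthly server price, the operation counts) is a made-up assumption for illustration, not any provider's actual rate.

```python
import math


def usage_based_cost(db_ops: int, price_per_million_ops: float) -> float:
    """Bill that scales directly with the number of database operations."""
    return db_ops / 1_000_000 * price_per_million_ops


def per_server_cost(db_ops: int, ops_per_server: int, price_per_server: float) -> float:
    """Bill for whole servers: each server is paid for in full, busy or not."""
    servers = max(1, math.ceil(db_ops / ops_per_server))
    return servers * price_per_server


# Hypothetical monthly workload and prices (assumptions, not real rates).
BASELINE_OPS = 500_000_000
REDUCED_OPS = 450_000_000          # the same workload after a 10% reduction
PRICE_PER_MILLION_OPS = 0.10       # usage-based model
OPS_PER_SERVER = 1_000_000_000     # capacity of one server per month
PRICE_PER_SERVER = 300.0           # server-unit model

for label, ops in (("baseline", BASELINE_OPS), ("10% fewer ops", REDUCED_OPS)):
    cloud = usage_based_cost(ops, PRICE_PER_MILLION_OPS)
    server = per_server_cost(ops, OPS_PER_SERVER, PRICE_PER_SERVER)
    print(f"{label}: usage-based ${cloud:.2f}, server-unit ${server:.2f}")
```

Under these assumptions the usage-based bill drops in proportion to the saved operations, while the server-unit bill stays put until the optimization is big enough to retire a whole server.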
Another interesting aspect is that cloud providers may offer a different cost structure than that of simply buying traditional servers. For example, they may charge a proportionally higher rate for local disk I/O operations (Amazon charges $0.10 per million I/O requests to EBS), something that would barely enter into consideration when buying or renting discrete servers (whether physical or virtual).
Which brings us to the topic of this post. Cloud pricing models will affect architectural choices (or at least they should). Todd Hoff discussed this issue in a HighScalability.com post entitled Cloud Programming Directly Feeds Cost Allocation Back Into Software Design:
Now software design decisions are part of the operations budget. Every algorithm decision you make will have dollar cost associated with it and it may become more important to craft algorithms that minimize operations cost across a large number of resources (CPU, disk, bandwidth, etc) than it is to trade off our old friends space and time.
Different cloud architecture will force very different design decisions. Under Amazon CPU is cheap whereas under [Google App Engine], CPU is a scarce commodity. Applications between the two niches will not be easily ported.
Todd recently updated this post with an example from a Google App Engine application in which:
So what architectural changes can you make to reduce costs on the cloud? Here's one example:
A while back I wrote a post about GigaSpaces and the Economics of Cloud Computing. GigaSpaces has been -- for those of you new to my blog -- my employer for the past 5 years. I gave five reasons why GigaSpaces will save costs on the cloud, but what I discuss above adds a sixth one. Because GigaSpaces uses an in-memory data grid as the "system-of-record" for your application, it significantly reduces database operations (in some cases a 10-to-1 reduction). In AWS, this could significantly reduce EBS and other charges. It also happens to be good architectural practice. For more on that see David Van Couvering's Why Not Put the Whole Friggin' Thing in Memory?
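Here is a rough sketch in Python of why keeping the system of record in memory cuts billable I/O. It is not the GigaSpaces API: `CountingStore`, `InMemoryGrid`, `run_workload` and the workload itself are hypothetical names and numbers invented for illustration, and the only figure taken from this post is the $0.10 per million EBS I/O requests.

```python
EBS_PRICE_PER_MILLION_IO = 0.10  # rate quoted earlier in this post


class CountingStore:
    """Stand-in for a disk-backed database that counts every I/O request."""

    def __init__(self):
        self.data = {}
        self.io_requests = 0

    def read(self, key):
        self.io_requests += 1
        return self.data.get(key)

    def write(self, key, value):
        self.io_requests += 1
        self.data[key] = value


class InMemoryGrid:
    """Memory is the system of record; the store is only touched on load and flush."""

    def __init__(self, store):
        self.store = store
        self.memory = {}
        self.dirty = set()

    def read(self, key):
        if key not in self.memory:
            self.memory[key] = self.store.read(key)  # one-time load per key
        return self.memory[key]

    def write(self, key, value):
        self.memory[key] = value  # no disk I/O here
        self.dirty.add(key)

    def flush(self):
        """Write each changed key to the store once, however often it changed."""
        for key in self.dirty:
            self.store.write(key, self.memory[key])
        self.dirty.clear()


def run_workload(target, updates=1000, keys=10):
    """Increment a small set of counters many times (a made-up workload)."""
    for i in range(updates):
        key = i % keys
        target.write(key, (target.read(key) or 0) + 1)


def io_charge(io_requests):
    return io_requests / 1_000_000 * EBS_PRICE_PER_MILLION_IO


direct = CountingStore()
run_workload(direct)

backing = CountingStore()
grid = InMemoryGrid(backing)
run_workload(grid)
grid.flush()

print("direct I/O requests:", direct.io_requests, "charge:", io_charge(direct.io_requests))
print("grid I/O requests:  ", backing.io_requests, "charge:", io_charge(backing.io_requests))
```

The size of the reduction depends entirely on the workload: the more often the same data is read and rewritten between flushes, the bigger the saving. A real deployment would also have to decide how often to flush and what happens to unflushed data on failure, which this sketch ignores.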
Taking this approach as an example, it could have saved a significant portion of Michael Janke's $250,000 query off the cloud, and perhaps an even bigger proportion on the cloud.
If anyone has other ideas on how architectural decisions could affect costs on the cloud, please share them in the comments.
P.S. This post is another example of Why (and What) Every Business Exec Should Know About Cloud Computing.
Dekel Tankel sent me a link to a recorded demo of the GigaSpaces CloudTools that he and Owen Taylor prepared. The demo shows a stock ticker application updated in real-time running on Amazon EC2 (using dummy data). In the demo, Owen shows how the application can auto-scale based on increased load and how it can heal itself in case any of the Amazon instances fail.
I enjoyed reading Ted Dziuba's I'm Going to Scale My Foot Up Your Ass, I really did. I like the 'tude and I like the style of writing. Reminiscent of the BileBlog (RIP), which some of my colleagues think is juvenile, but I think is hilarious. I loved it all, except for one big problem: Ted is dead wrong on the facts.
Now, I agree with Ted that in many cases the problem is shitty code. But that's exactly the point. Developers should not have to code in a way that requires a degree in computational mathematics (did I get your degree right, Ted?) to get their application to scale.
And I also agree with Ted that memcached isn't the end-all to scalability problems. In fact, I will probably go on a memcached-related rant some other time. But I do think it is a move in the right direction.
Bottom line, Ted, linear scalability does exist and architecture is the problem. To get what I am saying check out this and then read through Nati's blog. If you disagree then, let's talk. Did I mention I work for GigaSpaces?
Back in June of last year I wrote about our partnership with Microsoft and our plans to work together on a solution for scaling out computations on Microsoft Excel spreadsheets. Since then, Microsoft and GigaSpaces have both released joint material (see here on MSDN) and held joint events promoting the solution. The most up-to-date white paper on the solution can be found here.
And now Owen Taylor has produced a screencast that describes the Microsoft-GigaSpaces joint "Excel That Scales" solution, in which he walks you through the problem and the solution.
In many organizations -- for example in capital markets and oil & gas exploration -- Excel is used widely for complex computations and analytics. Excel is a flexible tool that many people are familiar with, so over time huge investments have been made in creating complex analytical models in Excel. However, it was never designed to be an enterprise-grade analytical tool. As data volumes grow, the need for real-time information intensifies and the number of users who wish to share the same computational logic and data increases, desktop-based Excel spreadsheets can no longer handle the load. Also, the functions they perform are becoming mission-critical, and valuable time and information could be lost in case of failure.
Enter the GigaSpaces solution. It combines the best of both worlds: Excel as the front-end and the power of your data center -- through GigaSpaces as the scale-out, highly-reliable application server -- at the back-end. In other words, the logic and the data are handled server-side with enterprise-grade reliability and performance.
Although I usually let personal attacks slip by, I couldn't let this post from Cameron Purdy remain unanswered, because it's kind of shameless. And BTW, sorry it took me some time, but like Cameron, I too am friggin' busy with real important stuff :-)
So I'll start with Cameron's ominous threat -- in answer to my question Where is Oracle? he said: "We are in your customer accounts. Every single one of them." I don't know if Cameron is naive enough to believe this (doubt it), is just trying to sound threatening (shaking in my boots) or is just spreading what he thinks is FUD, but obviously the threat is not very credible.
We all know that Oracle is a mean-ol' sales machine (throwing in Coherence for free to close an ELA on other products is pretty aggressive). Nothing new there. But fortunately, the world is bigger than Oracle, and not only does it have competitors, but there are many companies out there who wouldn't let an Oracle sales person set foot. And we will be in every single one of them...
The interesting thing is that the Oracle announcement of the Tangosol acquisition actually helped us close a few competitive deals faster in so-called Oracle shops, because customers said (and I quote one directly): "I don't want to put all my eggs in one basket." I just came back from a GigaSpaces off-site management meeting, and the sales execs could only come up with one case where we actually lost a deal because of the Oracle acquisition (and that was due to the aforementioned aggressive tactics).
Anyway, enough with the petty bickering (I'll get back to petty bickering later). I was making a bigger point: Oracle is not pursuing an innovation strategy, and is therefore losing its relevance for new applications.
In Where is Oracle? I quoted Nick Carr's analysis of Oracle's strategy, which ended with: "Through acquisitions and share gains, [Oracle] will milk the old model until the old model goes dry." Following the whole BEA situation, here's a perspective from Rob Hailstone of the Butler Group (via Java Developer's Journal):
Some 15 or more years ago I was in the audience at an IT conference, listening to Larry Ellison describing CA as a ‘bottom-feeder’.
Since I was working for CA at the time this rankled a bit, but it was certainly true that CA had evolved a good business model that included acquiring end-of-life software companies where there was a substantial user base in need of ongoing support.
The last few years has seen Oracle adopt a variation on this bottom-feeder business model – it doesn’t wait for the victim to be near the end, but puts the knife in at the first sign of weakness.
Well, I guess you can call that innovation.
Besides Oracle's own strategy, I don't think that anyone can be taken seriously claiming that all's well for Big-Ass commercial databases in a Web 2.0, scale-out world. I phrased my statement pretty carefully: "They overwhelmingly use MySQL and in many case use some other tier - such as their own file system or an in-memory cache, as the system of record for session or transaction state."
I'm sure that somewhere in the bowels of the organization of eBay, Amazon and Google there is an Oracle database or two installed, but that's not the point.
Now, back to the petty bickering. I don't get why Cameron keeps saying that we use Coherence for our web site. Oh, now I get it. I guess because our Wiki uses Atlassian Confluence and Coherence is used as a cache for Massive Confluence, he concluded that GigaSpaces uses Coherence. Clever boy! But seriously, not only do we not use Massive Confluence, by that token, every time Cameron trades on the NYSE he uses GigaSpaces.
Oh, I guess he's just using the famous Geva Perry tactic of "it's OK to make up anything and post it on my blog". I like it. Aggressive, Oracle-style.
They examined several methods and reached the conclusion that the optimal approach is distributed objects and the master-worker paradigm.
Sounds familiar? :-)
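For readers who have not run into the master-worker paradigm, here is a minimal single-machine sketch in Python. It is only an illustration of the pattern: threads and standard-library queues stand in for distributed workers and a shared space, the "work" is squaring a number, and the names `master` and `worker` are my own, not anything from the study mentioned above.

```python
import queue
import threading


def worker(tasks, results):
    """Take tasks until a None sentinel arrives, then exit."""
    while True:
        task = tasks.get()
        if task is None:
            break
        task_id, n = task
        results.put((task_id, n * n))  # the "work": square the number


def master(numbers, num_workers=4):
    """Split the job into tasks, hand them to workers, and gather the results."""
    tasks = queue.Queue()
    results = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for task_id, n in enumerate(numbers):
        tasks.put((task_id, n))
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    for w in workers:
        w.join()

    # Workers finish in arbitrary order, so reassemble by task id.
    collected = {}
    while not results.empty():
        task_id, value = results.get()
        collected[task_id] = value
    return [collected[i] for i in range(len(numbers))]


print(master([1, 2, 3, 4, 5]))
```

The master never needs to know which worker picked up which task, which is what makes it easy to add workers as the load grows.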
Earlier this week, GridToday published an excellent piece by Marc Jacobs of Lab49 (a GigaSpaces partner) entitled Grid in Financial Services: Past, Present and Future. What an eloquent, well-written piece.
Each trading day is a perfect storm. Every month, every quarter, the volume of data increases, the sophistication of algorithms and business processes grows, and the competitive pressure to get things done as quickly and efficiently as possible mounts.
Marc reviews the state of the union on distributed computing in financial markets, including the motivation (we can no longer throw expensive hardware at the problem, we need a new approach) and the current state of affairs (distributed computing is real, not academic, with commercial vendors and real implementations).
Our appetite for computing power isn’t satisfied with lone, uncoordinated machines. For financial services, distributed computing isn’t a luxury: it puts food on the table.
One of the challenges Marc observes is the fact that until recently (and still going on in some places) developers have been dealing with a lot of the "plumbing" issues of distributed computing and have been doing so in an inconsistent way. Part of the solution he sees is a new generation of vendors and products that address this:
The range of stable, usable distributed computing platforms -- such as those from Platform Computing, GigaSpaces and Digipede Technologies... Thus, it is becoming much rarer to find software development teams in financial services working on this type of plumbing.
However, one of the obstacles that Marc points out is the fact that various vendors are only dealing with one aspect of the issue or another, and rarely take an end-to-end approach. He has this to say, which is especially nice for GigaSpaces (my emphasis):
For example, while it is positive that there is a wave of vendor products that solve different parts of the distributed computing puzzle, few of them treat distributed application development as a holistic endeavor that encompasses many problems (i.e., job scheduling, event processing, data distribution and caching, security, deployment, APIs, IDEs, etc.) at once. Except for GigaSpaces, most distributed computing architectures require the assembly of infrastructure from several different vendors. While this does permit architectures built from best-of-breed solutions, it can be challenging to stitch the various pieces together into a coherent developer framework.
There are many other interesting topics, such as the need for both IT management of distributed systems and developer-friendliness. Again, he had something nice to say about GigaSpaces (and our friends at Digipede):
Unfortunately, few vendors have been able to make progress on both fronts. Some products, such as Digipede and GigaSpaces, are clearly more developer-friendly than others.
Highly recommended read. BTW, Marc has a great blog, which I read regularly: Serial to Parallel to Distributed.
As an aside, the Lab49 guys are a very sharp, well-spoken, experienced group that's worth paying attention to (see their group blog). I recently had the pleasure of doing a web seminar with Daniel Chait (founder and managing director of Lab49) for CMP. You can see it here.
Update: Here's Tom Groenfeldt's take on Marc's piece.
Thinking Out Cloud is a blog about cloud computing and the SaaS business model written by Geva Perry.