Benchmarking Amazon EC2: The wacky world of cloud performance

Before turning to the world of cloud computing, let's pause to remember the crazy days of the 1970s when the science of the assembly line wasn't well-understood and consumers discovered that each purchase was something of a gamble. This was perhaps most true at the car dealer, where the quality of new cars was so random that buyers began to demand to know the day a car rolled off the assembly line.

Cars built on Monday were said to be a bad risk because everyone was hung over, while cars slapped together on Friday suffered because everyone's mind was already on the weekend. The most paranoid buyers would scrutinize the calendar to see if the car emerged from the factory when midweek descents into bacchanalia like the World Series would fog the brains of everyone with a wrench.

Inconsistencies like that are largely forgotten by today's shoppers thanks to better testing, a greater devotion to quality, and a general exodus of manufacturing from places where people stay up too late drinking too much while watching too much sports. If anything, the modern factories spit out identical items with a consistency that borders on boring. When modern movies show an assembly line, it's often a charisma-free mechanism stamping out perfect duplicates.

Those memories of the 1970s came back to me when I started running benchmark tests on cloud computers. While computers are normally pumped out with such clonelike perfection that adjectives like "soulless" spring to mind, I started to discover that the clone metaphor may not be the best one for the machines available for rent from the compute clouds. The cloud machines are not all alike, produced with precision like the boxes that sit on our desks. They're a bit like those cars of the '70s.

I started learning this after several engineers from the cloud companies groused at the benchmark tests in my big survey of public clouds. The tests weren't sophisticated enough, I was told.

My cloud benchmark results were never meant to be a definitive answer, just a report on what the user might experience walking through the virtual door. Like Consumer Reports, I signed up for a machine, uploaded Java code, and wrote down the results. While the DaCapo benchmark suite has advantages such as processor independence (Java) and a wide collection of popular server-side code (Jython, Tomcat), the results were scattered. The only solution, I concluded, was to test the very code you plan to run because that's the only way to figure out which cloud machines will deliver the best performance for your particular mix of I/O, disk access, and computation.
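That do-it-yourself approach doesn't require much machinery. Here's a minimal sketch in Python of timing repeated runs of a workload and summarizing the spread. The command used here is a placeholder; substitute whatever you actually plan to run, such as `["java", "-jar", "your-app.jar"]`:

```python
import statistics
import subprocess
import sys
import time

def time_runs(cmd, runs=3):
    """Run cmd repeatedly, returning wall-clock seconds for each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations

if __name__ == "__main__":
    # Placeholder workload: a do-nothing Python process. Replace with the
    # real command you intend to deploy on the cloud instance.
    times = time_runs([sys.executable, "-c", "pass"])
    print("runs:", [round(t, 3) for t in times])
    print("mean: %.3fs  stdev: %.3fs"
          % (statistics.mean(times), statistics.stdev(times)))
```

Running this at different times of day, on freshly launched instances, is the cheapest way to see the variability described below for yourself.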

One engineer didn't like this approach. He started asking pointed questions that were surprisingly complex. What was the precise methodology? Which operating system did the tests run upon? How often were the tests run? Did the tests control for the time of day?

Time of day? Yes, the engineer said. He really wanted to know whether we were paying attention to when we ran the tests. Sure, operating systems are an obvious source of performance differences, and using the latest drivers and patches always makes sense. But he also wanted to know the time of day.

This was new. CPUs have clocks that tick extremely quickly, but they'll stamp out computation at the same rate morning, noon, and night. Was it really important to watch the time? Yes, he said. In other words, the cloud machines may be more like a '70s-era Detroit assembly line than a Swiss watch.

Cloud Camaros and Corvettes

To find out just how variable the cloud might be, I fired up instances on Amazon's EC2. I chose Amazon because it's one of the industry's leaders with one of the most sophisticated clouds, but the conclusions here are probably just as true for any of the other clouds -- they're all playing the same game. They're building a superfast rack of computers and finding a clever way to share it with many users. Even though the sales literature makes it seem as if you're buying another computer like the one sitting on your desk, it's usually closer to buying into a time-share for a condo at the beach.

At this point, I should pause and put aside these glass-half-empty allusions to Detroit and real estate time-shares. A gracious defender of the cloud would say that sometimes you get what you pay for, but often you get lucky. On a bad day, you could end up sharing a CPU with 10 crazy guys trying to get rich minting Bitcoins by melting the processor; on a good day, you could end up sharing a machine with 10 grandmothers who only update their Web pages with a new copy of the church bulletin each Sunday, and the CPU will give you many extra cycles for free. In other words, the CPU glass is not only half-full, but the bartender will often top off your tumbler with the last remnants of the bottle too.

The results were surprising. I purchased two kinds of machines: the low-end T1 Micro instance and the medium version called, appropriately, an M1 Medium. The Micro is a test machine with 613MB of RAM and a promise of "up to two EC2 Compute Units (for short periodic bursts)." The Medium comes with 3.75GB of RAM and a promise of one "virtual core with two EC2 Compute Units." The Micro is listed at 2 cents per hour and the Medium at 12 cents per hour. Both are list prices for an "on demand" instance available to anyone walking in the door; lower prices are available to those who make a longer-term commitment with a reservation.

In both cases, I started with the stock 64-bit Amazon Linux machine image that comes preloaded with a customized version of OpenJDK 1.6.0_24. After updating all of the packages with Yum, I downloaded the DaCapo benchmarks and started them running immediately.

Micros, Mediums, lemons

In all, I ran the benchmarks 36 times on 10 different machines (five Micro and five Medium). In general, the Micro instances were much slower than the Mediums, but more than a few times a Micro actually ran faster. This generally happened soon after the machine started, probably because EC2 was allowing the machine to "burst" faster and run free. This generosity would usually fade, and the next run would often be dramatically slower -- sometimes 10 times slower. Yes, 10 times slower.

The performance of the Micro machines varied dramatically. One instance indexed files with Lucene in 3.4, 4.0, and 4.1 seconds across three runs -- as predictable as a watch. But another instance started at 3.4 seconds, then took 39 seconds for the second run and 34 seconds for the third. A third instance took 14, 47, and 18 seconds to build the same Lucene index from the same files. The Micro's results were all over the map.
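One way to put a number on "all over the map" is the coefficient of variation (standard deviation divided by mean): a quick illustration using the Lucene indexing times quoted above. The instance labels are mine, not Amazon's:

```python
import statistics

# Lucene indexing times (seconds) for three Micro instances,
# three runs each, as reported above.
instances = {
    "steady":    [3.4, 4.0, 4.1],
    "erratic_a": [3.4, 39.0, 34.0],
    "erratic_b": [14.0, 47.0, 18.0],
}

for name, runs in instances.items():
    mean = statistics.mean(runs)
    cv = statistics.stdev(runs) / mean  # coefficient of variation
    print(f"{name}: mean {mean:.1f}s, cv {cv:.0%}")
```

The steady instance comes out with a coefficient of variation under 10 percent; the erratic ones land above 70 percent, which is another way of saying the same machine type can behave like two entirely different products.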

In the DaCapo tests, performance of Amazon's T1 Micro instances was everything but consistent. (Download a PDF of the T1 Micro results.)

To make matters worse, the Micro instances would sometimes fail to finish. Sometimes there was a 404 error caused by a broken request for a Web page in the Web services simulations (Tradesoap and Tradebeans). Sometimes the bigger jobs just died with a one-line error message: "killed." The deaths, though, weren't predictable. It's not like one benchmark would always crash the machine.

Well, there was one case where I could start to feel that death was imminent. One instance was a definite lemon, and its sluggishness was apparent from the beginning. The Eclipse test is one of the more demanding DaCapo benchmarks. The other Micro machines would usually finish it in between 500 and 600 seconds, but the lemon started off at more than 900 seconds and got worse. By the third run, it took 2,476 seconds, almost five times slower than its brethren.

This wasn't necessarily surprising. This machine started up on a Thursday just after lunch on the East Coast, probably one of the moments when the largest part of America is wide awake and browsing the Web. Some of the faster machines started up at 6:30 in the morning on the East Coast.

While I normally shut down the machines after the tests were over, I kept the lemon around to play with it. It didn't get better. By late in the afternoon, it was crashing. I would come back to find messages that the machine had dropped communications, leaving my desktop complaining about a "broken pipe." Several times the lemon couldn't finish more than a few of the different tests.

For grins, I fired up the same tests on the same machine a few days later on Sunday morning on the East Coast. The first test went well. The lemon powered through Avrora in 18 seconds, a time comparable to the results often reported by the Medium machines. But after that the lemon slowed down dramatically, taking 3,120 seconds to finish the Eclipse test.

Up from Micro

The Medium machines were much more consistent. They never failed to run a benchmark and reported times that didn't vary as much. But even these numbers weren't that close to a Swiss watch. One Medium machine reported times of 16.7, 16.3, and 17.5 seconds for the Avrora test, while another reported 14.9, 14.8, and 14.8. Yet another machine ran it in 13.3 seconds.

Some Medium machines were more than 10 percent faster than others, and it seemed like they arrived with the luck of the draw. The speed of the Medium machines was consistent across most of the benchmarks and didn't seem as tied to the time of day.
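Using the Avrora runs quoted above -- and treating the single 13.3-second run as that machine's typical time, which is an assumption on my part -- the gap between the fastest and slowest Medium machines is easy to quantify:

```python
import statistics

# Mean Avrora times (seconds) per Medium machine, from the runs quoted above.
machines = [
    statistics.mean([16.7, 16.3, 17.5]),
    statistics.mean([14.9, 14.8, 14.8]),
    13.3,  # only one run was quoted for this machine
]

fastest, slowest = min(machines), max(machines)
# Fraction of run time the fastest machine shaves off the slowest.
print(f"fastest machine cuts the slowest machine's time by "
      f"{(slowest - fastest) / slowest:.0%}")
```

That works out to roughly a 20 percent spread between machines of the same instance type at the same price, which is what "luck of the draw" means in practice.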

The performance of the Medium machines also suggests that RAM may be just as important as the CPU cores -- but only if you need it. The Medium has roughly six times more RAM than the Micro; not surprisingly, it costs six times as much. But on the Avrora benchmark, the Micro often ran faster than the Medium or only a few times slower. On the Tomcat benchmark, it never ran faster and routinely finished between four and six times slower.

Performance of Amazon's M1 Medium instances was much more consistent. Unlike the Micros, the Mediums never failed to complete a test run. (Download a PDF of the M1 Medium results.)

In other cases, the Micro just melted down. On the Eclipse test, the Micro was occasionally about five times slower than the Medium, but it was often eight to 10 times slower. Once on the Eclipse test and several times on other tests, the Micro failed completely. The lack of RAM left an unstable machine. (Note that these several failures don't include the dozens of failures of the lemon, which crashed much more consistently than the other Micro instances I tested.)

These experiments, while still basic, show that packaging performance in the cloud is much trickier than with stand-alone machines. Not every machine will behave the same way. The effect of other users on the hardware and the network can and will distort any measurement.

The M1 Medium turned in consistent numbers on some DaCapo tests (such as avrora, above), but not-so-consistent on others (such as the eclipse test, below).

Between a rock and a slow place

To make matters worse, cloud companies are in a strange predicament. If they have spare cycles sitting around, it would be a shame to waste them. Yet if they give them out, the customers might get accustomed to better performance. No one notices when the machines run faster, but everyone starts getting angry if the machines slow down.

Such expectations make it harder for the companies to roll out newer hardware. If new chips that are 10 percent faster appear, those racks will run 10 percent faster. Everyone will be happy until they get stuck with an instance on the older racks.

Cloud users need to adjust their expectations or at least relax the error bars around the expectations. The cloud doesn't deliver performance with the same precision as the dedicated hardware sitting on your desk. It's not necessarily a good idea to demand it either, because the cloud company's only solution to a demand for precision is to eliminate any kind of bursting altogether. They would need to limit performance to the lowest common denominator.

In other words, the '70s may not have been the best years for Detroit's consistency, but they still turned out some great cars. The Camaros, Mustangs, Trans-Ams, and Corvettes often ran quite well. The assembly lines weren't perfect, but they were good most of the time -- until they produced a lemon. Like owners of those old muscle cars, drivers of cloud machines should keep a close watch on performance.

This story, "Benchmarking Amazon EC2: The wacky world of cloud performance" was originally published by InfoWorld.


Copyright © 2013 IDG Communications, Inc.
