
Grep a Tab

To grep for a TAB character, use the following syntax (the $'\t' form is Bash’s ANSI-C quoting, which turns \t into a literal tab):
[root@Linux~]# grep $'\t' file.txt



Math For Programmers

I’ve been working for the past 15 months on repairing my rusty math skills, ever since I read a biography of Johnny von Neumann. I’ve read a huge stack of math books, and I have an even bigger stack of unread math books. And it’s starting to come together.

Let me tell you about it.

Conventional Wisdom Doesn’t Add Up

First: programmers don’t think they need to know math. I hear that so often; I hardly know anyone who disagrees. Even programmers who were math majors tell me they don’t really use math all that much! They say it’s better to know about design patterns, object-oriented methodologies, software tools, interface design, stuff like that.

And you know what? They’re absolutely right. You can be a good, solid, professional programmer without knowing much math.

But hey, you don’t really need to know how to program, either. Let’s face it: there are a lot of professional programmers out there who realize they’re not very good at it, and they still find ways to contribute.

If you’re suddenly feeling out of your depth, and everyone appears to be running circles around you, what are your options? Well, you might discover you’re good at project management, or people management, or UI design, or technical writing, or system administration, any number of other important things that “programmers” aren’t necessarily any good at. You’ll start filling those niches (because there’s always more work to do), and as soon as you find something you’re good at, you’ll probably migrate towards doing it full-time.

In fact, I don’t think you need to know anything, as long as you can stay alive somehow.

So they’re right: you don’t need to know math, and you can get by for your entire life just fine without it.

But a few things I’ve learned recently might surprise you:

    1. Math is a lot easier to pick up after you know how to program. In fact, if you’re a halfway decent programmer, you’ll find it’s almost a snap.
    2. They teach math all wrong in school. Way, WAY wrong. If you teach yourself math the right way, you’ll learn faster, remember it longer, and it’ll be much more valuable to you as a programmer.
    3. Knowing even a little of the right kinds of math can enable you to write some pretty interesting programs that would otherwise be too hard. In other words, math is something you can pick up a little at a time, whenever you have free time.
    4. Nobody knows all of math, not even the best mathematicians. The field is constantly expanding, as people invent new formalisms to solve their own problems. And with any given math problem, just like in programming, there’s more than one way to do it. You can pick the one you like best.
    5. Math is… ummm, please don’t tell anyone I said this; I’ll never get invited to another party as long as I live. But math, well… I’d better whisper this, so listen up: (it’s actually kinda fun.)

The Math You Learned (And Forgot)

Here’s the math I learned in school, as far as I can remember:

Grade School: Numbers, Counting, Arithmetic, Pre-Algebra (“story problems”)

High School: Algebra, Geometry, Advanced Algebra, Trigonometry, Pre-Calculus (conics and limits)

College: Differential and Integral Calculus, Differential Equations, Linear Algebra, Probability and Statistics, Discrete Math

How’d they come up with that particular list for high school, anyway? It’s more or less the same courses in most U.S. high schools. I think it’s very similar in other countries, too, except that their students have finished the list by the time they’re nine years old. (Americans really kick butt at monster-truck competitions, though, so it’s not a total loss.)

Algebra? Sure. No question. You need that. And a basic understanding of Cartesian geometry, too. Those are useful, and you can learn everything you need to know in a few months, give or take. But the rest of them? I think an introduction to the basics might be useful, but spending a whole semester or year on them seems ridiculous.

I’m guessing the list was designed to prepare students for science and engineering professions. The math courses they teach in high school don’t help ready you for a career in programming, and the simple fact is that demand for programmers is rapidly outpacing demand for every other kind of engineering role.

And even if you’re planning on being a scientist or an engineer, I’ve found it’s much easier to learn and appreciate geometry and trig after you understand what exactly math is — where it came from, where it’s going, what it’s for. No need to dive right into memorizing geometric proofs and trigonometric identities. But that’s exactly what high schools have you do.

So the list’s no good anymore. Schools are teaching us the wrong math, and they’re teaching it the wrong way. It’s no wonder programmers think they don’t need any math: most of the math we learned isn’t helping us.

The Math They Didn’t Teach You

The math computer scientists use regularly, in real life, has very little overlap with the list above. For one thing, most of the math you learn in grade school and high school is continuous: that is, math on the real numbers. For computer scientists, 95% or more of the interesting math is discrete: i.e., math on the integers.

I’m going to talk in a future blog about some key differences between computer science, software engineering, programming, hacking, and other oft-confused disciplines. I got the basic framework for these (upcoming) insights in no small part from Richard Gabriel’s Patterns Of Software, so if you absolutely can’t wait, go read that. It’s a good book.

For now, though, don’t let the term “computer scientist” worry you. It sounds intimidating, but math isn’t the exclusive purview of computer scientists; you can learn it all by yourself as a closet hacker, and be just as good (or better) at it than they are. Your background as a programmer will help keep you focused on the practical side of things.

The math we use for modeling computational problems is, by and large, math on discrete integers. This is a generalization. If you’re with me on today’s blog, you’ll be studying a little more math from now on than you were planning to before today, and you’ll discover places where the generalization isn’t true. But by then, a short time from now, you’ll be confident enough to ignore all this and teach yourself math the way you want to learn it.

For programmers, the most useful branch of discrete math is probability theory. What’s probability theory, you ask? Why, it’s counting. How many ways are there to make a Full House in poker? Or a Royal Flush? Whenever you think of a question that starts with “how many ways…” or “what are the odds…”, it’s a probability question. And as it happens (what are the odds?), it all just turns out to be “simple” counting. It starts with flipping a coin and goes from there. It’s definitely the first thing they should teach you in grade school after you learn Basic Calculator Usage.
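Counting questions like these are easy to check by machine. Here’s a short Python sketch (using the standard library’s math.comb) that counts Full House hands the way a combinatorics text would: pick the triple’s rank and suits, then the pair’s rank and suits.

```python
from math import comb

# A Full House is three cards of one rank plus two of another:
#   13 ranks for the triple, C(4,3) ways to pick its suits,
#   12 remaining ranks for the pair, C(4,2) ways to pick its suits.
full_houses = 13 * comb(4, 3) * 12 * comb(4, 2)

# All possible 5-card hands from a 52-card deck.
total_hands = comb(52, 5)

print(full_houses)                # 3744
print(total_hands)                # 2598960
print(full_houses / total_hands)  # roughly 0.00144, the odds of a Full House
```

The whole calculation is multiplication of a few binomial coefficients, which is exactly the “simple counting” the paragraph above is talking about.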

I still have my discrete math textbook from college. It’s a bit heavyweight for a third-grader (maybe), but it does cover a lot of the math we use in “everyday” computer science and computer engineering.

Oddly enough, my professor didn’t tell me what it was for. Or I didn’t hear. Or something. So I didn’t pay very close attention: just enough to pass the course and forget this hateful topic forever, because I didn’t think it had anything to do with programming. That happened in quite a few of my comp sci courses in college, maybe as many as 25% of them. Poor me! I had to figure out what was important on my own, later, the hard way.

I think it would be nice if every math course spent a full week just introducing you to the subject, in the most fun way possible, so you know why the heck you’re learning it. Heck, that’s probably true for every course.

Aside from probability and discrete math, there are a few other branches of mathematics that are potentially quite useful to programmers, and they usually don’t teach them in school, unless you’re a math minor. This list includes:

    • Statistics, some of which is covered in my discrete math book, but it’s really a discipline of its own. A pretty important one, too, but hopefully it needs no introduction.
    • Algebra and Linear Algebra (i.e., matrices). They should teach Linear Algebra immediately after algebra. It’s pretty easy, and it’s amazingly useful in all sorts of domains, including machine learning.
    • Mathematical Logic. I have a really cool totally unreadable book on the subject by Stephen Kleene, the inventor of the Kleene closure and, as far as I know, Kleenex. Don’t read that one. I swear I’ve tried 20 times, and never made it past chapter 2. If anyone has a recommendation for a better introduction to this field, please post a comment. It’s obviously important stuff, though.
    • Information Theory and Kolmogorov Complexity. Weird, eh? I bet none of your high schools taught either of those. They’re both pretty new. Information theory is (veeery roughly) about data compression, and Kolmogorov Complexity is (also roughly) about algorithmic complexity. I.e., how small can you make it, how long will it take, how elegant can the program or data structure be, things like that. They’re both fun, interesting and useful.

There are others, of course, and some of the fields overlap. But it just goes to show: the math that you’ll find useful is pretty different from the math your school thought would be useful.

What about calculus? Everyone teaches it, so it must be important, right?

Well, calculus is actually pretty easy. Before I learned it, it sounded like one of the hardest things in the universe, right up there with quantum mechanics. Quantum mechanics is still beyond me, but calculus is nothing. After I realized programmers can learn math quickly, I picked up my calculus textbook and got through the entire thing in about a month, reading for an hour an evening.

Calculus is all about continuums — rates of change, areas under curves, volumes of solids. Useful stuff, but the exact details involve a lot of memorization and a lot of tedium that you don’t normally need as a programmer. It’s better to know the overall concepts and techniques, and go look up the details when you need them.

Geometry, trigonometry, differentiation, integration, conic sections, differential equations, and their multidimensional and multivariate versions — these all have important applications. It’s just that you don’t need to know them right this second. So it probably wasn’t a great idea to make you spend years and years doing proofs and exercises with them, was it? If you’re going to spend that much time studying math, it ought to be on topics that will remain relevant to you for life.

The Right Way To Learn Math

The right way to learn math is breadth-first, not depth-first. You need to survey the space, learn the names of things, figure out what’s what.

To put this in perspective, think about long division. Raise your hand if you can do long division on paper, right now. Hands? Anyone? I didn’t think so.

I went back and looked at the long-division algorithm they teach in grade school, and damn if it isn’t annoyingly complicated. It’s deterministic, sure, but you never have to do it by hand, because it’s easier to find a calculator, even if you’re stuck on a desert island without electricity. You’ll still have a calculator in your watch, or your dental filling, or something.

Why do they even teach it to you? Why do we feel vaguely guilty if we can’t remember how to do it? It’s not as if we need to know it anymore. And besides, if your life were on the line, you know you could perform long division on arbitrarily large numbers. Imagine you’re imprisoned in some slimy 3rd-world dungeon, and the dictator there won’t let you out until you’ve computed 219308862/103503391. How would you do it? Well, easy. You’d start subtracting the denominator from the numerator, keeping a counter, until you couldn’t subtract it anymore, and that’d be the remainder. If pressed, you could figure out a way to continue using repeated subtraction to estimate the remainder as a decimal number (in this case, 0.1188567822, or so my Emacs M-x calc tells me. Close enough!)

You could figure it out because you know that division is just repeated subtraction. The intuitive notion of division is deeply ingrained now.
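That intuition translates directly into code. Here’s a minimal Python sketch (the helper name divide_by_subtraction is made up for illustration) that computes the whole quotient and then the decimal digits of the remainder using nothing but repeated subtraction:

```python
def divide_by_subtraction(numerator, denominator, decimals=10):
    """Integer division as repeated subtraction, then extend the
    remainder into decimal digits the same way: shift left by 10
    and keep subtracting."""
    quotient = 0
    while numerator >= denominator:
        numerator -= denominator
        quotient += 1

    digits = []
    remainder = numerator
    for _ in range(decimals):
        remainder *= 10
        digit = 0
        while remainder >= denominator:
            remainder -= denominator
            digit += 1
        digits.append(str(digit))
    return quotient, "".join(digits)

q, frac = divide_by_subtraction(219308862, 103503391)
print(q, "0." + frac)  # 2 0.1188567821
```

Slow, sure, but it is exactly the dungeon escape plan from the paragraph above, and it works for arbitrarily large numbers because Python integers are unbounded.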

The right way to learn math is to ignore the actual algorithms and proofs, for the most part, and to start by learning a little bit about all the techniques: their names, what they’re useful for, approximately how they’re computed, how long they’ve been around, (sometimes) who invented them, what their limitations are, and what they’re related to. Think of it as a Liberal Arts degree in mathematics.

Why? Because the first step to applying mathematics is problem identification. If you have a problem to solve, and you have no idea where to start, it could take you a long time to figure it out. But if you know it’s a differentiation problem, or a convex optimization problem, or a boolean logic problem, then you at least know where to start looking for the solution.

There are lots and lots of mathematical techniques and entire sub-disciplines out there now. If you don’t know what combinatorics is, not even the first clue, then you’re not very likely to be able to recognize problems for which the solution is found in combinatorics, are you?

But that’s actually great news, because it’s easier to read about the field and learn the names of everything than it is to learn the actual algorithms and methods for modeling and computing the results. In school they teach you the Chain Rule, and you can memorize the formula and apply it on exams, but how many students really know what it “means”? So they won’t know to apply the formula when they run across a chain-rule problem in the wild. Ironically, it’s easier to know what the rule is for than to memorize and apply the formula. The chain rule is just how to take the derivative of “chained” functions: function f() calls function g(), and you want the derivative of f(g()). Well, programmers know all about functions; we use them every day, so it’s much easier to imagine the problem now than it was back in school.
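If you want to convince yourself the rule works without memorizing anything, a numerical sanity check is enough. This sketch (the function names are mine, purely for illustration) compares the chain-rule answer for f(u) = u**2 composed with g(x) = sin(x) against a finite-difference estimate of the same derivative:

```python
import math

# f(g(x)) = sin(x)**2. The chain rule says its derivative is
# f'(g(x)) * g'(x) = 2*sin(x) * cos(x).

def composed(x):
    return math.sin(x) ** 2

def chain_rule_derivative(x):
    return 2 * math.sin(x) * math.cos(x)  # f'(g(x)) * g'(x)

x = 0.7
h = 1e-6
# Central finite difference: (f(x+h) - f(x-h)) / 2h approximates f'(x).
numeric = (composed(x + h) - composed(x - h)) / (2 * h)

print(abs(numeric - chain_rule_derivative(x)) < 1e-6)  # True
```

The two numbers agree to many decimal places, which is the programmer’s way of “getting” the chain rule: it is just the derivative of a function call that wraps another function call.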

Which is why I think they’re teaching math wrong. They’re doing it wrong in several ways. They’re focusing on specializations that aren’t proving empirically to be useful to most high-school graduates, and they’re teaching those specializations backwards. You should learn how to count, and how to program, before you learn how to take derivatives and perform integration.

I think the best way to start learning math is to spend 15 to 30 minutes a day surfing in Wikipedia. It’s filled with articles about thousands of little branches of mathematics. You start with pretty much any article that seems interesting (e.g. String theory, say, or the Fourier transform, or Tensors, anything that strikes your fancy.) Start reading. If there’s something you don’t understand, click the link and read about it. Do this recursively until you get bored or tired.

Doing this will give you amazing perspective on mathematics, after a few months. You’ll start seeing patterns — for instance, it seems that just about every branch of mathematics that involves a single variable has a more complicated multivariate version, and the multivariate version is almost always represented by matrices of linear equations. At least for applied math. So Linear Algebra will gradually bump its way up your list, until you feel compelled to learn how it actually works, and you’ll download a PDF or buy a book, and you’ll figure out enough to make you happy for a while.
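As a tiny taste of why matrices keep showing up, here’s a sketch (the solve_2x2 helper is made up for illustration) that solves a two-equation, two-unknown linear system with Cramer’s rule, i.e., ratios of determinants. This is the simplest possible instance of the multivariate pattern described above:

```python
# Solve the system
#   2x + 1y = 5
#   1x + 3y = 10
# by Cramer's rule: each unknown is a ratio of 2x2 determinants.

def solve_2x2(a, b, c, d, e, f):
    """Solve [[a, b], [c, d]] applied to (x, y) equals (e, f)."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular system: no unique solution")
    x = (e * d - b * f) / det  # replace first column with (e, f)
    y = (a * f - e * c) / det  # replace second column with (e, f)
    return x, y

x, y = solve_2x2(2, 1, 1, 3, 5, 10)
print(x, y)  # 1.0 3.0
```

Scale the same idea up to n equations in n unknowns and you have the bread and butter of linear algebra, which is why it keeps bumping its way up the list.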

With the Wikipedia approach, you’ll also quickly find your way to the Foundations of Mathematics, the Rome to which all math roads lead. Math is almost always about formalizing our “common sense” about some domain, so that we can deduce and/or prove new things about that domain. Metamathematics is the fascinating study of what the limits are on math itself: the intrinsic capabilities of our formal models, proofs, axiomatic systems, and representations of rules, information, and computation.

One great thing that soon falls by the wayside is notation. Mathematical notation is the biggest turn-off to outsiders. Even if you’re familiar with summations, integrals, polynomials, exponents, etc., if you see a thick nest of them your inclination is probably to skip right over that sucker as one atomic operation.

However, by surveying math, trying to figure out what problems people have been trying to solve (and which of these might actually prove useful to you someday), you’ll start seeing patterns in the notation, and it’ll stop being so alien-looking. For instance, a summation sign (capital-sigma) or product sign (capital-pi) will look scary at first, even if you know the basics. But if you’re a programmer, you’ll soon realize it’s just a loop: one that sums values, one that multiplies them. Integration is just a summation over a continuous section of a curve, so that won’t stay scary for very long, either.
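If the loop analogy helps, here it is spelled out in Python: capital-sigma is a loop that accumulates with +, and capital-pi is the same loop accumulating with *.

```python
# Sigma: sum of i for i = 1 to 10, written as a plain loop.
total = 0
for i in range(1, 11):
    total += i       # the summation sign accumulates with +

# Pi: product of i for i = 1 to 10 (i.e., 10 factorial).
product = 1
for i in range(1, 11):
    product *= i     # the product sign accumulates with *

print(total)    # 55
print(product)  # 3628800
```

Once you see the notation this way, a nested sigma is just a nested loop, and the indices under the sign are just the loop variables.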

Once you’re comfortable with the many branches of math, and the many different forms of notation, you’re well on your way to knowing a lot of useful math. Because it won’t be scary anymore, and next time you see a math problem, it’ll jump right out at you. “Hey,” you’ll think, “I recognize that. That’s a multiplication sign!”

And then you should pull out the calculator. It might be a very fancy calculator such as R, Matlab, Mathematica, or even a C library for support vector machines. But almost all useful math is heavily automatable, so you might as well get some automated servants to help you with it.

When Are Exercises Useful?

After a year of doing part-time hobbyist catch-up math, you’re going to be able to do a lot more math in your head, even if you never touch a pencil to paper. For instance, you’ll see polynomials all the time, so eventually you’ll pick up on the arithmetic of polynomials by osmosis. Same with logarithms, roots, transcendentals, and other fundamental mathematical representations that appear nearly everywhere.

I’m still getting a feel for how many exercises I want to work through by hand. I’m finding that I like to be able to follow explanations (proofs) using a kind of “plausibility test” — for instance, if I see someone dividing two polynomials, I kinda know what form the result should take, and if their result looks more or less right, then I’ll take their word for it. But if I see the explanation doing something that I’ve never heard of, or that seems wrong or impossible, then I’ll dig in some more.

That’s a lot like reading programming-language source code, isn’t it? You don’t need to hand-simulate the entire program state as you read someone’s code; if you know what approximate shape the computation will take, you can simply check that their result makes sense. E.g. if the result should be a list, and they’re returning a scalar, maybe you should dig in a little more. But normally you can scan source code almost at the speed you’d read English text (sometimes just as fast), and you’ll feel confident that you understand the overall shape and that you’ll probably spot any truly egregious errors.

I think that’s how mathematically-inclined people (mathematicians and hobbyists) read math papers, or any old papers containing a lot of math. They do the same sort of sanity checks you’d do when reading code, but no more, unless they’re intent on shooting the author down.

With that said, I still occasionally do math exercises. If something comes up again and again (like algebra and linear algebra), then I’ll start doing some exercises to make sure I really understand it.

But I’d stress this: don’t let exercises put you off the math. If an exercise (or even a particular article or chapter) is starting to bore you, move on. Jump around as much as you need to. Let your intuition guide you. You’ll learn much, much faster doing it that way, and your confidence will grow almost every day.

How Will This Help Me?

Well, it might not — not right away. Certainly it will improve your logical reasoning ability; it’s a bit like doing exercise at the gym, and your overall mental fitness will get better if you’re pushing yourself a little every day.

For me, I’ve noticed that a few domains I’ve always been interested in (including artificial intelligence, machine learning, natural language processing, and pattern recognition) use a lot of math. And as I’ve dug in more deeply, I’ve found that the math they use is no more difficult than the sum total of the math I learned in high school; it’s just different math, for the most part. It’s not harder. And learning it is enabling me to code (or use in my own code) neural networks, genetic algorithms, bayesian classifiers, clustering algorithms, image matching, and other nifty things that will result in cool applications I can show off to my friends.

And I’ve gradually gotten to the point where I no longer break out in a cold sweat when someone presents me with an article containing math notation: n-choose-k, differentials, matrices, determinants, infinite series, etc. The notation is actually there to make it easier, but (like programming-language syntax) notation is always a bit tricky and daunting on first contact. Nowadays I can follow it better, and it no longer makes me feel like a plebeian when I don’t know it. Because I know I can figure it out.

And that’s a good thing.

And I’ll keep getting better at this. I have lots of years left, and lots of books, and articles. Sometimes I’ll spend a whole weekend reading a math book, and sometimes I’ll go for weeks without thinking about it even once. But like any hobby, if you simply trust that it will be interesting, and that it’ll get easier with time, you can apply it as often or as little as you like and still get value out of it.

Math every day. What a great idea that turned out to be!


How did WordPress win?



When we are passionate about something, it is sometimes hard for us to wrap our heads around why someone else might not be passionate about the same thing. You see this in the WordPress community often – fans and users of WordPress are often flabbergasted that someone might choose something else. Why would anyone choose Movable Type for instance?

Believe it or not, members of the Movable Type community often wonder the same thing. Most recently someone in the ProNet community, frustrated by their experience with WordPress, asked the question: “How on Earth did WordPress win the battle over Movable Type?” The question was rhetorical, but it sparked a very interesting dialog in our community.

In the past I have refrained from answering such questions, or if I did, I would not respond publicly, for reasons I can only attribute to a mentality that was beaten into me while I worked at Six Apart:

“Byrne you are a leader in the community, and your words carry significant weight. Therefore be very, very careful what you say. Very careful. Don’t do or say anything to jeopardize the company’s product line. In fact, if you want to say anything, why don’t you run it by me first? And Anil, and through marketing, and while you are at it through a couple other people as well… Cool?”

This time however, “to hell with it” I say. Let’s talk about this. Let’s see what lessons can be learned from WordPress so that others seeking to build a successful product can learn from it.

Why did WordPress win the Blogging Battle?

This is not the first time this question has been posed, obviously. And in all the times people have sought an answer to this question, the answers are remarkably consistent. They are:

  • Movable Type’s licensing fiasco in 2004 angered the community and drove users to WordPress.
  • Movable Type is not open source. WordPress is.
  • Movable Type is written in Perl, while WordPress is written in PHP.

These answers are of course all correct to an extent, but do not account for WordPress’ success by themselves. Not by a long shot. The truth is that WordPress won for a whole host of reasons, including the fact that WordPress has more themes, more plugins, and a larger community. These too are important considerations, but they are by-products of its success, not the reasons for its success.

So let’s break it down shall we? Let’s talk about the commonly cited reasons for WordPress’ success, and some less well known reasons as well.

Movable Type’s Licensing Fiasco, and WordPress is Open Source

When Movable Type changed its license in 2004, it proved to be a significant turning point for WordPress. Yes, the change angered a lot of people and led a lot of loyal Movable Type users to switch to WordPress. More important, however, is that it gave WordPress the opportunity to change the nature of the debate, and let it very compellingly espouse the superiority of free over all else, even superior design, superior feature sets, and superior support.

What resonated with customers first and foremost however was not WordPress’ license, but the fact that it was unambiguously free. Back then no one knew that much about open source, much less the GPL, but what they did know was all that mattered: open source means free. Period. Forever.

The fact that Movable Type was in all reality free for the vast majority of people using it was irrelevant, because it was never clear when Movable Type was free and when it was not. And what users feared most of all was a repeat of exactly what happened the day Movable Type announced its licensing change: one day waking up to the realization that you owe some company hundreds, if not thousands of dollars, and not being able to afford or justify the cost monetarily or on principle.

WordPress is easy to install

The fact that WordPress has always been easy to install, especially when compared to Movable Type, has always played a significant role in its growth and adoption rate. Technically, the reasons behind WordPress’ famed 5-minute install can be attributed largely to PHP’s deployment model, which was architected specifically to address the challenges associated with running and hosting web applications based on CGI, or in effect Perl – the Internet’s first practical web programming language.

Furthermore, every web host likes to configure CGI differently on their web server, which led to a lot of confusion and frustration for a lot of users, and prevented anyone from authoring a simple and canonical installation guide for all Movable Type users across all web hosts.

One cannot overstate how important one’s installation experience with a piece of software is, because it frames every subsequent experience and impression they have of the product. So while blogging was exploding and people were weighing their options between Movable Type and WordPress, it’s no wonder that more and more people chose WordPress, even though it had fewer features and an inferior design. Fewer people gave up trying to install it.

WordPress is written in PHP

Unfortunately it is impossible to avoid the Perl vs. PHP debate when it comes to WordPress and Movable Type, and it doesn’t matter that cogent, compelling arguments can be made that Perl knowledge has never been required, not once, not ever, to build a web site using Movable Type. People simply feel more comfortable working with PHP. And even though the vast majority of people never have had and never will have any need to hack the source code of their CMS, they are still comforted knowing that they could if they had to. People just never had that kind of comfort level with Perl and, by association, Movable Type. Perl is simply too scary.

That being said, the fact that people feel more comfortable hacking PHP did, and still does to this day, contribute significantly to the number of plugins and themes available for WordPress, simply because the pool of people who possess the bare minimum of knowledge necessary to write a plugin is so much larger.

Which leads me to another, and arguably more important reason why WordPress has been so successful: corporate adoption. If you are going to build your company on top of or rely heavily upon a CMS, and you are going to hire engineers internally to help you maintain it, which is an easier and cheaper job req to fill? A Perl engineer or a PHP engineer? Dollars to donuts, the answer is almost certainly PHP. Furthermore, if you know how companies often select the software they use, then you know that companies most frequently use the software their team members are most familiar with. And as more and more people started using WordPress at home, more and more people began recommending it to their bosses at work. And eventually, even though Movable Type dominated the Enterprise sector for so long, provided far superior support, and had a lock on the features Enterprises so often require (Oracle, SQL Server, LDAP support for instance) eventually Movable Type lost mindshare behind the firewall.

WordPress has a huge community

All of the factors above contributed in the long run to what ended up being WordPress’ single most important asset: its community. But its community was not born simply out of having a lot of users. Its community, and ultimately WordPress’ success, was born out of a steady stream of people who began to rely upon WordPress as their primary, if not exclusive, source of income. A healthy economy around WordPress consulting and professional services ultimately gave rise to “Premium Themes.” And once people began to demonstrate that there was a viable business model in selling themes, the theme market exploded. Now it is almost impossible to rival the selection of themes available on the platform, not to mention how easy it is for the average person to get started with a cheap, good-looking web site.

As more and more people began making money using and building for the platform, though, and as more and more people began thinking about, living in, and becoming invested in the platform, there was an ever-increasing incentive for them to contribute back to it. Now, the great irony is that even for all of WordPress’ open sourcey, socialist, hippy goodness, it is the competition driven by the capitalist free market that drives much of WordPress’ innovation today.

Forces beyond anyone’s control

What’s fascinating of course is that all of the above are things that happened outside the control of any one person or company. For example, WordPress never chose its license, or the language it was written in.

That being said, there were also a number of tactics employed by Automattic and mistakes made by Six Apart, that collectively had an equal role to play in the fate of their respective platforms.

The Cult of WordPress

One thing that I personally feel mars an otherwise untarnished product is the fact that WordPress’ leadership and community chose to define itself early on not upon its own strengths, but upon the mistakes made by a young and inexperienced pair of entrepreneurs. WordPress defined itself not as a superior product on its own merits, but as the underdog. It succeeded by vilifying Six Apart, by casting doubt on Six Apart’s integrity, and by constantly stoking the fires left over from Movable Type’s licensing fiasco. Never, for example, have I seen a WordPress user work to establish a more positive and constructive tone when it comes to its competition.

This general lack of civility, much more apparent early on in WordPress’ life, contributed to an underlying sense that WordPress was the best and everything else sucked. This state of mind, love it or hate it, served WordPress greatly, because wars, even a meaningless “blogging war,” are only successfully fought when everyone knows who their enemy is. And Six Apart was not just a worthy competitor, it was the perfect enemy.

Automattic’s Switch Campaign

One thing rarely cited by the outside world, probably because it was not visible or apparent to anyone, was the systematic targeting of high-profile brands to get them to switch from any competing platform to WordPress. In fact, in the four years I was at Six Apart, if I had had a dollar for every time a significant and loyal TypePad or Movable Type customer confided in me that an employee of Automattic had cold-called them to encourage and entice them to switch to WordPress, I would have quit a rich man. Automattic would extend whatever services it could, at no expense to the customer, to get them to switch. They would give away hosting services. They would freely dedicate engineers to the task of migrating customers’ data from one system to another. They would do whatever it took to move people to WordPress.

And once a migration was complete, they did the single most important thing: they blogged the hell out of it. They told the story of how yet another customer had switched from Movable Type or TypePad to WordPress. They very smartly never let the sense that the world was switching to WordPress dissipate, even as TypePad and Movable Type were growing in users and revenue quarter after quarter.

Granted, no one switched to WordPress against their will. Simply put, though, Six Apart was just not as effective at giving people a reason to stay as Automattic was at taking away every reason a person had for sticking with their current platform.

Six Apart’s Purchase of Apperceptive

Even as Movable Type’s community started to become small in comparison to WordPress’, its community was still just as competitive. Its community was strong for the same reason that WordPress’ was – it consisted of a number of very bright, and exceedingly dedicated community members who were as invested in their respective trades as Six Apart and Automattic ever were.

Then Six Apart purchased Apperceptive. It was a great business move from a revenue standpoint, but the consequences for the community were devastating in the long run. Here’s why:

Six Apart’s purchase of Apperceptive was successful by all measures and accounts. Business increased, enterprises flocked to the platform, and Movable Type grew at an even faster clip. In order to meet the demands of the new business, though, Six Apart began to hire the smartest and most innovative members of its community into its professional services team. Once they were hired, all of the awesome work they were doing got swallowed by the increasingly closed and proprietary Six Apart professional services ecosystem.

What’s worse is the fact that Six Apart sapped its community of its greatest leaders and contributors. Slowly, over time, the number of truly capable professional service providers got whittled down to a very small list. Six Apart, without knowing it or doing it on purpose, created a monopoly. Customers coming to the platform looking for an alternative to Six Apart for their professional services needs found only a handful of independent contractors, contributing to the sense that Movable Type’s community was too small to support them.

Six Apart’s Failure

Finally, I will add one more contributing factor to WordPress’ success: Six Apart’s failure. The reasons behind its ultimate failure as a product company are many, complex, and in many cases very nuanced. But the general consensus is apt: Six Apart severely hampered its own ability to compete effectively by spreading its many exceptionally talented people across too many products.

In short, Six Apart lacked focus.

If Six Apart, early on, had made the decision to put all of its resources behind a single product and codebase, TypePad or Movable Type for example, then I think the blogging landscape would be a fundamentally different place today. WordPress would undoubtedly still be popular, but it might still have a very potent adversary and competitor helping to drive innovation and the technology behind blogging.

Who won the war?

It is pointless to refute that WordPress came out on top. But I personally find the conceit of a “war” to be a faulty premise. The “war” between WordPress and Movable Type was either manufactured or the natural by-product of a rivalry that two communities had come to define themselves by. It is a mentality I find fundamentally poisonous to all who engage in it, because it promotes the idea that one platform is inherently better than another. The truth, of course, is that each platform, be it WordPress, Drupal, Expression Engine, Movable Type, Simple CMS, TypePad, Twitter, Instagram, Tumblr, or what have you, does different things uniquely well. That is why I prefer a perspective that embraces and recognizes each platform for its strengths, and never denigrates those who have made a personal decision to choose one platform over another.

All this being said, no doubt people will always press the question: is WordPress’ success evidence that it really is a better product? The answer to that is a no-brainer to those who have already made up their mind.

For my part, I still maintain that Movable Type is a successful and, yes, even a great product. After all, it continues to support me, not to mention many of my friends and their families. It also supports a very successful and profitable company and ecosystem in Japan, not to mention hundreds upon hundreds of people worldwide. Plus, who can ignore the fact that Movable Type still powers much of the web today, and is in use by some of the largest and most influential media properties on the planet?

For those reasons, and a whole host of others, both personal and technical, I choose Movable Type, and of course Melody. And I would choose it again and again and again given the opportunity. But that is me.

1 To the best of my knowledge, not once did Six Apart ever police or enforce its license. From the day Ben and Mena started collecting donations to fund the development of Movable Type, Six Apart relied exclusively upon the honor system when it came to collecting payments for people’s usage of the platform. One story in particular exemplifies Six Apart’s attitude towards its very own license, a story that can only be described as legend within the walls of Six Apart: the Huffington Post, the poster child of Movable Type, never actually paid for its license to use the software. To this day, even as the Huffington Post is sold for over $350,000,000, its success can be attributed to the effectively free platform it built a business on.

2 What makes Movable Type hard to install actually has nothing to do with Perl at all. It has to do with CGI. CGI was originally architected to allow any script to be run and invoked via an addressable URL, and when that capability was first introduced, system administrators and programmers feared the security ramifications of allowing any arbitrary script to be executed in that fashion. Therefore, they instituted a number of limitations enforced by the web server: 1) only certain directories on your web server can possess the ability to run CGI scripts, 2) only executable files can be invoked via CGI, and 3) no static files (HTML, CSS, JavaScript, or any text file) can be served from the same directory as a CGI script. These limitations are often inappropriately attributed to Perl only because Perl became the dominant, if not the only, scripting language used to author CGI-based web applications early on.

Disclaimer: Byrne Reese is the former Product Manager of Movable Type and TypePad and worked at Six Apart from 2004 to 2008. Byrne Reese is now a Partner at Endevver, LLC, a premiere Movable Type and Melody consulting company, as well as the chairman and a leading contributor to Melody, a fork of the Movable Type platform.


Should I buy a Fast SSD or more memory?


April 8, 2010, by Vadim Tkachenko

While a scale-out solution has traditionally been popular for MySQL, it’s interesting to see what room we now have to scale up: cheap memory, fast storage, better power efficiency. There certainly are a lot of options now; I’ve been meeting with about one customer a week using Fusion-IO cards. One interesting choice I’ve seen people make, however, is buying an SSD while they still have a lot of pages read per second. I would have preferred they buy memory instead, and use the storage device for writes.

Here’s the benchmark I came up with to test whether this is the case:

  • Percona-XtraDB-9.1 release
  • Sysbench OLTP workload with 80 million rows (about 18GB worth of data+indexes)
  • XFS Filesystem mounted with nobarrier option.
  • Tests run with:
    • RAID10 with BBU over 8 disks
    • Intel SSD X25-E 32GB
    • FusionIO 320GB MLC
  • For each test, run with a buffer pool of between 2G and 22G (to test performance compared to memory fit).
  • Hardware was our Dell 900 (specs here).

To start with, we have a test on the RAID10 storage to establish a baseline.  The Y axis is transactions/second (more is better), the X axis is the size of innodb_buffer_pool_size:

Let me point out three interesting characteristics about this benchmark:

  • The A arrow is where the data fits completely in the buffer pool (best performance). It’s important to point out that once you hit this point, a further increase in memory does not improve performance at all.
  • The B arrow is where the data just starts to exceed the size of the buffer pool.  This is the most painful point for many customers, because while the buffer pool fell short by only ~10%, performance dropped by 2.6 times!  In production this usually matches the description of “Last week everything was fine… but it’s just getting slower and slower!”.  I would suggest that adding memory is by far the best thing to do here.
  • The C arrow shows where data is approximately three times the buffer pool.  This is an interesting point to zoom in on – since you may not be able to justify the cost of the memory, but an SSD might be a good fit:

Where the C arrow was, in this graph a Fusion-IO card improves performance by about five times (or 2x with an Intel SSD).  To get the same improvement with memory, you would have needed to add 60% more memory for the 2x improvement, or 260% more memory for a 5x improvement.  Imagine a situation where your C point is when you have 32GB of RAM and 100GB of data.  Then it gets interesting:

  • Can you easily add another 32G RAM (are your memory slots already filled?)
  • Does your budget allow you to install SSD cards? (You may still need more than one, since they are all relatively small.  There are already appliances on the market which use 8 Intel SSD devices.)
  • Is a 2x or 5x improvement enough?  There are more wins to be had if you can afford to buy all the memory that is required.

The workload here is designed to keep as much of the data hot as possible, but I guess the main lesson here is not to underestimate the size of your “active set” of data.  For some people who just append data to some sort of logging table, the active set may be only a small percentage – but in other cases it can be considerably higher.  If you don’t know what your working set is – ask us!

Important note: This graph and these results are valid only for the uniform sysbench distribution. In your particular workload, points B and C may be located differently.

Raw results:

All values are transactions/second.

Buffer pool (GB)   FusionIO   Intel SSD   RAID 10
              2      450.3      186.33      80.67
              4      538.19     230.35      99.73
              6      608.15     268.18     121.71
              8      679.44     324.03     201.74
             10      769.44     407.56     252.84
             12      855.89     511.49     324.38
             14      976.74     664.38     429.15
             16     1127.23     836.17     579.29
             18     1471.98    1236.9      934.78
             20     2536.16    2485.63    2486.88
             22     2433.13    2492.06    2448.88
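
The 5x (Fusion-IO) and 2x (Intel SSD) figures quoted for point C can be recomputed directly from the raw results. With an ~18GB dataset, point C (data at roughly three times the buffer pool) corresponds to the 6GB row. A quick sketch:

```python
# Transactions/sec from the 6GB buffer pool row of the raw results,
# where the ~18GB dataset is about three times the buffer pool (point C).
fusionio, intel_ssd, raid10 = 608.15, 268.18, 121.71

fusionio_speedup = fusionio / raid10  # vs. the RAID10 baseline
intel_speedup = intel_ssd / raid10

print(f"FusionIO: {fusionio_speedup:.1f}x, Intel SSD: {intel_speedup:.1f}x")
# FusionIO: 5.0x, Intel SSD: 2.2x
```
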

The NoSQL Ecosystem


Chinese translation: http://blog.nosqlfan.com/html/2171.html

The Architecture of Open Source Applications

Amy Brown and Greg Wilson (eds.)
ISBN 978-1-257-63801-7

Chapter 13. The NoSQL Ecosystem

Adam Marcus


Unlike most of the other projects in this book, NoSQL is not a tool, but an ecosystem composed of several complementary and competing tools. The tools branded with the NoSQL moniker provide an alternative to SQL-based relational database systems for storing data. To understand NoSQL, we have to understand the space of available tools, and see how the design of each one explores the space of data storage possibilities.


If you are considering using a NoSQL storage system, you should first understand the wide space of options that NoSQL systems span. NoSQL systems do away with many of the traditional comforts of relational database systems, and operations which were typically encapsulated behind the system boundary of a database are now left to application designers. This puts you in the role of a systems architect, which demands a more in-depth understanding of how such systems are built.


13.1. What’s in a Name?

In defining the space of NoSQL, let’s first take a stab at defining the name. Taken literally, a NoSQL system presents a query interface to the user that is not SQL. The NoSQL community generally takes a more inclusive view, suggesting that NoSQL systems provide alternatives to traditional relational databases, and allow developers to design projects which use Not Only a SQL interface. In some cases, you might replace a relational database with a NoSQL alternative, and in others you will employ a mix-and-match approach to different problems you encounter in application development.

Before diving into the world of NoSQL, let’s explore the cases where SQL and the relational model suit your needs, and others where a NoSQL system might be a better fit.

13.1.1. SQL and the Relational Model

SQL is a declarative language for querying data. A declarative language is one in which a programmer specifies what they want the system to do, rather than procedurally defining how the system should do it. A few examples include: find the record for employee 39, project out only the employee name and phone number from their entire record, filter employee records to those that work in accounting, count the employees in each department, or join the data from the employees table with the managers table.

To a first approximation, SQL allows you to ask these questions without thinking about how the data is laid out on disk, which indices to use to access the data, or what algorithms to use to process the data. A significant architectural component of most relational databases is a query optimizer, which decides which of the many logically equivalent query plans to execute to most quickly answer a query. These optimizers are often better than the average database user, but sometimes they do not have enough information or have too simple a model of the system in order to generate the most efficient execution.

Relational databases, which are the most common databases used in practice, follow the relational data model. In this model, different real-world entities are stored in different tables. For example, all employees might be stored in an Employees table, and all departments might be stored in a Departments table. Each row of a table has various properties stored in columns. For example, employees might have an employee id, salary, birth date, and first/last names. Each of these properties will be stored in a column of the Employees table.

The relational model goes hand-in-hand with SQL. Simple SQL queries, such as filters, retrieve all records whose field matches some test (e.g., employeeid = 3, or salary > $20000). More complex constructs cause the database to do some extra work, such as joining data from multiple tables (e.g., what is the name of the department in which employee 3 works?). Other complex constructs such as aggregates (e.g., what is the average salary of my employees?) can lead to full-table scans.
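
These query types are easy to see in miniature with Python’s built-in sqlite3 module; the schema and data below are invented to mirror the Employees/Departments example, not taken from the chapter:

```python
import sqlite3

# In-memory database with an illustrative Employees/Departments schema.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, dept_id INTEGER
    );
    INSERT INTO departments VALUES (1, 'Accounting'), (2, 'Engineering');
    INSERT INTO employees VALUES
        (3, 'Alice', 50000, 1), (4, 'Bob', 30000, 2), (5, 'Carol', 40000, 2);
""")

# Filter: records whose field matches a test (salary > $35000).
rows = db.execute("SELECT name FROM employees WHERE salary > 35000").fetchall()

# Join: in which department does employee 3 work?
dept = db.execute("""
    SELECT d.name FROM employees e
    JOIN departments d ON e.dept_id = d.id WHERE e.id = 3
""").fetchone()

# Aggregate: average salary (this can require a full-table scan).
avg = db.execute("SELECT AVG(salary) FROM employees").fetchone()
```

Note that none of these queries says anything about indexes, on-disk layout, or algorithms; the query optimizer decides all of that.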

The relational data model defines highly structured entities with strict relationships between them. Querying this model with SQL allows complex data traversals without too much custom development. The complexity of such modeling and querying has its limits, though:

  • Complexity leads to unpredictability. SQL’s expressiveness makes it challenging to reason about the cost of each query, and thus the cost of a workload. While simpler query languages might complicate application logic, they make it easier to provision data storage systems, which only respond to simple requests.
  • There are many ways to model a problem. The relational data model is strict: the schema assigned to each table specifies the data in each row. If we are storing less structured data, or rows with more variance in the columns they store, the relational model may be needlessly restrictive. Similarly, application developers might not find the relational model perfect for modeling every kind of data. For example, a lot of application logic is written in object-oriented languages and includes high-level concepts such as lists, queues, and sets, and some programmers would like their persistence layer to model this.
  • If the data grows past the capacity of one server, then the tables in the database will have to be partitioned across computers. To avoid JOINs having to cross the network in order to get data in different tables, we will have to denormalize it. Denormalization stores all of the data from different tables that one might want to look up at once in a single place. This makes our database look like a key-lookup storage system, leaving us wondering what other data models might better suit the data.
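
In miniature (with invented fields), denormalization trades duplication for single-lookup access:

```python
# Normalized: department data lives under its own key, so finding an
# employee's department name takes two lookups (a JOIN, in effect).
departments = {20: {"name": "Accounting", "floor": 3}}
employee = {"id": 30, "name": "Alice", "dept_id": 20}
dept_name_via_join = departments[employee["dept_id"]]["name"]

# Denormalized: everything one might want to look up at once is stored
# together, so a single key lookup suffices, at the cost of duplicating
# the department data in every employee record.
employee_denorm = {
    "id": 30,
    "name": "Alice",
    "department": {"id": 20, "name": "Accounting", "floor": 3},
}
dept_name_inline = employee_denorm["department"]["name"]
```
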

It’s generally not wise to discard many years of design considerations arbitrarily. When you consider storing your data in a database, consider SQL and the relational model, which are backed by decades of research and development, offer rich modeling capabilities, and provide easy-to-understand guarantees about complex operations. NoSQL is a good option when you have a specific problem, such as large amounts of data, a massive workload, or a difficult data modeling decision for which SQL and relational databases might not have been optimized.

13.1.2. NoSQL Inspirations

The NoSQL movement finds much of its inspiration in papers from the research community. While many papers are at the core of design decisions in NoSQL systems, two stand out in particular.

Google’s BigTable [CDG+06] presents an interesting data model, which facilitates sorted storage of multi-column historical data. Data is distributed to multiple servers using a hierarchical range-based partitioning scheme, and data is updated with strict consistency (a concept that we will eventually define in Section 13.5).

Amazon’s Dynamo [DHJ+07] uses a different key-oriented distributed datastore. Dynamo’s data model is simpler, mapping keys to application-specific blobs of data. The partitioning model is more resilient to failure, but accomplishes that goal through a looser data consistency approach called eventual consistency.

We will dig into each of these concepts in more detail, but it is important to understand that many of them can be mixed and matched. Some NoSQL systems, such as HBase1, stick closely to the BigTable design. Another NoSQL system, Voldemort2, replicates many of Dynamo’s features. Still other NoSQL projects, such as Cassandra3, have taken some features from BigTable (its data model) and others from Dynamo (its partitioning and consistency schemes).

13.1.3. Characteristics and Considerations

NoSQL systems part ways with the hefty SQL standard and offer simpler but piecemeal solutions for architecting storage. These systems were built with the belief that, by simplifying how a database operates over data, an architect can better predict the performance of a query. In many NoSQL systems, complex query logic is left to the application, resulting in a data store with more predictable query performance because of the lack of variability in queries.

NoSQL systems part with more than just declarative queries over the relational data. Transactional semantics, consistency, and durability are guarantees that organizations such as banks demand of databases. Transactions provide an all-or-nothing guarantee when combining several potentially complex operations into one, such as deducting money from one account and adding the money to another. Consistency ensures that when a value is updated, subsequent queries will see the updated value. Durability guarantees that once a value is updated, it will be written to stable storage (such as a hard drive) and recoverable if the database crashes.

NoSQL systems relax some of these guarantees, a decision which, for many non-banking applications, can provide acceptable and predictable behavior in exchange for improved performance. These relaxations, combined with data model and query language changes, often make it easier to safely partition a database across multiple machines when the data grows beyond a single machine’s capability.

NoSQL systems are still very much in their infancy. The architectural decisions that go into the systems described in this chapter are a testament to the requirements of various users. The biggest challenge in summarizing the architectural features of several open source projects is that each one is a moving target. Keep in mind that the details of individual systems will change. When you pick between NoSQL systems, you can use this chapter to guide your thought process, but not your feature-by-feature product selection.

As you think about NoSQL systems, here is a roadmap of considerations:

  • Data and query model: Is your data represented as rows, objects, data structures, or documents? Can you ask the database to calculate aggregates over multiple records?
  • Durability: When you change a value, does it immediately go to stable storage? Does it get stored on multiple machines in case one crashes?
  • Scalability: Does your data fit on a single server? Do the amount of reads and writes require multiple disks to handle the workload?
  • Partitioning: For scalability, availability, or durability reasons, does the data need to live on multiple servers? How do you know which record is on which server?
  • Consistency: If you’ve partitioned and replicated your records across multiple servers, how do the servers coordinate when a record changes?
  • Transactional semantics: When you run a series of operations, some databases allow you to wrap them in a transaction, which provides some subset of ACID (Atomicity, Consistency, Isolation, and Durability) guarantees on the transaction and all others currently running. Does your business logic require these guarantees, which often come with performance tradeoffs?
  • Single-server performance: If you want to safely store data on disk, what on-disk data structures are best-geared toward read-heavy or write-heavy workloads? Is writing to disk your bottleneck?
  • Analytical workloads: We’re going to pay a lot of attention to lookup-heavy workloads of the kind you need to run a responsive user-focused web application. In many cases, you will want to build dataset-sized reports, aggregating statistics across multiple users for example. Does your use-case and toolchain require such functionality?

While we will touch on all of these considerations, the last three, while equally important, receive the least attention in this chapter.


13.2. NoSQL Data and Query Models

The data model of a database specifies how data is logically organized. Its query model dictates how the data can be retrieved and updated. Common data models are the relational model, key-oriented storage model, or various graph models. Query languages you might have heard of include SQL, key lookups, and MapReduce. NoSQL systems combine different data and query models, resulting in different architectural considerations.

13.2.1. Key-based NoSQL Data Models

NoSQL systems often part with the relational model and the full expressivity of SQL by restricting lookups on a dataset to a single field. For example, even if an employee has many properties, you might only be able to retrieve an employee by her ID. As a result, most queries in NoSQL systems are key lookup-based. The programmer selects a key to identify each data item, and can, for the most part, only retrieve items by performing a lookup for their key in the database.

In key lookup-based systems, complex join operations or multiple-key retrieval of the same data might require creative uses of key names. A programmer wishing to look up an employee by his employee ID and to look up all employees in a department might create two key types. For example, the key employee:30 would point to an employee record for employee ID 30, and employee_departments:20 might contain a list of all employees in department 20. A join operation gets pushed into application logic: to retrieve employees in department 20, an application first retrieves a list of employee IDs from key employee_departments:20, and then loops over key lookups for each employee:ID in the employee list.
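
The two key types and the application-side join can be sketched with a plain dict standing in for the key-value database (the helper function and the sample data are illustrative):

```python
# The key-naming scheme from the text: employee:ID records, plus an
# employee_departments:ID key holding the member list for a department.
db = {
    "employee:30": {"name": "Alice"},
    "employee:31": {"name": "Bob"},
    "employee_departments:20": [30, 31],  # employee IDs in department 20
}

def employees_in_department(dept_id):
    # The "join" lives in application code: one lookup for the ID list,
    # then one key lookup per employee in that list.
    ids = db[f"employee_departments:{dept_id}"]
    return [db[f"employee:{i}"] for i in ids]
```
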

The key lookup model is beneficial because it means that the database has a consistent query pattern—the entire workload consists of key lookups whose performance is relatively uniform and predictable. Profiling to find the slow parts of an application is simpler, since all complex operations reside in the application code. On the flip side, the data model logic and business logic are now more closely intertwined, which muddles abstraction.

Let’s quickly touch on the data associated with each key. Various NoSQL systems offer different solutions in this space.

Key-Value Stores

The simplest form of NoSQL store is a key-value store. Each key is mapped to a value containing arbitrary data. The NoSQL store has no knowledge of the contents of its payload, and simply delivers the data to the application. In our Employee database example, one might map the key employee:30 to a blob containing JSON or a binary format such as Protocol Buffers4, Thrift5, or Avro6 in order to encapsulate the information about employee 30.

If a developer uses structured formats to store complex data for a key, she must operate against the data in application space: a key-value data store generally offers no mechanisms for querying for keys based on some property of their values. Key-value stores shine in the simplicity of their query model, usually consisting of set, get, and delete primitives, but discard the ability to add simple in-database filtering capabilities due to the opacity of their values. Voldemort, which is based on Amazon’s Dynamo, provides a distributed key-value store. BDB7 offers a persistence library that has a key-value interface.
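
A toy illustration of those primitives, with the payload serialized as JSON by the application (the class and its API are invented for illustration, not taken from any particular store):

```python
import json

# A toy key-value store: set/get/delete, value treated as an opaque blob.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def set(self, key, blob):
        self._data[key] = blob

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
# The application, not the store, picks the payload format (JSON here);
# any filtering on, say, salary would also have to happen in the
# application, since the store never inspects the blob.
store.set("employee:30", json.dumps({"name": "Alice", "salary": 50000}))
record = json.loads(store.get("employee:30"))
```
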

Key-Data Structure Stores

Key-data structure stores, made popular by Redis8, assign each value a type. In Redis, the available types a value can take on are integer, string, list, set, and sorted set. In addition to set/get/delete, type-specific commands, such as increment/decrement for integers, or push/pop for lists, add functionality to the query model without drastically affecting performance characteristics of requests. By providing simple type-specific functionality while avoiding multi-key operations such as aggregation or joins, Redis balances functionality and performance.
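
A pure-Python imitation of such type-specific commands (Redis itself is a networked server with its own protocol; the function names below mirror Redis’s INCR, RPUSH, and LPOP, but this is only an in-memory sketch):

```python
# In-memory imitation of Redis-style typed values and commands.
store = {}

def incr(key):
    store[key] = store.get(key, 0) + 1       # integer-specific command
    return store[key]

def rpush(key, value):
    store.setdefault(key, []).append(value)  # list-specific command
    return len(store[key])

def lpop(key):
    return store[key].pop(0)

incr("pageviews")
incr("pageviews")        # pageviews is now 2
rpush("jobs", "job-1")
rpush("jobs", "job-2")
first = lpop("jobs")     # takes "job-1"; "job-2" remains queued
```

Each command touches a single key and does a small, predictable amount of work, which is how the query model gains functionality without losing its performance profile.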

Key-Document Stores

Key-document stores, such as CouchDB9, MongoDB10, and Riak11, map a key to some document that contains structured information. These systems store documents in a JSON or JSON-like format. They store lists and dictionaries, which can be embedded recursively inside one-another.

MongoDB separates the keyspace into collections, so that keys for Employees and Departments, for example, do not collide. CouchDB and Riak leave type-tracking to the developer. The freedom and complexity of document stores is a double-edged sword: application developers have a lot of freedom in modeling their documents, but application-based query logic can become exceedingly complex.

BigTable Column Family Stores

HBase and Cassandra base their data model on the one used by Google’s BigTable. In this model, a key identifies a row, which contains data stored in one or more Column Families (CFs). Within a CF, each row can contain multiple columns. The values within each column are timestamped, so that several versions of a row-column mapping can live within a CF.

Conceptually, one can think of Column Families as storing complex keys of the form (row ID, CF, column, timestamp), mapping to values which are sorted by their keys. This design results in data modeling decisions which push a lot of functionality into the keyspace. It is particularly good at modeling historical data with timestamps. The model naturally supports sparse column placement since row IDs that do not have certain columns do not need an explicit NULL value for those columns. On the flip side, columns which have few or no NULL values must still store the column identifier with each row, which leads to greater space consumption.
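
That conceptual (row ID, CF, column, timestamp) key can be sketched as a sorted list of tuples; negating the timestamp makes the newest version of a cell sort first (the data and helper functions are illustrative):

```python
import bisect

# Sorted list of ((row, cf, column, -timestamp), value) entries, so that
# all versions of a cell sit together, newest first.
cells = []

def put(row, cf, column, timestamp, value):
    bisect.insort(cells, ((row, cf, column, -timestamp), value))

def latest(row, cf, column):
    # The first matching entry is the newest version of that cell.
    for (r, f, c, _ts), value in cells:
        if (r, f, c) == (row, cf, column):
            return value
    return None  # sparse: absent columns store nothing, not a NULL

put("employee:30", "info", "salary", 1, 50000)
put("employee:30", "info", "salary", 2, 55000)  # a newer version
```
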

Each project’s data model differs from the original BigTable model in various ways, but Cassandra’s changes are most notable. Cassandra introduces the notion of a supercolumn within each CF to allow for another level of mapping, modeling, and indexing. It also does away with the notion of locality groups, which can physically store multiple column families together for performance reasons.

13.2.2. Graph Storage

One class of NoSQL stores is the graph store. Not all data is created equal, and the relational and key-oriented data models are not the best way to store and query all of it. Graphs are a fundamental data structure in computer science, and HyperGraphDB12 and Neo4J13 are two popular NoSQL systems for storing graph-structured data. Graph stores differ from the other stores we have discussed thus far in almost every way: data models, data traversal and querying patterns, physical layout of data on disk, distribution to multiple machines, and the transactional semantics of queries. We cannot do these stark differences justice given space limitations, but you should be aware that certain classes of data may be better stored and queried as a graph.

13.2.3. Complex Queries

There are notable exceptions to key-only lookups in NoSQL systems. MongoDB allows you to index your data based on any number of properties and has a relatively high-level language for specifying which data you want to retrieve. BigTable-based systems support scanners to iterate over a column family and select particular items by a filter on a column. CouchDB allows you to create different views of the data, and to run MapReduce tasks across your table to facilitate more complex lookups and updates. Most of the systems have bindings to Hadoop or another MapReduce framework to perform dataset-scale analytical queries.

13.2.4. Transactions

NoSQL systems generally prioritize performance over transactional semantics. SQL-based systems, by contrast, allow any set of statements, from a simple primary key row retrieval to a complicated join between several tables that is subsequently averaged across several fields, to be placed in a transaction.

These SQL databases will offer ACID guarantees between transactions. Running multiple operations in a transaction is Atomic (the A in ACID), meaning all or none of the operations happen. Consistency (the C) ensures that the transaction leaves the database in a consistent, uncorrupted state. Isolation (the I) makes sure that if two transactions touch the same record, they will do so without stepping on each other’s toes. Durability (the D, covered extensively in the next section) ensures that once a transaction is committed, it’s stored in a safe place.

ACID-compliant transactions keep developers sane by making it easy to reason about the state of their data. Imagine multiple transactions, each of which has multiple steps (e.g., first check the value of a bank account, then subtract $60, then update the value). ACID-compliant databases often are limited in how they can interleave these steps while still providing a correct result across all transactions. This push for correctness results in often-unexpected performance characteristics, where a slow transaction might cause an otherwise quick one to wait in line.

Most NoSQL systems pick performance over full ACID guarantees, but do provide guarantees at the key level: two operations on the same key will be serialized, avoiding serious corruption to key-value pairs. For many applications, this decision will not pose noticeable correctness issues, and will allow quick operations to execute with more regularity. It does, however, leave more considerations for application design and correctness in the hands of the developer.

Redis is the notable exception to the no-transaction trend. On a single server, it provides a MULTI command to combine multiple operations atomically and consistently, and a WATCH command to allow isolation. Other systems provide lower-level test-and-set functionality which provides some isolation guarantees.
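The test-and-set primitive mentioned above can be sketched in a few lines. This is an illustrative in-memory model, not any particular system's API: a write succeeds only if the key still holds the value the client last observed, which gives the isolation-like guarantee described above.

```python
import threading

class KeyValueStore:
    """In-memory store with an atomic test-and-set operation (illustrative only)."""
    def __init__(self):
        self._data = {}
        # A real system might lock per key; one lock suffices for a sketch.
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def test_and_set(self, key, expected, new_value):
        """Write new_value only if the key still holds expected; report success."""
        with self._lock:
            if self._data.get(key) == expected:
                self._data[key] = new_value
                return True
            return False

store = KeyValueStore()
store.test_and_set('balance', None, 100)       # initial write succeeds
ok = store.test_and_set('balance', 100, 40)    # expected value matches: accepted
stale = store.test_and_set('balance', 100, 0)  # expected value is stale: rejected
```

A client that loses the race simply re-reads the key and retries, which is how test-and-set substitutes for full transactional isolation.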

13.2.5. Schema-free Storage

A cross-cutting property of many NoSQL systems is the lack of schema enforcement in the database. Even in document stores and column family-oriented stores, properties across similar entities are not required to be the same. This has the benefit of supporting less structured data requirements and requiring less performance expense when modifying schemas on-the-fly. The decision leaves more responsibility to the application developer, who now has to program more defensively. For example, is the lack of a lastname property on an employee record an error to be rectified, or a schema update which is currently propagating through the system? Data and schema versioning is common in application-level code after a few iterations of a project which relies on sloppy-schema NoSQL systems.
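The defensive, versioned reads described above often look like the following sketch. The field names and the `_schema_version` tag are hypothetical; the point is that application code, not the database, decides how to interpret a missing property.

```python
# Two employee documents from a hypothetical schema-free store: the second
# predates an application-level schema change that split 'name' into
# separate firstname/lastname properties.
records = [
    {'_schema_version': 2, 'firstname': 'Ada', 'lastname': 'Lovelace'},
    {'_schema_version': 1, 'name': 'Grace Hopper'},
]

def last_name(record):
    """Interpret a record defensively based on its application-level version."""
    if record.get('_schema_version', 1) >= 2:
        return record['lastname']
    # Old layout: fall back to splitting the combined name field.
    return record.get('name', '').split()[-1] if record.get('name') else None

names = [last_name(r) for r in records]  # ['Lovelace', 'Hopper']
```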


13.3. Data Durability

Ideally, all data modifications on a storage system would immediately be safely persisted and replicated to multiple locations to avoid data loss. However, ensuring data safety is in tension with performance, and different NoSQL systems make different data durability guarantees in order to improve performance. Failure scenarios are varied and numerous, and not all NoSQL systems protect you against these issues.

A simple and common failure scenario is a server restart or power loss. Data durability in this case involves having moved the data from memory to a hard disk, which does not require power to store data. Hard disk failure is handled by copying the data to secondary devices, be they other hard drives in the same machine (RAID mirroring) or other machines on the network. However, a data center might not survive an event which causes correlated failure (a tornado, for example), and some organizations go so far as to copy data to backups in data centers several hurricane widths apart. Writing to hard drives and copying data to multiple servers or data centers is expensive, so different NoSQL systems trade off durability guarantees for performance.

13.3.1. Single-server Durability

The simplest form of durability is a single-server durability, which ensures that any data modification will survive a server restart or power loss. This usually means writing the changed data to disk, which often bottlenecks your workload. Even if you order your operating system to write data to an on-disk file, the operating system may buffer the write, avoiding an immediate modification on disk so that it can group several writes together into a single operation. Only when the fsync system call is issued does the operating system make a best-effort attempt to ensure that buffered updates are persisted to disk.

Typical hard drives can perform 100-200 random accesses (seeks) per second, and are limited to 30-100 MB/sec of sequential writes. Memory can be orders of magnitude faster in both scenarios. Ensuring efficient single-server durability means limiting the number of random writes your system incurs, and increasing the number of sequential writes per hard drive. Ideally, you want a system to minimize the number of writes between fsync calls, maximizing the number of those writes that are sequential, all the while never telling the user their data has been successfully written to disk until that write has been fsynced. Let's cover a few techniques for improving performance of single-server durability guarantees.
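The distinction between buffered and durable writes shows up directly in the system-call interface. A minimal sketch:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'data.log')
with open(path, 'ab') as f:
    f.write(b'key=value\n')
    f.flush()             # push the userspace buffer down to the OS...
    os.fsync(f.fileno())  # ...then ask the OS to persist it to the device

# Only after fsync returns is it reasonable to acknowledge the write as durable.
with open(path, 'rb') as f:
    data = f.read()
```

Without the `fsync` call, the operating system is free to keep the write in its page cache indefinitely, and a power loss would silently discard it.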

Control fsync Frequency

Memcached is an example of a system which offers no on-disk durability in exchange for extremely fast in-memory operations. When a server restarts, the data on that server is gone: this makes for a good cache and a poor durable data store.

Redis offers developers several options for when to call fsync. Developers can force an fsync call after every update, which is the slow and safe choice. For better performance, Redis can fsync its writes every N seconds. In a worst-case scenario, you will lose the last N seconds' worth of operations, which may be acceptable for certain uses. Finally, for use cases where durability is not important (maintaining coarse-grained statistics, or using Redis as a cache), the developer can turn off fsync calls entirely: the operating system will eventually flush the data to disk, but without guarantees of when this will happen.
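These three policies correspond to Redis's append-only-file configuration directives, shown here as a redis.conf fragment:

```
appendonly yes        # enable the append-only file for durability
appendfsync everysec  # fsync once per second: lose at most ~1s on a crash
# appendfsync always  # fsync after every write: safest, slowest
# appendfsync no      # let the OS decide when to flush: fastest, weakest
```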

Increase Sequential Writes by Logging

Several data structures, such as B+Trees, help NoSQL systems quickly retrieve data from disk. Updates to those structures result in updates in random locations in the data structures’ files, resulting in several random writes per update if you fsync after each update. To reduce random writes, systems such as Cassandra, HBase, Redis, and Riak append update operations to a sequentially-written file called a log. While other data structures used by the system are only periodically fsynced, the log is frequently fsynced. By treating the log as the ground-truth state of the database after a crash, these storage engines are able to turn random updates into sequential ones.
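The log-and-replay idea can be sketched as follows: updates are appended sequentially, and after a crash the in-memory state is rebuilt by replaying the log from the beginning. A real engine would also fsync the appends and periodically checkpoint, but this toy version shows why the log alone is enough to recover.

```python
import json
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), 'wal.log')

def append_update(key, value):
    """Append one update as a line of JSON: a sequential write, not a seek."""
    with open(log_path, 'a') as log:
        log.write(json.dumps({'key': key, 'value': value}) + '\n')

def replay():
    """Rebuild the key-value state from the log, as recovery would after a crash."""
    state = {}
    with open(log_path) as log:
        for line in log:
            entry = json.loads(line)
            state[entry['key']] = entry['value']
    return state

append_update('employee30:salary', 20000)
append_update('employee30:salary', 30000)  # the later update wins on replay
recovered = replay()
```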

While NoSQL systems such as MongoDB perform writes in-place in their data structures, others take logging even further. Cassandra and HBase use a technique borrowed from BigTable of combining their logs and lookup data structures into one log-structured merge tree. Riak provides similar functionality with a log-structured hash table. CouchDB has modified the traditional B+Tree so that all changes to the data structure are appended to the structure on physical storage. These techniques result in improved write throughput, but require a periodic log compaction to keep the log from growing unbounded.

Increase Throughput by Grouping Writes

Cassandra groups multiple concurrent updates within a short window into a single fsync call. This design, called group commit, results in higher latency per update, as users have to wait on several concurrent updates before their own update is acknowledged. The latency bump comes with an increase in throughput, as multiple log appends can happen with a single fsync. As of this writing, every HBase update is persisted to the underlying storage provided by the Hadoop Distributed File System (HDFS), which has recently seen patches to allow support of appends that respect fsync and group commit.
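Group commit can be sketched as batching: several pending updates share one append and one fsync, and none of them is acknowledged until that shared flush completes. This toy version batches explicitly rather than by a time window:

```python
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), 'commit.log')
pending = []  # updates waiting for the next group flush

def submit(update):
    pending.append(update)

def group_commit():
    """Persist all pending updates with a single append and a single fsync."""
    global pending
    batch, pending = pending, []
    with open(log_path, 'ab') as log:
        log.write(b''.join(u.encode() + b'\n' for u in batch))
        log.flush()
        os.fsync(log.fileno())  # one fsync covers the whole batch
    return batch                # only now are these updates acknowledged

submit('set a 1')
submit('set b 2')
submit('set c 3')
acknowledged = group_commit()   # three updates, one fsync
```

Each update pays the latency of waiting for its batch, but the per-update fsync cost is amortized across the whole group.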

13.3.2. Multi-server Durability

Because hard drives and machines often irreparably fail, copying important data across machines is necessary. Many NoSQL systems offer multi-server durability for data.

Redis takes a traditional master-slave approach to replicating data. All operations executed against a master are communicated in a log-like fashion to slave machines, which replicate the operations on their own hardware. If a master fails, a slave can step in and serve the data from the state of the operation log that it received from the master. This configuration might result in some data loss, as the master does not confirm that the slave has persisted an operation in its log before acknowledging the operation to the user. CouchDB facilitates a similar form of directional replication, where servers can be configured to replicate changes to documents on other stores.

MongoDB provides the notion of replica sets, where some number of servers are responsible for storing each document. MongoDB gives developers the option of ensuring that all replicas have received updates, or to proceed without ensuring that replicas have the most recent data. Many of the other distributed NoSQL storage systems support multi-server replication of data. HBase, which is built on top of HDFS, receives multi-server durability through HDFS. All writes are replicated to two or more HDFS nodes before returning control to the user, ensuring multi-server durability.

Riak, Cassandra, and Voldemort support more configurable forms of replication. With subtle differences, all three systems allow the user to specify N, the number of machines which should ultimately have a copy of the data, and W<N, the number of machines that should confirm the data has been written before returning control to the user.

To handle cases where an entire data center goes out of service, multi-server replication across data centers is required. Cassandra, HBase, and Voldemort have rack-aware configurations, which specify the rack or data center in which various machines are located. In general, blocking the user’s request until a remote server has acknowledged an update incurs too much latency. Updates are streamed without confirmation when performed across wide area networks to backup data centers.


13.4. Scaling for Performance

Having just spoken about handling failure, let’s imagine a rosier situation: success! If the system you build reaches success, your data store will be one of the components to feel stress under load. A cheap and dirty solution to such problems is to scale up your existing machinery: invest in more RAM and disks to handle the workload on one machine. With more success, pouring money into more expensive hardware will become infeasible. At this point, you will have to replicate data and spread requests across multiple machines to distribute load. This approach is called scale out, and is measured by the horizontal scalability of your system.

The ideal horizontal scalability goal is linear scalability, in which doubling the number of machines in your storage system doubles the query capacity of the system. The key to such scalability is in how the data is spread across machines. Sharding is the act of splitting your read and write workload across multiple machines to scale out your storage system. Sharding is fundamental to the design of many systems, namely Cassandra, HBase, Voldemort, and Riak, and more recently MongoDB and Redis. Some projects such as CouchDB focus on single-server performance and do not provide an in-system solution to sharding, but secondary projects provide coordinators to partition the workload across independent installations on multiple machines.

Let's cover a few interchangeable terms you might encounter. We will use the terms sharding and partitioning interchangeably. The terms machine, server, or node refer to some physical computer which stores part of the partitioned data. Finally, a cluster or ring refers to the set of machines which participate in your storage system.

Sharding means that no one machine has to handle the write workload on the entire dataset, but no one machine can answer queries about the entire dataset. Most NoSQL systems are key-oriented in both their data and query models, and few queries touch the entire dataset anyway. Because the primary access method for data in these systems is key-based, sharding is typically key-based as well: some function of the key determines the machine on which a key-value pair is stored. We’ll cover two methods of defining the key-machine mapping: hash partitioning and range partitioning.

13.4.1. Do Not Shard Until You Have To

Sharding adds system complexity, and where possible, you should avoid it. Let’s cover two ways to scale without sharding: read replicas and caching.

Read Replicas

Many storage systems see more read requests than write requests. A simple solution in these cases is to make copies of the data on multiple machines. All write requests still go to a master node. Read requests go to machines which replicate the data, and are often slightly stale with respect to the data on the write master.

If you are already replicating your data for multi-server durability in a master-slave configuration, as is common in Redis, CouchDB, or MongoDB, the read slaves can shed some load from the write master. Some queries, such as aggregate summaries of your dataset, which might be expensive and often do not require up-to-the-second freshness, can be executed against the slave replicas. Generally, the less stringent your demands for freshness of content, the more you can lean on read slaves to improve read-only query performance.


Caching

Caching the most popular content in your system often works surprisingly well. Memcached dedicates blocks of memory on multiple servers to cache data from your data store. Memcached clients take advantage of several horizontal scalability tricks to distribute load across Memcached installations on different servers. To add memory to the cache pool, just add another Memcached host.

Because Memcached is designed for caching, it does not have as much architectural complexity as the persistent solutions for scaling workloads. Before considering more complicated solutions, think about whether caching can solve your scalability woes. Caching is not solely a temporary band-aid: Facebook has Memcached installations in the range of tens of terabytes of memory!

Read replicas and caching allow you to scale up your read-heavy workloads. When you start to increase the frequency of writes and updates to your data, however, you will also increase the load on the master server that contains all of your up-to-date data. For the rest of this section, we will cover techniques for sharding your write workload across multiple servers.

13.4.2. Sharding Through Coordinators

The CouchDB project focuses on the single-server experience. Two projects, Lounge and BigCouch, facilitate sharding CouchDB workloads through an external proxy, which acts as a front end to standalone CouchDB instances. In this design, the standalone installations are not aware of each other. The coordinator distributes requests to individual CouchDB instances based on the key of the document being requested.

Twitter has built the notions of sharding and replication into a coordinating framework called Gizzard. Gizzard takes standalone data stores of any type—you can build wrappers for SQL or NoSQL storage systems—and arranges them in trees of any depth to partition keys by key range. For fault tolerance, Gizzard can be configured to replicate data to multiple physical machines for the same key range.

13.4.3. Consistent Hash Rings

Good hash functions distribute a set of keys in a uniform manner. This makes them a powerful tool for distributing key-value pairs among multiple servers. The academic literature on a technique called consistent hashing is extensive, and the first applications of the technique to data stores were in systems called distributed hash tables (DHTs). NoSQL systems built around the principles of Amazon's Dynamo adopted this distribution technique, and it appears in Cassandra, Voldemort, and Riak.

Hash Rings by Example

Figure 13.1: A Distributed Hash Table Ring

Consistent hash rings work as follows. Say we have a hash function H that maps keys to uniformly distributed large integer values. We can form a ring of numbers in the range [1, L] that wraps around itself with these values by taking H(key) mod L for some relatively large integer L. This will map each key into the range [1, L]. A consistent hash ring of servers is formed by taking each server's unique identifier (say its IP address), and applying H to it. You can get an intuition for how this works by looking at the hash ring formed by five servers (A-E) in Figure 13.1.

There, we picked L = 1000. Let's say that H(A) mod L = 7, H(B) mod L = 234, H(C) mod L = 447, H(D) mod L = 660, and H(E) mod L = 875. We can now tell which server a key should live on. To do this, we map each key to a server by seeing if its hash falls in the range between that server and the next one in the ring. For example, A is responsible for keys whose hash value falls in the range [7,233], and E is responsible for keys in the range [875, 6] (this range wraps around on itself at 1000). So if H('employee30') mod L = 899, it will be stored by server E, and if H('employee31') mod L = 234, it will be stored on server B.
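The lookup in this example reduces to finding the nearest server position at or below the key's hash, wrapping around at the top of the ring. A sketch using the positions above:

```python
import bisect

L = 1000
# Server positions from the example: H(server) mod L
ring = {7: 'A', 234: 'B', 447: 'C', 660: 'D', 875: 'E'}
positions = sorted(ring)  # [7, 234, 447, 660, 875]

def server_for(hash_value):
    """Find the server whose position is closest at or below hash_value."""
    i = bisect.bisect_right(positions, hash_value) - 1
    # Values below the first position wrap around to the last server (E).
    return ring[positions[i]]

server_for(899)  # 'E', as in the employee30 example
server_for(234)  # 'B', as in the employee31 example
server_for(3)    # 'E': the range [875, 6] wraps around the ring
```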

Replicating Data

Replication for multi-server durability is achieved by passing the keys and values in one server's assigned range to the servers following it in the ring. For example, with a replication factor of 3, keys mapped to the range [7,233] will be stored on servers A, B, and C. If A were to fail, its neighbors B and C would take over its workload. In some designs, E would replicate and take over A's workload temporarily, since its range would expand to include A's.

Achieving Better Distribution

While hashing is statistically effective at uniformly distributing a keyspace, it usually requires many servers before it distributes evenly. Unfortunately, we often start with a small number of servers that are not perfectly spaced apart from one-another by the hash function. In our example, A's key range is of length 227, whereas E's range is 132. This leads to uneven load on different servers. It also makes it difficult for servers to take over for one-another when they fail, since a neighbor suddenly has to take control of the entire range of the failed server.

To solve the problem of uneven large key ranges, many DHTs including Riak create several 'virtual' nodes per physical machine. For example, with 4 virtual nodes, server A will act as servers A_1, A_2, A_3, and A_4. Each virtual node hashes to a different value, giving it more opportunity to manage keys distributed to different parts of the keyspace. Voldemort takes a similar approach, in which the number of partitions is manually configured and usually larger than the number of servers, resulting in each server receiving a number of smaller partitions.
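Virtual nodes can be sketched by hashing several tokens per physical server onto the ring. The token naming scheme ('A_1', 'A_2', ...) and the use of MD5 here are illustrative choices, not any particular system's scheme:

```python
import bisect
import hashlib

L = 1000
SERVERS = ['A', 'B', 'C', 'D', 'E']
VNODES = 4  # virtual nodes per physical server

def h(s):
    """Map a string deterministically to a position on the ring [0, L)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % L

# Each physical server claims several positions on the ring, one per token.
ring = {h(f'{srv}_{i}'): srv for srv in SERVERS for i in range(VNODES)}
positions = sorted(ring)

def server_for(key):
    i = bisect.bisect_right(positions, h(key)) - 1
    return ring[positions[i]]  # index -1 wraps around to the last position

owners = {server_for(f'employee{n}') for n in range(100)}
```

With more tokens per server, each server's total share of the ring converges toward 1/N, and a failed server's load spreads across many successors rather than one neighbor.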

Cassandra does not assign multiple small partitions to each server, resulting in sometimes uneven key range distributions. For load-balancing, Cassandra has an asynchronous process which adjusts the location of servers on the ring depending on their historic load.

13.4.4. Range Partitioning

In the range partitioning approach to sharding, some machines in your system keep metadata about which servers contain which key ranges. This metadata is consulted to route key and range lookups to the appropriate servers. Like the consistent hash ring approach, this range partitioning splits the keyspace into ranges, with each key range being managed by one machine and potentially replicated to others. Unlike the consistent hashing approach, two keys that are next to each other in the key’s sort order are likely to appear in the same partition. This reduces the size of the routing metadata, as large ranges are compressed to [start, end] markers.

In adding active record-keeping of the range-to-server mapping, the range partitioning approach allows for more fine-grained control of load-shedding from heavily loaded servers. If a specific key range sees higher traffic than other ranges, a load manager can reduce the size of the range on that server, or reduce the number of shards that this server serves. The added freedom to actively manage load comes at the expense of extra architectural components which monitor and route shards.
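Routing metadata in a range-partitioned system is essentially a sorted list of range markers. A sketch of the lookup, with a hypothetical assignment of key ranges to servers:

```python
import bisect

# Hypothetical routing table: each entry is (range start, owning server).
# Server A owns [0, 500), B owns [500, 1500), and C owns [1500, infinity).
routes = [(0, 'A'), (500, 'B'), (1500, 'C')]
starts = [start for start, _ in routes]

def route(key):
    """Find the server whose range contains the key."""
    i = bisect.bisect_right(starts, key) - 1
    return routes[i][1]

route(900)   # 'B': 900 falls in [500, 1500)
route(1500)  # 'C': range starts are inclusive
```

Splitting a hot range is just inserting a new (start, server) entry, which is what gives the load manager its fine-grained control.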

The BigTable Way

Google’s BigTable paper describes a range-partitioning hierarchical technique for sharding data into tablets. A tablet stores a range of row keys and values within a column family. It maintains all of the necessary logs and data structures to answer queries about the keys in its assigned range. Tablet servers serve multiple tablets depending on the load each tablet is experiencing.

Each tablet is kept at a size of 100-200 MB. As tablets change in size, two small tablets with adjoining key ranges might be combined, or a large tablet might be split in two. A master server analyzes tablet size, load, and tablet server availability. The master adjusts which tablet server serves which tablets at any time.

Figure 13.2: BigTable-based Range Partitioning

The master server maintains the tablet assignment in a metadata table. Because this metadata can get large, the metadata table is also sharded into tablets that map key ranges to tablets and tablet servers responsible for those ranges. This results in a three-layer hierarchy traversal for clients to find a key on its hosting tablet server, as depicted in Figure 13.2.

Let's look at an example. A client searching for key 900 will query server A, which stores the tablet for metadata level 0. This tablet identifies the metadata level 1 tablet on server B covering key ranges 500-1500. The client sends a request to server B with this key, which responds that the tablet containing keys 850-950 is found on a tablet on server C. Finally, the client sends the key request to server C, and gets the row data back for its query. Metadata tablets at level 0 and 1 may be cached by the client, which avoids putting undue load on their tablet servers from repeat queries. The BigTable paper explains that this three-level hierarchy can accommodate 2^61 bytes worth of storage using 128 MB tablets.

Handling Failures

The master is a single point of failure in the BigTable design, but can go down temporarily without affecting requests to tablet servers. If a tablet server fails while serving tablet requests, it is up to the master to recognize this and re-assign its tablets while requests temporarily fail.

In order to recognize and handle machine failures, the BigTable paper describes the use of Chubby, a distributed locking system for managing server membership and liveness. ZooKeeper is the open source implementation of Chubby, and several Hadoop-based projects utilize it to manage secondary master servers and tablet server reassignment.

Range Partitioning-based NoSQL Projects

HBase employs BigTable’s hierarchical approach to range-partitioning. Underlying tablet data is stored in Hadoop’s distributed filesystem (HDFS). HDFS handles data replication and consistency among replicas, leaving tablet servers to handle requests, update storage structures, and initiate tablet splits and compactions.

MongoDB handles range partitioning in a manner similar to that of BigTable. Several configuration nodes store and manage the routing tables that specify which storage node is responsible for which key ranges. These configuration nodes stay in sync through a protocol called two-phase commit, and serve as a hybrid of BigTable’s master for specifying ranges and Chubby for highly available configuration management. Separate routing processes, which are stateless, keep track of the most recent routing configuration and route key requests to the appropriate storage nodes. Storage nodes are arranged in replica sets to handle replication.

Cassandra provides an order-preserving partitioner if you wish to allow fast range scans over your data. Cassandra nodes are still arranged in a ring using consistent hashing, but rather than hashing a key-value pair onto the ring to determine the server to which it should be assigned, the key is simply mapped onto the server which controls the range in which the key naturally fits. For example, keys 20 and 21 would both be mapped to server A in our consistent hash ring in Figure 13.1, rather than being hashed and randomly distributed in the ring.

Twitter's Gizzard framework for managing partitioned and replicated data across many back ends uses range partitioning to shard data. Routing servers form hierarchies of any depth, assigning ranges of keys to servers below them in the hierarchy. These servers either store data for keys in their assigned range, or route to yet another layer of routing servers. Replication in this model is achieved by sending updates to multiple machines for a key range. Gizzard routing nodes manage failed writes in a different manner than other NoSQL systems. Gizzard requires that system designers make all updates idempotent (they can be run twice). When a storage node fails, routing nodes cache and repeatedly send updates to the node until the update is confirmed.

13.4.5. Which Partitioning Scheme to Use

Given the hash- and range-based approaches to sharding, which is preferable? It depends. Range partitioning is the obvious choice to use when you will frequently be performing range scans over the keys of your data. As you read values in order by key, you will not jump to random nodes in the network, which would incur heavy network overhead. But if you do not require range scans, which sharding scheme should you use?

Hash partitioning gives reasonable distribution of data across nodes, and random skew can be reduced with virtual nodes. Routing is simple in the hash partitioning scheme: for the most part, the hash function can be executed by clients to find the appropriate server. With more complicated rebalancing schemes, finding the right node for a key becomes more difficult.

Range partitioning requires the upfront cost of maintaining routing and configuration nodes, which can see heavy load and become central points of failure in the absence of relatively complex fault tolerance schemes. Done well, however, range-partitioned data can be load-balanced in small chunks which can be reassigned in high-load situations. If a server goes down, its assigned ranges can be distributed to many servers, rather than loading the server’s immediate neighbors during downtime.


13.5. Consistency

Having spoken about the virtues of replicating data to multiple machines for durability and spreading load, it’s time to let you in on a secret: keeping replicas of your data on multiple machines consistent with one-another is hard. In practice, replicas will crash and get out of sync, replicas will crash and never come back, networks will partition two sets of replicas, and messages between machines will get delayed or lost. There are two major approaches to data consistency in the NoSQL ecosystem. The first is strong consistency, where all replicas remain in sync. The second is eventual consistency, where replicas are allowed to get out of sync, but eventually catch up with one-another. Let’s first get into why the second option is an appropriate consideration by understanding a fundamental property of distributed computing. After that, we’ll jump into the details of each approach.

13.5.1. A Little Bit About CAP

Why are we considering anything short of strong consistency guarantees over our data? It all comes down to a property of distributed systems architected for modern networking equipment. The idea was first proposed by Eric Brewer as the CAP Theorem, and later proved by Gilbert and Lynch [GL02]. The theorem first presents three properties of distributed systems which make up the acronym CAP:

  • Consistency: Do all replicas of a piece of data always logically agree on the same version of that data by the time you read it? (This concept of consistency is different from the C in ACID.)
  • Availability: Do replicas respond to read and write requests regardless of how many replicas are inaccessible?
  • Partition tolerance: Can the system continue to operate even if some replicas temporarily lose the ability to communicate with each other over the network?

The theorem then goes on to say that a storage system which operates on multiple computers can only achieve two of these properties at the expense of a third. Also, we are forced to implement partition-tolerant systems. On current networking hardware using current messaging protocols, packets can be lost, switches can fail, and there is no way to know whether the network is down or the server you are trying to send a message to is unavailable. All NoSQL systems should be partition-tolerant. The remaining choice is between consistency and availability. No NoSQL system can provide both at the same time.

Opting for consistency means that your replicated data will not be out of sync across replicas. An easy way to achieve consistency is to require that all replicas acknowledge updates. If a replica goes down and you can not confirm data updates on it, then you degrade availability on its keys. This means that until all replicas recover and respond, the user can not receive successful acknowledgment of their update operation. Thus, opting for consistency is opting for a lack of round-the-clock availability for each data item.

Opting for availability means that when a user issues an operation, replicas should act on the data they have, regardless of the state of other replicas. This may lead to diverging consistency of data across replicas, since they weren’t required to acknowledge all updates, and some replicas may have not noted all updates.

The implications of the CAP theorem lead to the strong consistency and eventual consistency approaches to building NoSQL data stores. Other approaches exist, such as the relaxed consistency and relaxed availability approach presented in Yahoo!’s PNUTS [CRS+08] system. None of the open source NoSQL systems we discuss has adopted this technique yet, so we will not discuss it further.

13.5.2. Strong Consistency

Systems which promote strong consistency ensure that the replicas of a data item will always be able to come to consensus on the value of a key. Some replicas may be out of sync with one-another, but when the user asks for the value of employee30:salary, the machines have a way to consistently agree on the value the user sees. How this works is best explained with numbers.

Say we replicate a key on N machines. Some machine, perhaps one of the N, serves as a coordinator for each user request. The coordinator ensures that a certain number of the N machines has received and acknowledged each request. When a write or update occurs to a key, the coordinator does not confirm with the user that the write occurred until W replicas confirm that they have received the update. When a user wants to read the value for some key, the coordinator responds when at least R have responded with the same value. We say that the system exemplifies strong consistency if R+W>N.

Putting some numbers to this idea, let's say that we're replicating each key across N=3 machines (call them A, B, and C). Say that the key employee30:salary is initially set to the value $20,000, but we want to give employee30 a raise to $30,000. Let's require that at least W=2 of A, B, or C acknowledge each write request for a key. When A and B confirm the write request for (employee30:salary, $30,000), the coordinator lets the user know that employee30:salary is safely updated. Let's assume that machine C never received the write request for employee30:salary, so it still has the value $20,000. When a coordinator gets a read request for key employee30:salary, it will send that request to all 3 machines:

  • If we set R=1, and machine C responds first with $20,000, our employee will not be very happy.
  • However, if we set R=2, the coordinator will see the value from C, wait for a second response from A or B, which will conflict with C's outdated value, and finally receive a response from the third machine, which will confirm that $30,000 is the majority opinion.

So in order to achieve strong consistency in this case, we need to set R=2 so that R+W > N.
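The quorum arithmetic above can be made concrete with a small sketch (a toy model in Python, not any real system's API; the class and method names are invented for illustration):

```python
# Toy model of quorum replication: a write succeeds once W of the N
# replicas acknowledge it, and a read consults R replicas. When R+W > N,
# every read quorum overlaps every write quorum in at least one replica,
# so at least one fresh copy is always among the responses.

def quorums_overlap(n, r, w):
    """Strong consistency requires every read/write quorum pair to intersect."""
    return r + w > n

class QuorumStore:
    def __init__(self, n, r, w):
        self.replicas = [{} for _ in range(n)]
        self.r, self.w = r, w

    def write(self, key, value, reachable):
        """Apply the write to reachable replicas; succeed only if >= W acked."""
        for i in reachable:
            self.replicas[i][key] = value
        return len(reachable) >= self.w

    def read(self, key, reachable):
        """Collect values from the first R replicas that answer."""
        return [self.replicas[i].get(key) for i in reachable[: self.r]]

# N=3, W=2: the raise reaches A and B (replicas 0 and 1) but never C.
store = QuorumStore(n=3, r=2, w=2)
store.write("employee30:salary", 20000, reachable=[0, 1, 2])
ok = store.write("employee30:salary", 30000, reachable=[0, 1])   # C missed it
print(ok)                                                # True: W=2 replicas acked
print(quorums_overlap(3, 2, 2))                          # True: R+W > N
print(store.read("employee30:salary", reachable=[2, 0])) # [20000, 30000]
```

The final read returns both the stale value from C and the fresh value from A, which is exactly the disagreement the coordinator must notice and resolve.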

What happens when W replicas do not respond to a write request, or R replicas do not respond to a read request with a consistent response? The coordinator can timeout eventually and send the user an error, or wait until the situation corrects itself. Either way, the system is considered unavailable for that request for at least some time.

Your choice of R and W affects how many machines can act strangely before your system becomes unavailable for different actions on a key. If you force all of your replicas to acknowledge writes, for example, then W=N, and write operations will hang or fail on any replica failure. A common choice is R + W = N + 1, the minimum required for strong consistency while still allowing for temporary disagreement between replicas. Many strong consistency systems opt for W=N and R=1, since they then do not have to design for nodes going out of sync.

HBase bases its replicated storage on HDFS, a distributed storage layer. HDFS provides strong consistency guarantees. In HDFS, a write cannot succeed until it has been replicated to all N (usually 2 or 3) replicas, so W = N. A read will be satisfied by a single replica, so R = 1. To avoid bogging down write-intensive workloads, data is transferred from the user to the replicas asynchronously in parallel. Once all replicas acknowledge that they have received copies of the data, the final step of swapping the new data into the system is performed atomically and consistently across all replicas.

13.5.3. Eventual Consistency

Dynamo-based systems, which include Voldemort, Cassandra, and Riak, allow the user to specify N, R, and W to their needs, even if R + W <= N. This means that the user can achieve either strong or eventual consistency. When a user picks eventual consistency, and even when the programmer opts for strong consistency but W is less than N, there are periods in which replicas might not see eye-to-eye. To provide eventual consistency among replicas, these systems employ various tools to catch stale replicas up to speed. Let's first cover how various systems determine that data has gotten out of sync, then discuss how they synchronize replicas, and finally bring in a few dynamo-inspired methods for speeding up the synchronization process.

Versioning and Conflicts

Because two replicas might see two different versions of a value for some key, data versioning and conflict detection is important. The dynamo-based systems use a type of versioning called vector clocks. A vector clock is a vector assigned to each key which contains a counter for each replica. For example, if servers A, B, and C are the three replicas of some key, the vector clock will have three entries, (N_A, N_B, N_C), initialized to (0,0,0).

Each time a replica modifies a key, it increments its counter in the vector. If B modifies a key that previously had version (39, 1, 5), it will change the vector clock to (39, 2, 5). When another replica, say C, receives an update from B about the key's data, it will compare the vector clock from B to its own. As long as its own vector clock counters are all less than the ones delivered from B, then it has a stale version and can overwrite its own copy with B's. If B and C have clocks in which some counters are greater than others in both clocks, say (39, 2, 5) and (39, 1, 6), then the servers recognize that they received different, potentially unreconcilable updates over time, and identify a conflict.
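The comparison rule can be sketched in a few lines (illustrative only; real systems store clocks as maps from server IDs to counters rather than fixed-width tuples):

```python
# Vector-clock comparison: clock a "descends from" clock b when every
# counter in a is >= the matching counter in b, i.e. a already includes
# everything b has seen.

def descends(a, b):
    return all(x >= y for x, y in zip(a, b))

def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "identical"
    if descends(a, b):
        return "a is newer"   # b is stale and can safely be overwritten
    if descends(b, a):
        return "b is newer"
    return "conflict"         # concurrent updates: neither includes the other

# B's update (39, 2, 5) descends from the old version (39, 1, 5), but it
# conflicts with C's independent update (39, 1, 6).
print(compare((39, 2, 5), (39, 1, 5)))   # a is newer
print(compare((39, 2, 5), (39, 1, 6)))   # conflict
```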

Conflict Resolution

Conflict resolution varies across the different systems. The Dynamo paper leaves conflict resolution to the application using the storage system. Two versions of a shopping cart can be merged into one without significant loss of data, but two versions of a collaboratively edited document might require a human reviewer to resolve the conflict. Voldemort follows this model, returning multiple copies of a key to the requesting client application upon conflict.

Cassandra, which stores a timestamp on each key, uses the most recently timestamped version of a key when two versions are in conflict. This removes the need for a round-trip to the client and simplifies the API. This design makes it difficult to handle situations where conflicted data can be intelligently merged, as in our shopping cart example, or when implementing distributed counters. Riak allows both of the approaches offered by Voldemort and Cassandra. CouchDB provides a hybrid: it identifies a conflict and allows users to query for conflicted keys for manual repair, but deterministically picks a version to return to users until conflicts are repaired.
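The trade-off between the two policies is easy to see side by side (a sketch with hypothetical helper names, not Cassandra's or Voldemort's actual API; the merge shown is for a shopping-cart-style set of items):

```python
# Two conflict-resolution policies for the same pair of conflicting versions.

def last_write_wins(versions):
    """Cassandra-style: keep the version with the newest timestamp."""
    return max(versions, key=lambda v: v["ts"])["value"]

def merge_carts(versions):
    """Application-level merge, as Dynamo suggests for shopping carts:
    the union of items never loses an addition (though deletions may
    resurface)."""
    merged = set()
    for v in versions:
        merged |= set(v["value"])
    return sorted(merged)

conflicted = [
    {"ts": 100, "value": ["milk", "eggs"]},
    {"ts": 105, "value": ["milk", "bread"]},
]
print(last_write_wins(conflicted))   # ['milk', 'bread'] -- 'eggs' is silently lost
print(merge_carts(conflicted))       # ['bread', 'eggs', 'milk'] -- nothing lost
```

Last-write-wins keeps the API simple but drops the earlier write entirely; the application-level merge preserves both, at the cost of pushing conflict handling onto the client.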

Read Repair

If R replicas return non-conflicting data to a coordinator, the coordinator can safely return the non-conflicting data to the application. The coordinator may still notice that some of the replicas are out of sync. The Dynamo paper suggests, and Cassandra, Riak, and Voldemort implement, a technique called read repair for handling such situations. When a coordinator identifies a conflict on read, even if a consistent value has been returned to the user, the coordinator starts conflict-resolution protocols between conflicted replicas. This proactively fixes conflicts with little additional work. Replicas have already sent their version of the data to the coordinator, and faster conflict resolution will result in less divergence in the system.

Hinted Handoff

Cassandra, Riak, and Voldemort all employ a technique called hinted handoff to improve write performance for situations where a node temporarily becomes unavailable. If one of the replicas for a key does not respond to a write request, another node is selected to temporarily take over its write workload. Writes for the unavailable node are kept separately, and when the backup node notices the previously unavailable node become available, it forwards all of the writes to the newly available replica. The Dynamo paper utilizes a ‘sloppy quorum’ approach and allows the writes accomplished through hinted handoff to count toward the W required write acknowledgments. Cassandra and Voldemort will not count a hinted handoff against W, and will fail a write which does not have W confirmations from the originally assigned replicas. Hinted handoff is still useful in these systems, as it speeds up recovery when an unavailable node returns.
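The buffering-and-forwarding idea can be sketched as follows (a toy model with invented names; real systems persist hints to disk and layer this under the quorum machinery):

```python
# Hinted handoff sketch: when the primary replica for a key is down,
# a backup node holds the write as a "hint" and forwards it once the
# primary comes back.

class Node:
    def __init__(self, name):
        self.name, self.data, self.hints, self.up = name, {}, {}, True

    def write(self, key, value):
        if not self.up:
            return False
        self.data[key] = value
        return True

def write_with_handoff(key, value, primary, backup):
    """If the primary is unreachable, the backup buffers the write."""
    if primary.write(key, value):
        return "stored"
    backup.hints.setdefault(primary.name, []).append((key, value))
    return "hinted"

def deliver_hints(backup, primary):
    """When the primary recovers, the backup forwards the buffered writes."""
    for key, value in backup.hints.pop(primary.name, []):
        primary.write(key, value)

a, b = Node("A"), Node("B")
a.up = False
print(write_with_handoff("k", 1, primary=a, backup=b))   # hinted
a.up = True
deliver_hints(b, a)
print(a.data)                                            # {'k': 1}
```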


Anti-Entropy

When a replica is down for an extended period of time, or the machine storing hinted handoffs for an unavailable replica goes down as well, replicas must synchronize from one another. In this case, Cassandra and Riak implement a Dynamo-inspired process called anti-entropy. In anti-entropy, replicas exchange Merkle trees to identify parts of their replicated key ranges which are out of sync. A Merkle tree is a hierarchical hash verification: if the hash over the entire keyspace is not the same between two replicas, they will exchange hashes of smaller and smaller portions of the replicated keyspace until the out-of-sync keys are identified. This approach reduces unnecessary data transfer between replicas which contain mostly similar data.
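The hash-and-descend search can be sketched as a recursive comparison over a sorted key range (a toy model; real implementations hash serialized values and maintain persistent trees):

```python
import hashlib

# Merkle-style divergence detection: compare the hash of a whole key range
# first, and descend into halves only where the hashes differ, so replicas
# holding mostly identical data exchange very little.

def range_hash(store, keys):
    """Hash one contiguous slice of a replica's keyspace."""
    payload = repr([(k, store.get(k)) for k in keys]).encode()
    return hashlib.sha1(payload).hexdigest()

def diverging_keys(a, b, keys):
    if range_hash(a, keys) == range_hash(b, keys):
        return []                 # whole subtree in sync: stop, transfer nothing
    if len(keys) == 1:
        return list(keys)         # leaf level: this key is out of sync
    mid = len(keys) // 2
    return diverging_keys(a, b, keys[:mid]) + diverging_keys(a, b, keys[mid:])

replica_a = {"k1": 1, "k2": 2, "k3": 3, "k4": 4}
replica_b = {"k1": 1, "k2": 2, "k3": 99, "k4": 4}
print(diverging_keys(replica_a, replica_b, ["k1", "k2", "k3", "k4"]))  # ['k3']
```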


Gossip

Finally, as distributed systems grow, it is hard to keep track of how each node in the system is doing. The three Dynamo-based systems employ an age-old high school technique known as gossip to keep track of other nodes. Periodically (every second or so), a node will pick a random node it has communicated with in the past and exchange knowledge of the health of the other nodes in the system. Through these exchanges, nodes learn which other nodes are down, and know where to route clients in search of a key.
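A toy gossip round might look like this (illustrative only; real implementations version each entry with heartbeat counters rather than blindly merging dictionaries):

```python
import random

# Each node's "view" maps node names to a status. In every round each node
# exchanges its full view with one randomly chosen peer, so knowledge
# spreads epidemically through the cluster.
random.seed(7)  # make the example deterministic

nodes = ["A", "B", "C", "D"]
views = {n: {n: "up"} for n in nodes}   # each node starts knowing only itself
views["A"]["D"] = "down"                # A has noticed that D is unreachable

def gossip_round(views):
    for node in nodes:
        peer = random.choice([p for p in nodes if p != node])
        merged = {**views[peer], **views[node]}   # toy merge, last writer wins
        views[node], views[peer] = dict(merged), dict(merged)

rounds = 0
while any(len(v) < len(nodes) for v in views.values()):
    gossip_round(views)
    rounds += 1

# After a handful of rounds, every node has heard about every other node.
print(sorted(views["B"]))   # ['A', 'B', 'C', 'D']
print(rounds)
```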


13.6. A Final Word

The NoSQL ecosystem is still in its infancy, and many of the systems we’ve discussed will change architectures, designs, and interfaces. The important takeaways in this chapter are not what each NoSQL system currently does, but rather the design decisions that led to a combination of features that make up these systems. NoSQL leaves a lot of design work in the hands of the application designer. Understanding the architectural components of these systems will not only help you build the next great NoSQL amalgamation, but also allow you to use current versions responsibly.


13.7. Acknowledgments

I am grateful to Jackie Carter, Mihir Kedia, and the anonymous reviewers for their comments and suggestions to improve the chapter. This chapter would also not be possible without the years of dedicated work of the NoSQL community. Keep building!








Cao Yuzhong (caoyuz@cn.ibm.com), Software Engineer, IBM China Development Lab

Summary: This article systematically introduces shell script debugging techniques, including using commands such as echo, tee, and trap to print key information and track the values of variables, planting debug hooks in scripts, using the "-n" option to syntax-check a shell script, using the "-x" option to trace a shell script statement by statement, and cleverly using the shell's built-in variables to enrich the output of the "-x" option.

Tags: shell, script, debugging

Published: July 26, 2007
Level: Introductory

I. Introduction

Shell programming is used extremely widely in the unix/linux world, and mastering it is a rite of passage for any good unix/linux developer or system administrator. The main work of script debugging is to find the cause of a script error and to locate the failing line in the source. Common techniques include analyzing the printed error messages, adding debug statements to the script so that diagnostic output helps pinpoint the error, and using debugging tools. Compared to other high-level languages, however, the shell interpreter offers little in the way of debugging facilities or tool support, and the error messages it prints are often far from clear. Beginners often know nothing beyond printing information with echo statements, and diagnosing errors purely through masses of added echo output is tedious enough that newcomers frequently complain that shell scripts are just too hard to debug. This article systematically introduces several important shell script debugging techniques, in the hope that it will be of some help to shell beginners.

The target readers of this article are developers, testers, and system administrators working in unix/linux environments; basic shell programming knowledge is assumed. The examples in this article were tested under Bash 3.1 on Red Hat Enterprise Server 4.0, but the debugging techniques described should apply equally well to other shells.

II. Printing Debug Information from a Shell Script

Adding debug statements to display information at key points, or at the place where an error occurs, is the most common debugging technique. Shell programmers usually use echo (ksh programmers often use print) to display information, but tracing with echo alone is awkward: the pile of echo statements added during the debugging phase all have to be laboriously removed again when the product is delivered. This section therefore focuses on more convenient and effective ways to print debug information.

1. Using the trap command

trap 'command' signal
Here signal is the signal to catch and command is what to execute when that signal is caught. You can list all signal names available on the system with the kill -l command; the command executed on catching a signal can be any one or more valid shell statements, or the name of a function. In addition to those signals, the shell also produces three "pseudo-signals" that can be trapped, shown in Table 1.
Table 1. Shell pseudo-signals

Signal name  When it is raised
EXIT         On return from a function, or when the entire script finishes
ERR          When a command returns a non-zero status (the command failed)
DEBUG        Before each command in the script is executed

trap 'command' EXIT  or  trap 'command' 0
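The EXIT pseudo-signal is handy for cleanup that must run no matter how a script terminates. A minimal sketch (the scratch-file name and message are invented for illustration):

```shell
#!/bin/bash
# Clean up a scratch file on every exit path: normal completion,
# an explicit 'exit', or termination by a caught signal.
tmpfile=$(mktemp)
trap 'rm -f "$tmpfile"' EXIT

echo "working data" > "$tmpfile"
grep -q working "$tmpfile" && echo "scratch file is in place"
# No cleanup code is needed here: the EXIT trap removes $tmpfile automatically.
```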


The following example uses the ERR pseudo-signal to report the line number and exit status of each failing command:

$ cat -n exp1.sh
     1  ERRTRAP()
     2  {
     3    echo "[LINE:$1] Error: Command or function exited with status $?"
     4  }
     5  foo()
     6  {
     7    return 1;
     8  }
     9  trap 'ERRTRAP $LINENO' ERR
    10  abc
    11  foo


$ sh exp1.sh
exp1.sh: line 10: abc: command not found
[LINE:10] Error: Command or function exited with status 127
[LINE:11] Error: Command or function exited with status 1



The next example uses the DEBUG pseudo-signal to track the values of variables as each line of the script executes:

$ cat -n exp2.sh
     1  #!/bin/bash
     2  trap 'echo "before execute line:$LINENO, a=$a,b=$b,c=$c"' DEBUG
     3  a=1
     4  if [ "$a" -eq 1 ]
     5  then
     6     b=2
     7  else
     8     b=1
     9  fi
    10  c=3
    11  echo "end"


$ sh exp2.sh
before execute line:3, a=,b=,c=
before execute line:4, a=1,b=,c=
before execute line:6, a=1,b=,c=
before execute line:10, a=1,b=2,c=
before execute line:11, a=1,b=2,c=3


2. Using the tee command

Pipes and I/O redirection are used heavily in shell scripts, and with a pipe the output of one command feeds directly into the next. If a pipeline does not produce the result we expect, we need to inspect the intermediate result of each stage to find out where the problem lies, but because of the pipe those intermediate results never appear on the screen, which makes debugging difficult. This is where the tee command helps: it reads from standard input and writes the data both to standard output and to a file.


ipaddr=`/sbin/ifconfig | grep 'inet addr:' | grep -v '127.0.0.1' |
cut -d : -f3 | awk '{print $1}'`
echo $ipaddr


ipaddr=`/sbin/ifconfig | grep 'inet addr:' | grep -v '127.0.0.1' |
tee temp.txt | cut -d : -f3 | awk '{print $1}'`
echo $ipaddr


$ cat temp.txt
inet addr:  Bcast:  Mask:

We can see that it is the second field of the intermediate result (fields are separated by :) that contains the IP address, while the script above used cut to extract the third field; simply changing cut -d : -f3 to cut -d : -f2 gives the correct result.

For this particular example we might not actually need tee: we could run the piped commands one segment at a time at the prompt and inspect each command's output to diagnose the error. But in more complex shell scripts, the piped commands may depend on other variables defined in the script, and re-running each stage by hand at the prompt becomes very inconvenient; simply dropping a tee between two pipe stages to capture the intermediate result is much easier.

3. Using "debug hooks"


if [ "$DEBUG" = "true" ]; then
    echo "debugging"  # print debug information here
fi
A code block like this is usually called a "debug hook" or "debug block". Inside the hook you can print whatever debug information you like, and the benefit is that it is controlled by the DEBUG variable: during development you run export DEBUG=true to turn the hooks on and print debug output, and when the script is delivered there is no need to go through it deleting debug statements one by one. A convenient way to use debug hooks is to wrap the check in a function, as exp3.sh below does: any command passed as arguments to the DEBUG function is executed only when debugging is enabled.


$ cat -n exp3.sh
     1  DEBUG()
     2  {
     3  if [ "$DEBUG" = "true" ]; then
     4      $@  
     5  fi
     6  }
     7  a=1
     8  DEBUG echo "a=$a"
     9  if [ "$a" -eq 1 ]
    10  then
    11       b=2
    12  else
    13       b=1
    14  fi
    15  DEBUG echo "b=$b"
    16  c=3
    17  DEBUG echo "c=$c"



III. Using the Shell's Execution Options


-n  Read the shell script but do not actually execute it
-x  Enter trace mode, displaying every command as it is executed
-c "string"  Read commands from string

The "-n" option can be used to test a shell script for syntax errors without actually executing any commands. It is a good habit to syntax-check a script with "-n" after writing it and before running it for real, because some shell scripts affect the system environment when they run, for example by creating or moving files; if a syntax error only surfaces during a real run, you may have to restore parts of the system environment by hand before you can continue testing the script.

The "-c" option makes the shell read and execute commands from the given string, for example:

sh -c 'a=1;b=2;let c=$a+$b;echo "c=$c"'

The "-x" option traces a script's execution and is a powerful tool for debugging shell scripts. It makes the shell display every command line it actually executes, prefixed with a "+" sign. What follows the "+" is the command line after variable substitution, which helps you see exactly what was run. The "-x" option is simple and convenient, handles most shell debugging tasks with ease, and should be treated as the debugging tool of first resort.

If we combine the trap 'command' DEBUG mechanism described earlier with the "-x" option, we can both display every command actually executed and track the relevant variables line by line, which is a great help in debugging.


$ sh -x exp2.sh
+ trap 'echo "before execute line:$LINENO, a=$a,b=$b,c=$c"' DEBUG
++ echo 'before execute line:3, a=,b=,c='
before execute line:3, a=,b=,c=
+ a=1
++ echo 'before execute line:4, a=1,b=,c='
before execute line:4, a=1,b=,c=
+ '[' 1 -eq 1 ']'
++ echo 'before execute line:6, a=1,b=,c='
before execute line:6, a=1,b=,c=
+ b=2
++ echo 'before execute line:10, a=1,b=2,c='
before execute line:10, a=1,b=2,c=
+ c=3
++ echo 'before execute line:11, a=1,b=2,c=3'
before execute line:11, a=1,b=2,c=3
+ echo end


Besides being given when the shell is started, execution options can also be set with the set command inside a script: "set -option" enables an option and "set +option" disables it. Sometimes we do not want "-x" tracing of every command from startup; in that case we can use set inside the script, as in the following fragment:

set -x    # turn the "-x" option on
set +x    # turn the "-x" option off

Only the commands between the two set statements are then traced. Combined with the DEBUG hook function defined earlier, the tracing itself can be controlled through the DEBUG environment variable:

DEBUG set -x    # turn the "-x" option on
DEBUG set +x    # turn the "-x" option off


IV. Enhancing the "-x" Option

The "-x" execution option is the most commonly used means of tracing and debugging shell scripts, but its debug output is limited to each actually executed command after variable substitution, prefixed with a "+" sign: it does not even include something as essential as line numbers, which makes debugging complex shell scripts rather inconvenient. Fortunately, we can cleverly use some of the shell's built-in environment variables to enrich the "-x" output. First, a few of these built-in variables:


$LINENO: the number of the line currently being executed in the script, similar to the built-in C macro __LINE__.

$FUNCNAME: the name of the function, similar to the built-in C macro __func__, but more powerful: __func__ can only give the name of the enclosing function, while $FUNCNAME is an array variable containing the names of all functions on the call chain. ${FUNCNAME[0]} is the function the script is currently executing, ${FUNCNAME[1]} is the function that called ${FUNCNAME[0]}, and so on.

The primary prompt variable $PS1 and the secondary prompt variable $PS2 are well known, but few people notice what the fourth-level prompt variable $PS4 does. We know that the "-x" execution option displays every command a script actually executes, and the value of $PS4 is printed in front of each of those command lines. In Bash the default value of $PS4 is "+". (Now you know where the "+" in "-x" output comes from.)

By exploiting this property of $PS4 and redefining it with some built-in variables, we can enrich the "-x" output. For example, first run export PS4='+{$LINENO:${FUNCNAME[0]}} ', then execute the script with "-x", and every actually executed command is displayed with its line number and the name of the function it belongs to.


Let's look at a fuller example:

$ cat -n exp4.sh
     1  #!/bin/bash
     2  isRoot()
     3  {
     4          if [ "$UID" -ne 0 ]
     5                  return 1
     6          else
     7                  return 0
     8          fi
     9  }
    10  isRoot
    11  if ["$?" -ne 0 ]
    12  then
    13          echo "Must be root to run this script"
    14          exit 1
    15  else
    16          echo "welcome root user"
    17          #do something
    18  fi

First run sh -n exp4.sh for a syntax check; the output is as follows:

$ sh -n exp4.sh
exp4.sh: line 6: syntax error near unexpected token `else'
exp4.sh: line 6: `      else'

A syntax error is found. Careful inspection of the commands around line 6 shows that it is caused by the if statement on line 4 missing its then keyword (a mistake habitual C programmers make easily). We can fix it by changing line 4 to if [ "$UID" -ne 0 ]; then. Running sh -n exp4.sh again reports no further errors, so we can actually execute the script; the result is as follows:

$ sh exp4.sh
exp4.sh: line 11: [1: command not found
welcome root user

Although the script now has no syntax errors, running it still reports an error, and a rather strange one: "[1: command not found". Now let's try customizing the value of $PS4 and tracing with the "-x" option:

$ export PS4='+{$LINENO:${FUNCNAME[0]}} '
$ sh -x exp4.sh
+{10:} isRoot
+{4:isRoot} '[' 503 -ne 0 ']'
+{5:isRoot} return 1
+{11:} '[1' -ne 0 ']'
exp4.sh: line 11: [1: command not found
+{16:} echo 'welcome root user'
welcome root user


Look again at these two lines of the trace output:

+{4:isRoot} '[' 503 -ne 0 ']'
+{11:} '[1' -ne 0 ']'

We can see that the [ on line 11 is missing a space after it, so [ and the value 1 of the adjacent variable $? were treated by the shell interpreter as a single token, which it then tried to execute as a command, hence the error message "[1: command not found". Inserting a space after the [ makes everything work.

The shell has other built-in variables that help with debugging; Bash, for example, also provides BASH_SOURCE, BASH_SUBSHELL, and a number of others. You can look them up with man sh or man bash, and then use whichever built-ins suit your debugging purpose to customize $PS4 and thereby enrich the output of the "-x" option.


V. Summary

Start by checking for syntax errors with the "-n" option, then trace the script's execution with the "-x" option; before using "-x", remember to customize the PS4 variable to enrich its output, at the very least making it show line numbers (run export PS4='+[$LINENO]' first; a once-and-for-all approach is to add that line to the .bash_profile in your home directory), which will make your debugging much easier. You can also use trap, debug hooks, and similar techniques to print key debugging information and quickly narrow down the error, and use "set -x" and "set +x" in the script to focus tracing on particular blocks of code. With these techniques combined, you should be able to hunt down the bugs in your shell scripts fairly comfortably. If your scripts are complex enough to need still more debugging power, you can use the shell debugger bashdb, a GDB-like tool that supports breakpoints, single-stepping, variable watching, and many other features for shell scripts; bashdb is also a great help in reading and understanding complex shell scripts. Installing and using bashdb is beyond the scope of this article; see the documentation and downloads at http://bashdb.sourceforge.net/.



Cao Yuzhong received a master's degree in computer software and theory from Beihang University. He has several years of development experience with C, Java, databases, and telecom billing software in unix environments, and his technical interests also include OSGi and search technology. He currently develops systems management software at the IBM China Systems and Technology Lab and can be reached at caoyuz@cn.ibm.com.
