Read Article: "The Discipline of Business Experimentation" (Thomke and Manzi, HBR, Dec. 2014)
Idea in Brief
The Problem
In the absence of sufficient data to inform decisions about proposed innovations, managers often rely on their experience, intuition, or conventional wisdom—none of which is necessarily relevant.
The Solution
A rigorous scientific test, in which companies separate an independent variable (the presumed cause) from a dependent variable (the observed effect) while holding all other potential causes constant, and then manipulate the former to study changes in the latter.
The Guidance
To make the most of their experiments, companies must ask: Does the experiment have a clear purpose? Have stakeholders made a commitment to abide by the results? Is the experiment doable? How can we ensure reliable results? Have we gotten the most value out of the experiment?
Soon after Ron Johnson left Apple to become the CEO of J.C. Penney, in 2011, his team implemented a bold program that eliminated coupons and clearance racks, filled stores with branded boutiques, and used technology to eliminate cashiers, cash registers, and checkout counters. Yet just 17 months after Johnson joined Penney, sales had plunged, losses had soared, and Johnson had lost his job. The retailer then did an about-face.
How could Penney have gone so wrong? Didn't it have tons of transaction data revealing customers' tastes and preferences?
Presumably it did, but the trouble is that big data can provide clues only about the past behavior of customers—not about how they will react to bold changes. When it comes to innovation, then, most managers must operate in a world where they lack sufficient data to inform their decisions. Consequently, they frequently rely on their experience or intuition. But ideas that are truly innovative—that is, those that can reshape industries—typically go against the grain of executive experience and conventional wisdom.
Managers can, however, discover whether a new product or business program will succeed by subjecting it to a rigorous test. Think of it this way: A pharmaceutical company would never introduce a drug without first conducting a round of experiments based on established scientific protocols. (In fact, the U.S. Food and Drug Administration requires extensive clinical trials.) Yet that's essentially what many companies do when they roll out new business models and other novel concepts. Had J.C. Penney done thorough experiments on its CEO's proposed changes, the company might have discovered that customers would probably reject them.
Why don't more companies conduct rigorous tests of their risky overhauls and expensive proposals? Because most organizations are reluctant to fund proper business experiments and have considerable difficulty executing them. Although the process of experimentation seems straightforward, it is surprisingly hard in practice, owing to myriad organizational and technical challenges. That is the overarching conclusion of our 40-plus years of collective experience conducting and studying business experiments at dozens of companies, including Bank of America, BMW, Hilton, Kraft, Petco, Staples, Subway, and Walmart.
Experiment: Kohl's
The retailer set out to test the hypothesis that opening stores an hour later would not lead to a significant drop in sales.
Running a standard A/B test over a direct channel such as the internet—comparing, for example, the response rate to version A of a web page with the response rate to version B—is a relatively simple exercise using math developed a century ago. But the vast majority (more than 90%) of consumer business is conducted through more-complex distribution systems, such as store networks, sales territories, bank branches, fast-food franchises, and so on. Business experimentation in such environments suffers from a variety of analytical complexities, the most important of which is that sample sizes are typically too small to yield valid results. Whereas a large online retailer can simply select 50,000 consumers at random and determine their reactions to an experimental offering, even the largest brick-and-mortar retailers can't randomly assign 50,000 stores to test a new promotion. For them, a realistic test group usually numbers in the dozens, not the thousands. Indeed, we have found that most tests of new consumer programs are too informal. They are not based on proven scientific and statistical methods, so executives end up misinterpreting statistical noise as causation—and making bad decisions.
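The "math developed a century ago" behind a direct-channel A/B test is essentially the two-proportion z-test. A minimal sketch using only the standard library (the response counts below are invented for illustration, not from the article):

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two response rates.
    conv_a/conv_b: responders in each variant; n_a/n_b: visitors shown each."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 50,000 randomly selected consumers per variant, as in the online example:
z, p = two_proportion_ztest(conv_a=2500, n_a=50_000, conv_b=2650, n_b=50_000)
```

With samples that large, even a 0.3-percentage-point lift is detectable; with the "dozens, not thousands" available to a store network, the same lift would drown in noise.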
In an ideal experiment the tester separates an independent variable (the presumed cause) from a dependent variable (the observed effect) while holding all other potential causes constant, and then manipulates the former to study changes in the latter. The manipulation, followed by careful observation and analysis, yields insight into the relationships between cause and effect, which ideally can be applied to and tested in other settings.
To obtain that kind of knowledge—and ensure that business experimentation is worth the expense and effort—companies need to ask themselves several crucial questions: Does the experiment have a clear purpose? Have stakeholders made a commitment to abide by the results? Is the experiment doable? How can we ensure reliable results? Have we gotten the most value out of the experiment? Although those questions seem obvious, many companies begin conducting tests without fully addressing them.
Does the Experiment Have a Clear Purpose?
Companies should conduct experiments if they are the only practical way to answer specific questions about proposed management actions.
Consider Kohl's, the large retailer, which in 2013 was looking for ways to decrease its operating costs. One suggestion was to open stores an hour later, Monday through Saturday. Company executives were split on the matter. Some argued that reducing the stores' hours would result in a significant drop in sales; others claimed that the impact on sales would be minimal. The only way to settle the debate with any certainty was to conduct a rigorous experiment. A test involving 100 of the company's stores showed that the delayed opening would not result in any meaningful sales decline.
In determining whether an experiment is needed, managers must first figure out exactly what they want to learn. Only then can they decide if testing is the best approach and, if it is, the scope of the experiment. In the case of Kohl's, the hypothesis to be tested was straightforward: Opening stores an hour later to reduce operating costs will not lead to a significant drop in sales. All too often, though, companies lack the discipline to hone their hypotheses, leading to tests that are inefficient, unnecessarily costly, or, worse, ineffective in answering the question at hand. A weak hypothesis (such as "We can extend our brand upmarket") doesn't present a specific independent variable to test on a specific dependent variable, so it is difficult either to support or to reject. A good hypothesis helps delineate those variables.
Farther Reading
-
How to Blueprint Smart Business Experiments
Experimentation Article
The real payoff will happen when an arrangement shifts to a test-and-larn heed-prepare.
- Save
In many situations executives need to go beyond the direct effects of an initiative and investigate its ancillary effects. For example, when Family Dollar wanted to determine whether to invest in refrigeration units so that it could sell eggs, milk, and other perishables, it discovered that a side effect—the increase in sales of traditional dry goods to the additional customers drawn to the stores by the refrigerated items—would actually have a bigger impact on profits. Ancillary effects can also be negative. A few years ago, Wawa, the convenience store chain in the mid-Atlantic United States, wanted to introduce a flatbread breakfast item that had done well in spot tests. But the initiative was killed before the launch, when a rigorous experiment—complete with test and control groups followed by regression analyses—showed that the new product would probably cannibalize other more profitable items.
Have Stakeholders Made a Commitment to Abide by the Results?
Before conducting any test, stakeholders must agree how they'll proceed once the results are in. They should promise to weigh all the findings instead of cherry-picking data that supports a particular point of view. Perhaps most important, they must be willing to walk away from a project if it's not supported by the data.
When Kohl's was considering adding a new product category, furniture, many executives were tremendously enthusiastic, anticipating significant additional revenue. A test at 70 stores over six months, however, showed a net decrease in revenue. Products that now had less floor space (to make room for the furniture) experienced a drop in sales, and Kohl's was actually losing customers overall. Those negative results were a huge disappointment for those who had advocated for the initiative, but the program was nevertheless scrapped. The Kohl's case highlights the fact that experiments are often needed to perform objective assessments of initiatives backed by people with organizational clout.
Of course, there might be good reasons for rolling out an initiative even when the anticipated benefits are not supported by the data—for instance, a program that experiments have shown will not substantially boost sales might still be necessary to build customer loyalty. But if the proposed initiative is a done deal, why go through the time and expense of conducting a test?
A process should be instituted to ensure that test results aren't ignored, even when they contradict the assumptions or intuition of top executives. At Publix Super Markets, a chain in the southeastern United States, virtually all large retail projects, especially those requiring considerable capital expenditures, must undergo formal experiments to receive a green light. Proposals go through a filtering process in which the first step is for finance to perform an analysis to determine if an experiment is worth conducting.
For projects that make the cut, analytics professionals develop test designs and submit them to a committee that includes the vice president of finance. The experiments approved by the committee are then conducted and overseen by an internal test group. Finance will approve significant expenditures only for proposed initiatives that have adhered to this process and whose experiment results are positive. "Projects get reviewed and approved much more quickly—and with less scrutiny—when they have our test results to back them," says Frank Maggio, a senior manager of business analysis at Publix.
When constructing and implementing such a filtering process, it is important to remember that experiments should be part of a learning agenda that supports a firm's organizational priorities. At Petco each test request must address how that particular experiment would contribute to the company's overall strategy to become more innovative. In the past the company performed about 100 tests a year, but that number has been trimmed to 75. Many test requests are denied because the company has done a similar test in the past; others are rejected because the changes under consideration are not radical enough to justify the expense of testing (for example, a price increase of a single item from $2.79 to $2.89). "We want to test things that will grow the business," says John Rhoades, the company's former director of retail analytics. "We want to try new concepts or new ideas."
Is the Experiment Doable?
Experiments must have testable predictions. But the "causal density" of the business environment—that is, the complexity of the variables and their interactions—can make it extremely difficult to determine cause-and-effect relationships. Learning from a business experiment is not necessarily as easy as isolating an independent variable, manipulating it, and observing changes in the dependent variable. Environments are constantly changing, the potential causes of business outcomes are often uncertain or unknown, and the linkages between them are frequently complex and poorly understood.
Consider a hypothetical retail chain that has 10,000 convenience stores, 8,000 of which are named QwikMart and 2,000 FastMart. The QwikMart stores have been averaging $1 million in annual sales and the FastMart stores $1.1 million. A senior executive asks a seemingly simple question: Would changing the name of the QwikMart stores to FastMart lead to an increase in revenue of $800 million? Obviously, numerous factors affect store sales, including the physical size of the store, the number of people who live within a certain radius and their average incomes, the number of hours the store is open per week, the experience of the store manager, the number of nearby competitors, and so on. But the executive is interested in only one variable: the stores' name (QwikMart versus FastMart).
The obvious solution is to conduct an experiment by changing the name of a handful of QwikMart stores (say, 10) to see what happens. But even determining the effect of the name change on those stores turns out to be tricky, because many other variables may have changed at the same time. For example, the weather was very bad at four of the locations, a manager was replaced in one, a large residential building opened near another, and a competitor started an aggressive advertising promotion near yet another. Unless the company can isolate the effect of the name change from those and other variables, the executive won't know for certain whether the name change has helped (or hurt) business.
To deal with environments of high causal density, companies need to consider whether it's feasible to use a sample large enough to average out the effects of all variables except those being studied. Unfortunately, that type of experiment is not always doable. The cost of a test involving an acceptable sample size might be prohibitive, or the change in operations could be too disruptive. In such instances, as we discuss later, executives can sometimes employ sophisticated analytical techniques, some involving big data, to increase the statistical validity of their results.
That said, it should be noted that managers often mistakenly assume that a larger sample will automatically lead to better data. Indeed, an experiment can involve a lot of observations, but if they are highly clustered, or correlated with one another, the true sample size might actually be quite small. When a company uses a distributor instead of selling directly to customers, for example, that distribution point can easily lead to correlations among customer data.
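The clustering caveat can be quantified with the standard survey-statistics "design effect." A small sketch, with hypothetical distributor counts and an assumed intracluster correlation:

```python
def effective_sample_size(n_obs, cluster_size, icc):
    """Effective sample size under clustering, via the standard design
    effect DEFF = 1 + (m - 1) * ICC, where m is the average cluster size
    and ICC the intracluster correlation of observations."""
    deff = 1 + (cluster_size - 1) * icc
    return n_obs / deff

# 10,000 customer observations funneled through 100 distributors
# (100 observations per cluster) with a modest intracluster correlation:
n_eff = effective_sample_size(n_obs=10_000, cluster_size=100, icc=0.05)
```

Even a mild correlation of 0.05 shrinks 10,000 observations to fewer than 2,000 effectively independent ones—exactly the "lots of observations, small true sample" trap described above.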
The required sample size depends in large part on the magnitude of the expected effect. If a company expects the cause (for example, a change in store name) to have a large effect (a substantial increase in sales), the sample can be smaller. If the expected effect is small, the sample must be larger. This might seem counterintuitive, but think of it this way: The smaller the expected effect, the greater the number of observations required to distinguish it from the surrounding noise with the desired statistical confidence.
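That intuition corresponds to the standard power calculation for comparing two means; required sample size grows with the square of the noise-to-effect ratio. A sketch under assumed numbers (the 5% and 1% lifts and the 10% noise figure are illustrative, not from the article):

```python
import math

def n_per_group(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Observations needed per group to detect a difference in means of
    `delta` against noise `sigma`, at ~5% two-sided significance (z_alpha)
    and ~80% power (z_beta), using the standard normal approximation."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# A big expected effect (5% sales lift) vs. a small one (1% lift),
# both measured against store-to-store sales noise of 10%:
big_effect = n_per_group(delta=0.05, sigma=0.10)
small_effect = n_per_group(delta=0.01, sigma=0.10)
```

Shrinking the expected effect fivefold multiplies the required sample roughly twenty-five-fold, which is why small effects are so hard to detect with dozens of stores.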
Selecting the correct sample size does more than ensure that the results will be statistically valid; it can also enable a company to decrease testing costs and increase innovation. Readily available software programs can help companies choose the optimal sample size. (Full disclosure: Jim Manzi's firm, Applied Predictive Technologies, sells one, Test & Learn.)
How Can We Ensure Reliable Results?
In the previous section we described the basics for conducting an experiment. However, the truth is that companies typically have to make trade-offs among reliability, cost, time, and other practical considerations. Three methods can help reduce the trade-offs, thus increasing the reliability of the results.
Randomized field trials.
The concept of randomization in medical research is simple: Take a large group of individuals with the same characteristics and affliction, and randomly divide them into two subgroups. Administer the treatment to just one subgroup and closely monitor everyone's health. If the treated (or test) group does statistically better than the untreated (or control) group, then the therapy is deemed to be effective. Similarly, randomized field trials can help companies determine whether specific changes will lead to improved performance.
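The random division itself is mechanically simple. A minimal sketch for splitting, say, 100 stores into test and control groups (the store names are placeholders):

```python
import random

def randomize(units, seed=42):
    """Randomly split experimental units (stores, customers, ...) into
    equal-sized test and control groups. A fixed seed makes the split
    reproducible for auditing."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

stores = [f"store_{i:03d}" for i in range(100)]
test, control = randomize(stores)
```

The hard part, as the article stresses, is not the shuffle but conducting and analyzing the trial in a statistically rigorous fashion.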
The financial services company Capital One has long used rigorous experiments to test even the most seemingly trivial changes. Through randomized field trials, for instance, the company might test the color of the envelopes used for product offers by sending out two batches (one in the test color and the other in white) to determine any differences in response.
Randomization plays an important role: It helps prevent systematic bias, introduced consciously or unconsciously, from affecting an experiment, and it evenly spreads any remaining (and possibly unknown) potential causes of the outcome between the test and control groups. But randomized field tests are not without challenges. For the results to be valid, the field trials must be conducted in a statistically rigorous fashion.
Instead of identifying a population of test subjects with the same characteristics and then randomly dividing it into two groups, managers sometimes make the mistake of selecting a test group (say, a group of stores in a chain) and then assuming that everything else (the rest of the stores) should be the control group. Or they select the test and control groups in ways that inadvertently introduce biases into the experiment. Petco used to select its 30 best stores to try out a new initiative (as a test group) and compare them with its 30 worst stores (as the control group). Initiatives tested in this way would often look very promising but fail when they were rolled out.
Now Petco considers a wide range of parameters—store size, customer demographics, the presence of nearby competitors, and so on—to match the characteristics of the control and test groups. (Publix does the same.) The results from those experiments have been much more reliable.
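Matching on parameters like these can be sketched as a nearest-neighbor pairing over normalized store attributes. This is a simplified illustration with invented attribute values; the article does not describe Petco's actual matching procedure:

```python
import math

def match_controls(test_stores, candidates, features):
    """For each test store, pick the unused candidate whose (normalized)
    attributes are closest in Euclidean distance. `features` maps a store
    name to a tuple of attributes, e.g. (size, median income, competitors)."""
    dims = len(next(iter(features.values())))
    lo = [min(f[d] for f in features.values()) for d in range(dims)]
    hi = [max(f[d] for f in features.values()) for d in range(dims)]

    def norm(store):  # scale each attribute to [0, 1] so none dominates
        return tuple((features[store][d] - lo[d]) / ((hi[d] - lo[d]) or 1)
                     for d in range(dims))

    available = set(candidates)
    matches = {}
    for t in test_stores:
        best = min(available, key=lambda c: math.dist(norm(t), norm(c)))
        matches[t] = best
        available.remove(best)  # each control store is used at most once
    return matches

# Hypothetical attributes: (size in ksqft, median income in $k, competitors)
features = {
    "T1": (10, 50, 2), "T2": (35, 80, 6),
    "C1": (11, 52, 2), "C2": (40, 90, 8), "C3": (22, 60, 4),
}
matches = match_controls(["T1", "T2"], ["C1", "C2", "C3"], features)
```

The small test store is paired with the small, demographically similar candidate, not with the flagship—avoiding the best-stores-versus-worst-stores bias described above.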
Blind tests.
To minimize biases and increase reliability further, Petco and Publix have conducted "blind" tests, which help prevent the Hawthorne effect: the tendency of study participants to modify their behavior, consciously or subconsciously, when they are aware that they are part of an experiment. At Petco none of the test stores' staffers know when experiments are under way, and Publix conducts blind tests whenever it can. For simple tests involving price changes, Publix can use blind procedures because stores are continually rolling out new prices, so the tests are indistinguishable from normal operating practices.
Experiment: Wawa
A new flatbread did well in spot tests, but the chain killed it after rigorous experiments revealed it cannibalized other products.
But blind procedures are not always practical. For tests of new equipment or work practices, Publix typically informs the stores that have been selected for the test group. (Note: A higher experimental standard is the use of "double-blind" tests, in which neither the experimenters nor the test subjects are aware of which participants are in the test group and which are in the control. Double-blind tests are widely used in medical research but are not commonplace in business experimentation.)
Big data.
In online and other direct-channel environments, the math required to conduct a rigorous randomized experiment is well known. But as we discussed earlier, the vast majority of consumer transactions occur in other channels, such as retail stores. In tests in such environments, sample sizes are often smaller than 100, violating the assumptions of many standard statistical methods. To minimize the effects of this limitation, companies can employ specialized algorithms in combination with multiple sets of big data.
Consider a big retailer contemplating a store redesign that was going to cost a half-billion dollars to roll out to 1,300 locations. To test the idea, the retailer redesigned 20 stores and tracked the results. The finance team analyzed the data and concluded that the upgrade would increase sales by a meager 0.5%, resulting in a negative return on investment. The marketing team conducted a separate analysis and forecast that the redesign would lead to a healthy 5% sales increase.
As it turned out, the finance team had compared the test sites with other stores in the chain that were similar in size, demographics, income, and other variables but were not necessarily in the same geographic market. It had also used data from six months before and after the redesign. In contrast, the marketing team had compared stores within the same geographic region and had considered data from 12 months before and after the redesign. To decide which results to trust, the company employed big data, including transaction-level data (store items, the times of day when sales occurred, prices), store attributes, and data on the environments around the stores (competition, demographics, weather). In this way, the company selected stores for the control group that were a closer match with those in which the redesign was tested, which made the small sample size statistically valid. It then used objective, statistical methods to review both analyses. The results: The marketing team's findings were the more accurate of the two.
Even when a company can't follow a rigorous testing protocol, analysts can help identify and correct for certain biases, randomization failures, and other experimental imperfections. A common situation is when an organization's testing function is presented with nonrandomized natural experiments—the vice president of operations, for example, might want to know if the company's new employee training program, which was introduced in about 10% of the company's markets, is more effective than the old one. As it turns out, in such situations the same algorithms and big data sets that can be used to address the problem of small or correlated samples can also be deployed to tease out valuable insights and minimize uncertainty in the results. The analysis can then help experimenters design a true randomized field trial to confirm and refine the results, particularly when they are somewhat counterintuitive or are needed to inform a decision with large economic stakes.
For any experiment, the gold standard is repeatability; that is, others conducting the same test should obtain similar results. Repeating an expensive test is usually impractical, but companies can verify results in other ways. Petco sometimes deploys a staged rollout for big initiatives to confirm the results before proceeding with a companywide implementation. And Publix has a process for tracking the results of a rollout and comparing them with the predicted benefit.
Have We Gotten the Most Value out of the Experiment?
Many companies go through the expense of conducting experiments but then fail to make the most of them. To avoid that mistake, executives should take into account a proposed initiative's effect on various customers, markets, and segments and concentrate investments in areas where the potential paybacks are highest. The right question is usually not, What works? but, What works where?
Petco frequently rolls out a program only in stores that are most similar to the test stores that had the best results. By doing so, Petco not only saves on implementation costs but also avoids involving stores where the new program might not deliver benefits or might even have negative consequences. Thanks to such targeted rollouts, Petco has consistently been able to double the predicted benefits of new initiatives.
Another useful tactic is "value engineering." Most programs have some components that create benefits in excess of costs and others that do not. The trick, then, is to implement only the components with an attractive return on investment (ROI). As a simple example, let's say that a retailer's tests of a 20%-off promotion show a 5% lift in sales. What portion of that increase was due to the offer itself, and what resulted from the accompanying advertising and training of store staff, both of which directed customers to the sale items? In such cases, companies can conduct experiments to investigate various combinations of components (for example, the promotional offer with advertising but without additional staff training). An analysis of the results can disentangle the effects, allowing executives to drop the components (say, the additional staff training) that have a low or negative ROI.
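Testing "various combinations of components" is what a full-factorial design does: run every on/off combination and compare averages to estimate each component's contribution. A sketch with invented lift numbers (the 3.0/1.5/0.2 figures are hypothetical, and real data would include noise and possible interactions):

```python
from itertools import product

def main_effects(lifts):
    """Estimate each component's main effect from a full-factorial test.
    `lifts` maps a tuple of on/off flags, e.g. (offer, advertising,
    training), to the observed sales lift for that combination."""
    factors = len(next(iter(lifts)))
    effects = []
    for f in range(factors):
        on  = [v for combo, v in lifts.items() if combo[f] == 1]
        off = [v for combo, v in lifts.items() if combo[f] == 0]
        # Average lift with the component on, minus average with it off
        effects.append(sum(on) / len(on) - sum(off) / len(off))
    return effects

# Hypothetical lifts (%) for the eight offer/advertising/training cells:
lifts = {combo: 3.0 * combo[0] + 1.5 * combo[1] + 0.2 * combo[2]
         for combo in product((0, 1), repeat=3)}
offer, ads, training = main_effects(lifts)
```

Here the decomposition would show the training component contributing almost nothing to the lift—the kind of low-ROI component value engineering drops.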
Moreover, a careful analysis of the data generated by experiments can enable companies to better understand their operations and test their assumptions about which variables cause which effects. With big data, the emphasis is on correlation—discovering, for example, that sales of certain products tend to coincide with sales of others. But business experimentation can allow companies to look beyond correlation and investigate causality—uncovering, for example, the factors causing increases (or decreases) in purchases. Such fundamental knowledge of causality can be crucial. Without it, executives have only a fragmentary understanding of their businesses, and the decisions they make can easily backfire.
Best Practice: Petco
The specialty retailer ensures reliable results from its experiments by matching the characteristics of the control and test groups.
When Cracker Barrel Old Country Store, the Southern-themed restaurant chain, conducted an experiment to determine whether it should switch from incandescent to LED lights at its restaurants, executives were astonished to learn that customer traffic actually decreased in the locations that installed LED lights. The lighting initiative could have stopped there, but the company dug deeper to understand the underlying causes. As it turned out, the new lighting made the front porches of the restaurants look dimmer, and many customers mistakenly thought that the restaurants were closed. This was puzzling—the LEDs should have made the porches brighter. Upon further investigation, executives learned that the store managers hadn't previously been following the company's lighting standards; they had been making their own adjustments, often adding extra lighting on the front porches. So the luminosity dropped when the stores adhered to the new LED policy. The point here is that correlation alone would have left the company with the wrong impression—that LEDs are bad for business. It took experimentation to uncover the actual causal relationship.
Indeed, without fully understanding causality, companies leave themselves open to making big mistakes. Remember the experiment Kohl's did to investigate the effects of delaying the opening of its stores? During that testing, the company suffered an initial drop in sales. At that point, executives could have pulled the plug on the initiative. But an analysis showed that the number of customer transactions had remained the same; the issue was a drop in units per transaction. Eventually, the units per transaction recovered and total sales returned to previous levels. Kohl's couldn't fully explain the initial decrease, but executives resisted the temptation to blame the reduced operating hours. They didn't rush to equate correlation with causation.
What's important here is that many companies are discovering that conducting an experiment is merely the beginning. Value comes from analyzing and then exploiting the data. In the past, Publix spent 80% of its testing time gathering data and 20% analyzing it. The company's current goal is to reverse that ratio.
Challenging Conventional Wisdom
By paying attention to sample sizes, control groups, randomization, and other factors, companies can ensure the validity of their test results. The more valid and repeatable the results, the better they will hold up in the face of internal resistance, which can be especially strong when the results challenge long-standing industry practices and conventional wisdom.
When Petco executives investigated new pricing for a product sold by weight, the results were unequivocal. By far, the best price was for a quarter pound of the product, and that price ended in $.25. That result went sharply against the grain of conventional wisdom, which typically calls for prices ending in 9, such as $4.99 or $2.49. "This broke a rule in retailing that you can't have an 'ugly' price," notes Rhoades. At first, executives at Petco were skeptical of the results, but because the experiment had been conducted so rigorously, they eventually were willing to give the new pricing a try. A targeted rollout confirmed the results, leading to a sales jump of more than 24% after six months.
The lesson is not merely that business experimentation can lead to better ways of doing things. It can also give companies the confidence to overturn wrongheaded conventional wisdom and the faulty business intuition that even seasoned executives can display. And smarter decision making ultimately leads to improved performance.
Could J.C. Penney have averted disaster by rigorously testing the components of its overhaul? At this point, it's impossible to know. But one thing's for certain: Before attempting to implement such a bold program, the company needed to make sure that knowledge—not just intuition—was guiding the decision.
A version of this article appeared in the December 2014 issue of Harvard Business Review.
Source: https://hbr.org/2014/12/the-discipline-of-business-experimentation