Reliability data analysis
Trends in the statistical assessment of product reliability
by William Q. Meeker
Traditional reliability data has consisted of failure times for units that failed and running times for units that had not failed. Methods for analyzing such right-censored data (nonparametric estimation and maximum likelihood) were developed in the 1950s and the 1960s and became well-known to most statisticians by the 1970s. Interestingly (as can be seen by reading old papers in the engineering literature), there was a long period of time when many engineers thought that it was necessary to make all units fail so that the life data could be analyzed! Statistical methods for the analysis of censored data began to appear in commercial software by the mid-1980s (starting with SAS®) and are commonplace today.
The most popular tool for life data analysis is the probability plot, used to assess distribution goodness of fit, detect data anomalies, and display the results of fitting parametric distributions. It should be more widely recognized that the probability plot is a valuable tool for general distributional analysis and does not require censored data to be useful! A community of engineers has long championed what has been called “Weibull analysis,” which implies fitting a Weibull distribution to failure data.
But the Weibull distribution is not always the appropriate distribution to use, and modern software allows fitting a number of different parametric distributions. The vast majority of applications in reliability, however, use either the Weibull or lognormal distribution. One reason for this is that there are strong mechanistic justifications that suggest these distributions, much as the central limit theorem can sometimes be used to explain why some random variables should be well-described by a normal distribution.
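To make the censored-data fitting discussed above concrete, here is a minimal sketch of maximum likelihood estimation for a Weibull distribution with right-censored observations. The data, starting values, and use of Python's scipy library are illustrative assumptions for the sketch, not tools prescribed in this article; a lognormal fit would follow the same pattern.

```python
# A minimal sketch: maximum likelihood Weibull fit to right-censored data.
# Failure/running times and the event indicator below are made up for illustration.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

times = np.array([55., 187., 216., 240., 244., 335., 361., 373., 375., 386.])
event = np.array([1,    1,    1,    1,    0,    0,    1,    0,    1,    0])  # 1 = failed, 0 = censored

def neg_log_lik(params):
    """Log-density for failures, log-survival for right-censored running times."""
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf
    ll = np.sum(event * weibull_min.logpdf(times, shape, scale=scale))
    ll += np.sum((1 - event) * weibull_min.logsf(times, shape, scale=scale))
    return -ll

fit = minimize(neg_log_lik, x0=[1.0, np.median(times)], method="Nelder-Mead")
shape_hat, scale_hat = fit.x
print(f"Weibull shape = {shape_hat:.2f}, scale = {scale_hat:.1f}")
# A quantile of interest, e.g., the time by which 10% of units are expected to fail (B10 life):
print("Estimated B10 life:", weibull_min.ppf(0.10, shape_hat, scale=scale_hat))
```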
Multiple failure modes
For some applications it is important to distinguish among different product failure modes. In particular, it is important to do analyses that account for different failure modes when the failure modes behave differently (e.g., when both infant mortality and wear-out are causing product failures) or when there is need to assess the effect of or to make decisions about design changes that affect failure modes differently. When failure mode information is available for all failed units and when the different failure modes can be assumed to be statistically independent, the analysis of multiple failure mode data is, technically, not much more difficult than it is for analyzing a single failure mode. The analysis is, however, greatly facilitated when software tools have been designed to make the needed operations easy. Today, several statistical packages provide capabilities for estimating separate distributions for each failure mode and making assessments of improvement in product reliability by eliminating one or more of the failure modes.
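As an illustration of the independence analysis just described, the sketch below (with made-up data and an assumed Weibull form for each mode) estimates a separate distribution for each failure mode by treating failures from the other mode as right-censored; under independence, the per-mode survival probabilities multiply to give system reliability, so dropping a factor shows the effect of eliminating that mode.

```python
# A minimal sketch of competing-risk analysis with independent failure modes:
# to fit one mode, failures from all other modes are treated as right-censored.
# Records, mode names, and the Weibull assumption are illustrative only.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

# Hypothetical records: (time, failure mode), with None meaning still running.
records = [(35., "infant"), (60., "infant"), (120., None), (200., "wearout"),
           (250., "wearout"), (260., None), (310., "wearout"), (330., None)]
times = np.array([t for t, _ in records])

def fit_weibull(times, event):
    """ML Weibull fit: log-density for failures, log-survival for censored units."""
    def nll(p):
        shape, scale = p
        if shape <= 0 or scale <= 0:
            return np.inf
        return -(np.sum(event * weibull_min.logpdf(times, shape, scale=scale)) +
                 np.sum((1 - event) * weibull_min.logsf(times, shape, scale=scale)))
    return minimize(nll, x0=[1.0, np.median(times)], method="Nelder-Mead").x

fits = {}
for mode in ("infant", "wearout"):
    # Failures from the other mode count as censored observations for this mode.
    event = np.array([1.0 if m == mode else 0.0 for _, m in records])
    fits[mode] = fit_weibull(times, event)

# Under independence, system survival is the product of per-mode survivals;
# removing a mode's factor shows the gain from eliminating that failure mode.
t = 300.0
surv = {m: weibull_min.sf(t, c, scale=s) for m, (c, s) in fits.items()}
print("System survival at t = 300:", surv["infant"] * surv["wearout"])
print("If infant mortality were eliminated:", surv["wearout"])
```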
When failure-mode information is missing for some or all of the failed units or when the failure modes cannot be described by the simple independence model, the analysis is more complicated, and special methods, not generally available, have to be employed.
Field and warranty data
Although laboratory reliability testing is often used to make product design decisions, the “real” reliability data comes from the field, often in the form of warranty returns. Warranty databases were initially created for financial-reporting purposes, but more and more companies are finding that warranty data is a rich source of reliability information. Perhaps six to eight months after a product has been introduced to the market, managers begin to ask about warranty costs over the life cycle of the product.
There are often large gaps between predictions made from product design models (supplemented by limited reliability testing) and reality. These differences are often caused by unanticipated failure modes. Algorithms for early detection of emerging reliability issues are being implemented in software and have the potential to save companies large amounts of money. Once a new emerging issue has been identified, statistical methods can be used to produce forecasts for the additional warranty costs.
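One simple form of such a forecast is sketched below: given a fitted life distribution and the ages of surviving units in the field, the expected number of additional returns over a horizon is the sum of each unit's conditional failure probability. The Weibull parameters, unit ages, horizon, and cost figure are all illustrative assumptions.

```python
# A minimal sketch of a simple warranty forecast from a fitted life distribution.
# All numbers below are made up for illustration.
import numpy as np
from scipy.stats import weibull_min

shape, scale = 1.8, 48.0         # fitted Weibull (in months), assumed known here
ages = np.array([3., 5., 5., 8., 10., 12., 14., 14., 17., 20.])  # surviving units
horizon = 6.0                    # forecast the next 6 months
cost_per_return = 250.0          # assumed average cost of a warranty return

S = lambda t: weibull_min.sf(t, shape, scale=scale)
# P(fail in (age, age + horizon] | survived to age) for each unit
p_fail = (S(ages) - S(ages + horizon)) / S(ages)
expected_returns = p_fail.sum()
print(f"Expected additional returns: {expected_returns:.2f}")
print(f"Expected additional warranty cost: ${expected_returns * cost_per_return:.0f}")
```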
Degradation reliability data
In modern high-reliability applications, we might not expect to see failures in our reliability testing, resulting in limited reliability information for product design. While visiting Bell Laboratories in the late 1970s and 1980s, I began to see engineers in telecommunications reliability applications collecting what we called “degradation data.” In some cases engineers were recording degradation as the natural response but turning the responses into failure data for analysis (presumably because all of the textbooks and software at the time dealt only with the analysis of life data). But the small number of failures in these data sets provided only limited reliability information.
Today the term “degradation” refers to either performance degradation (e.g., light output from an LED) or some measure of actual chemical degradation (e.g., concentration of a harmful chemical compound). Over the past 30 years, I have had the opportunity to work on a number of different kinds of applications where degradation data was available. It is not always possible to find a degradation variable that corresponds to a failure mode of concern. When an appropriate degradation variable can be measured, degradation data, when properly analyzed, can provide much more information because there are quantitative measurements on all units (not just those that failed). Indeed, it is possible to make powerful reliability inferences from degradation data even when there are no failures!
Since the 1990s, statistical methods have been developed for making reliability inferences from degradation data. Initially these were developed by researchers or engineers in need of the methods. Statistical methods for the analysis of degradation data are, however, now beginning to be deployed in commercial statistical software.
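As a bare-bones illustration of the idea (realistic analyses typically use nonlinear mixed-effects degradation models), the sketch below fits a straight-line degradation path to each unit's hypothetical measurements and predicts when the path crosses an assumed failure threshold. The data and threshold are illustrative assumptions; the predicted crossing times can then be analyzed with the censored-data life methods described earlier, even though no unit has actually failed.

```python
# A minimal sketch of the simplest degradation analysis: fit a straight line
# to each unit's repeated measurements and estimate the threshold-crossing time.
# Measurements and the failure threshold are made up for illustration.
import numpy as np

threshold = 50.0  # degradation level defined as "failure"

# Hypothetical measurements: unit -> (measurement times, degradation readings)
units = {
    "A": (np.array([0., 500., 1000., 1500.]), np.array([0., 9., 21., 30.])),
    "B": (np.array([0., 500., 1000., 1500.]), np.array([0., 14., 29., 44.])),
    "C": (np.array([0., 500., 1000., 1500.]), np.array([0., 6., 13., 20.])),
}

pseudo_failure_times = {}
for name, (t, y) in units.items():
    slope, intercept = np.polyfit(t, y, 1)                # least-squares line
    pseudo_failure_times[name] = (threshold - intercept) / slope
    print(f"unit {name}: predicted crossing at t = {pseudo_failure_times[name]:.0f}")

# The "pseudo failure times" above provide reliability information on every
# unit, not just on units that have already failed.
```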
The next generation of reliability data
The next generation of reliability field data will be even richer in information. Use rate and environmental conditions are important sources of variability in product lifetimes. The most important differences between carefully controlled laboratory accelerated test experiments and field reliability results are due to uncontrolled field variation (unit-to-unit and temporal) in variables like use rate, load, vibration, temperature, humidity, UV intensity and UV spectrum. Historically, use rate/environmental data has, in most applications, not been available to reliability analysts. Incorporating use rate/environmental data into our analyses will provide stronger statistical methods.
Today it is possible to install sensors and smart chips in a product to measure and record use rate/environmental data over the life of the product. In addition to the time series of use rate/environmental data, we also can expect to see further developments in sensors that will provide information, at the same rate, on degradation or indicators of imminent failure. Depending on the application, such information is also called “system health” or “materials state” information.
In some applications (e.g., aircraft engines and power distribution transformers), system health/use rate/environmental data from a fleet of products in the field can be returned to a central location for real-time process monitoring and especially for prognostic purposes. An appropriate signal in this data might provoke rapid action to avoid a serious system failure (e.g., by reducing the load on an unhealthy transformer). Also, should some issue relating to system health arise at a later date, it would be possible to sort through historical data that has been collected to see if there might have been a detectable signal that could be used in the future to provide an early warning of the problem.
In products that are attached to the internet (e.g., computers and high-end printers), such use rate/environmental data can, with the owner’s permission, be downloaded periodically. In some cases use/environmental data will be available on units only when they are returned for repair (although monitoring at least a sample of units to get information on un-failed units would be statistically important).
The future possibilities for using use rate/environmental data in reliability applications are unbounded. Lifetime models that incorporate use rate/environmental data have the potential to explain much more variability in field data than has been possible before. The information also can be used to predict the future environment of individual units. This knowledge can, in turn, provide more precise estimates of the lifetime of individual products. As the cost of technology drops, cost-benefit ratios will decrease, and applications will spread.
Reliability is really an engineering discipline. Statistical methods, however, play an essential role in the practice of reliability. Over the years, as described in this article, we have seen the development of many new statistical methods for reliability analysis. These new developments have been driven by new technology (e.g., the ability to collect better degradation data) and the needs of engineers (e.g., to meet customer demand for high reliability). Newly developed methods tend to be widely used in industry if and only if they are available in easy-to-use, readily available software. Fortunately, today we are seeing the best ideas and most needed tools being implemented in such software.