Faculty Details

Photo of George T. Duncan

George T. Duncan

Professor of Statistics, Emeritus


Email: gd17@andrew.cmu.edu

Biography

George Duncan joined the Carnegie Mellon faculty in the Department of Statistics in 1974 and has been on the Heinz School faculty since 1978. He has served as Director of the Heinz School’s M.S., MPM, and Ph.D. programs, and as Associate Dean for Faculty from 2001 to 2002. Prior to coming to Carnegie Mellon University, he taught in the mathematics department at the University of California at Davis. He is a Visiting Faculty Member at Los Alamos National Laboratory, has been a visitor at Cambridge University, and was the Lord Simon Visiting Professor at the University of Manchester in 2005. Duncan holds a B.S. and an M.S. in Statistics from the University of Chicago and a Ph.D. in Statistics from the University of Minnesota.

Duncan is a Fellow of the American Statistical Association, a Fellow of the American Association for the Advancement of Science, a Fellow of the Royal Statistical Society, and an Elected Member of the International Statistical Institute. In 1996, he was elected Pittsburgh Statistician of the Year.

Duncan's general research interests are in Bayesian decision making, and in information technology and social accountability. His primary focus is the confidentiality of statistical databases. His work has appeared in leading journals, including the Journal of the American Statistical Association, Management Science, Econometrica, Operations Research, Psychometrika, and Biometrika. He has given keynote presentations in New Zealand, Ireland, Italy, Portugal, and England.

Teaching

His recent teaching includes statistical theory, advanced empirical methods, Bayesian inference, probabilistic methods in information technology, and management science.

As a Peace Corps volunteer in the Philippines from 1965 to 1967, Duncan taught at Mindanao State University. He has served as editor of the Journal of the American Statistical Association; Secretary of the Statistical Education Section of the American Statistical Association (ASA); Chair of the ASA Committee on Statistics in Selected Professions; and Chair of the ASA Committee on Privacy and Confidentiality.

Between 1989 and 1993, he chaired the National Academy of Sciences' Panel on Confidentiality and Data Access, whose work resulted in the book Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. He has served on privacy and confidentiality committees of the American Medical Association, the National Research Council's Institute of Medicine, and the University of Michigan, and on National Academy of Sciences panels on Research Access to Data, Use of Census Data in Transportation Studies, and Whither Biometrics?

Selected Publications

Book

Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. George T. Duncan, Thomas B. Jabine, and Virginia A. de Wolf; Panel on Confidentiality and Data Access. National Academy Press, 1993.

Articles

"Exploring the Tension Between Privacy and the Social Benefits of Governmental Databases." Invited paper in Security, Privacy, and Technology, edited by Podesta, Shane, and Leone. The Century Foundation, 2004.

"Disclosure Risk vs. Data Utility: The R-U Confidentiality Map as Applied to Topcoding" (with S. Lynne Stokes). Invited paper in the special issue on data confidentiality of Chance, 2004.

"Mediating the Tension Between Information Privacy and Information Access: The Role of Digital Government" (with Stephen Roehrig). In Public Information Technology: Policy and Management Issues, edited by G. David Garson. Idea Group, Hershey, PA, 2003.

"Policy and Practice on Release of Microdata." Proceedings of the 19th CEIES Seminar, "Innovative Solutions in Providing Access to Microdata." Eurostat, Lisbon, September 26, 2002.

"Confidentiality and Statistical Disclosure Limitation." International Encyclopedia of the Social and Behavioral Sciences, 2001.

"Forecasting Analogous Time Series" (with Wilpen L. Gorr and Janusz Szczypula). In Principles of Forecasting: A Handbook for Researchers and Practitioners (J. Scott Armstrong, ed.). Norwell, MA: Kluwer, 2001.

"Bayesian Insights on Disclosure Limitation: Mask or Impute?" (with Sallie Keller-McNulty). Proceedings of the International Society for Bayesian Analysis, Crete, 2000.

"Optimal Disclosure Limitation Strategy in Statistical Databases: Deterring Tracker Attacks Through Additive Noise" (with Sumitra Mukherjee). Journal of the American Statistical Association, 2000.

Education


PhD, Statistics, University of Minnesota

Working Papers


Disclosure Risk vs. Data Utility through the R-U Confidentiality Map in Multivariate Settings

Information organizations (IOs), such as statistical agencies, data archives, and trade associations, must ensure that data access does not compromise the confidentiality afforded data providers, whether individuals or establishments. Recognizing that deidentification of data is generally inadequate to protect confidentiality against attack by a data snooper, IOs can implement a variety of disclosure limitation (DL) techniques, such as topcoding, noise addition, and data swapping, in developing data products. Desirably, the resulting restricted data have both high data utility U to data users and low disclosure risk R from data snoopers. IOs lack a framework for examining tradeoffs between R and U under a specific DL procedure. They also lack systematic ways of comparing the performance of distinct DL procedures. To provide this framework and facilitate comparisons, the R-U confidentiality map is introduced to trace the joint impact on R and U of changes in the parameters of a DL procedure. Implementation of an R-U confidentiality map is illustrated in the case of multivariate noise addition. Analysis is provided for two important multivariate estimation problems: a data user who seeks to estimate linear combinations of means, and one who seeks to estimate regression coefficients.
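
The tradeoff the abstract describes can be sketched numerically. The toy simulation below (all data and parameter values are invented for illustration, not taken from the paper) masks a confidential variable with additive noise and computes one (R, U) point per noise level; sweeping the noise parameter traces an empirical R-U confidentiality map.

```python
import numpy as np

rng = np.random.default_rng(0)
true_data = rng.normal(50.0, 10.0, size=1000)   # hypothetical confidential variable

def ru_point(noise_sd, data, rng, tol=2.0):
    """One (R, U) point for additive-noise masking at the given noise level."""
    masked = data + rng.normal(0.0, noise_sd, size=data.size)
    # Utility U: how well the masked data reproduce the true mean (higher is better).
    utility = -(masked.mean() - data.mean()) ** 2
    # Risk R: fraction of records a snooper recovers to within +/- tol
    # by reading the masked value directly.
    risk = np.mean(np.abs(masked - data) <= tol)
    return risk, utility

# Sweeping the noise parameter traces the empirical R-U map.
for noise_sd in (0.5, 2.0, 8.0):
    r, u = ru_point(noise_sd, true_data, rng)
    print(f"noise_sd={noise_sd:4.1f}  R={r:.3f}  U={u:.6f}")
```

Larger noise drives risk down and utility down together; the map makes that joint movement explicit so an agency can pick an operating point.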


Disclosure Risk vs. Data Utility: The R-U Confidentiality Map

Recognizing that deidentification of data is generally inadequate to protect their confidentiality against attack by a data snooper, information organizations (IOs) can apply a variety of disclosure limitation (DL) techniques, such as topcoding, noise addition and data swapping. Desirably, the resulting restricted data have both high data utility U to data users and low disclosure risk R from data snoopers. IOs lack a coherent framework for examining tradeoffs between R and U for a specific DL procedure. They also lack systematic ways of comparing the performance of distinct DL procedures. To provide this framework and facilitate comparisons, the R-U confidentiality map is introduced to trace the joint impact on R and U of changes in the parameters of a DL procedure. Implementation of an R-U confidentiality map is illustrated in real multivariate data cases for two DL techniques: topcoding and multivariate noise addition. Topcoding is examined for a Cobb-Douglas regression model, as fit to restricted data from the New York City Housing and Vacancy Survey. Multivariate additive noise is examined under various scenarios of attack, predicated on different knowledge states for a data snooper, and for different goals of a data analyst. We illustrate how simulation methods can be used to implement an empirical R-U confidentiality map, which is suitable for analytically intractable specifications of R, U and the disclosure limitation method. Application is made to the Schools and Staffing Survey, which is conducted by the National Center for Education Statistics.
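
Topcoding, one of the two DL techniques examined above, can be illustrated in a few lines. This sketch uses synthetic lognormal "incomes" (purely hypothetical data, not the New York City Housing and Vacancy Survey) and caps every value above a chosen percentile:

```python
import numpy as np

rng = np.random.default_rng(1)
incomes = rng.lognormal(mean=10.5, sigma=0.8, size=1000)  # hypothetical income column

def topcode(values, cap):
    """Release each value, but replace anything above the cap with the cap."""
    return np.minimum(values, cap)

cap = np.quantile(incomes, 0.97)   # topcode at the 97th percentile (a policy choice)
released = topcode(incomes, cap)
```

The cap protects high-income outliers, who are the most identifiable records, at the cost of biasing statistics that depend on the upper tail, which is exactly the R-U tradeoff the map is designed to chart.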

Mediating the Tension Between Information Privacy and Information Access: The Role of Digital Government

Government agencies collect and disseminate data that bear on the most important issues of public interest. Advances in information technology, particularly the Internet, have multiplied the tension between demands for evermore comprehensive databases and demands for the shelter of privacy. In mediating between these two conflicting demands, agencies must address a host of difficult problems. These include providing access to information while protecting confidentiality, coping with health information databases, and ensuring consistency with international standards. The policies of agencies are determined by what is right for them to do, what works for them, and what they are required to do by law. They must interpret and respect the ethical imperatives of democratic accountability, constitutional empowerment, and individual autonomy. They must keep pace with technological developments by developing effective measures for making information available to a broad range of users. They must both abide by the mandates of legislation and participate in the process of developing new legislation that is responsive to changes that affect their domain. In managing confidentiality and data access functions, agencies have two basic tools: techniques for disclosure limitation through restricted data and administrative procedures through restricted access. The technical procedures for disclosure limitation involve a range of mathematical and statistical tools. The administrative procedures can be implemented through a variety of institutional mechanisms, ranging from privacy advocates, through internal privacy review boards, to a data and access protection commission.


Statistical Data Stewardship in the 21st Century: An Academic Perspective

This paper presents an academic perspective on a broad spectrum of ideas and best practices for statistical data collectors to ensure proper stewardship for personal information that they collect, process and disseminate. Academic researchers in confidentiality address statistical data stewardship both because of its inherent importance to society and because the mathematical and statistical problems that arise challenge their creativity and capability. To provide a factual basis for policy decisions, an information organization (IO) engages in a two-stage process: (1) It gathers sensitive personal and proprietary data of value for analysis from respondents who depend on the IO for confidentiality protection. (2) From these data, it develops and disseminates data products that are both useful and have low risk of confidentiality disclosure. The IO is a broker between the respondent who has a primary concern for confidentiality protection and the data user who has a primary concern for the utility of the data. This inherent tension is difficult to resolve because deidentification of the data is generally inadequate to protect their confidentiality against attack by a data snooper. Effective stewardship of statistical data requires restricted access or restricted data procedures. In developing restricted data, IOs apply disclosure limitation techniques to the original data. Desirably, the resulting restricted data have both high data utility U to users (analytically valid data) and low disclosure risk R (safe data). This paper explores the promise of the R-U confidentiality map, a chart that traces the impact on R and U of changes in the parameters of a disclosure limitation procedure. Theory for the R-U confidentiality map is developed for additive noise. By an implementation through simulation methods, an IO can develop an empirical R-U confidentiality map. Disclosure limitation for tabular data is discussed, and a new method, called cyclic perturbation, is introduced. The challenges posed by on-line access are explored.


Forecasting Analogous Time Series

Organizations that use time series forecasting on a regular basis generally forecast many variables, such as demand for many products or services. Within the population of variables forecasted by an organization, we can expect that there will be groups of analogous time series that follow similar, time-based patterns. The co-variation of analogous time series is a largely untapped source of information that can improve forecast accuracy (and explainability). This paper takes the Bayesian pooling approach to drawing information from analogous time series to model and forecast a given time series. Bayesian pooling uses data from analogous time series as multiple observations per time period in a group-level model. It then combines estimated parameters of the group model with conventional time series model parameters, using "shrinkage" weights estimated empirically from the data. Major benefits of this approach are that it 1) minimizes the number of parameters to be estimated (many other pooling approaches suffer from too many parameters to estimate), 2) builds on conventional time series models already familiar to forecasters, and 3) combines time series and cross-sectional perspectives in flexible and effective ways.
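
The shrinkage step described above can be sketched on simulated analogous series. The group model here is simply a common linear trend, and the weights follow the standard empirical-Bayes form; both are illustrative assumptions, not necessarily the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(42)
T, n_series = 12, 5
t = np.arange(T, dtype=float)
true_slopes = rng.normal(1.0, 0.1, n_series)           # analogous series share similar trends
series = true_slopes[:, None] * t + rng.normal(0, 2.0, (n_series, T))

# Conventional per-series estimates: OLS slope and its sampling variance.
Sxx = np.sum((t - t.mean()) ** 2)
coefs = [np.polyfit(t, y, 1) for y in series]
slopes = np.array([c[0] for c in coefs])
resid = series - np.array([np.polyval(c, t) for c in coefs])
sigma2 = np.sum(resid ** 2, axis=1) / (T - 2)
slope_var = sigma2 / Sxx

# Group-level model: pooled slope and between-series variance (method of moments).
group_slope = slopes.mean()
tau2 = max(slopes.var(ddof=1) - slope_var.mean(), 1e-9)

# Shrinkage: noisier series borrow more strength from the group estimate,
# with weights estimated empirically from the data.
w = tau2 / (tau2 + slope_var)
shrunk = w * slopes + (1 - w) * group_slope
```

Each pooled slope lies between the conventional per-series estimate and the group estimate, which is the sense in which the approach "combines time series and cross-sectional perspectives."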

Managing Information Privacy and Information Access in the Public Sector

Government agencies collect and disseminate data that bear on the most important issues of public interest. Advances in information technology, particularly the Internet, have multiplied the tension between demands for evermore comprehensive databases and demands for the shelter of privacy. In mediating between these two conflicting demands, agencies must address a host of difficult problems. These include providing access to information while protecting confidentiality, coping with health information databases, and ensuring consistency with international standards. The policies of agencies are determined by what is right for them to do, what works for them, and what they are required to do by law. They must interpret and respect the ethical imperatives of democratic accountability, constitutional empowerment, and individual autonomy. They must keep pace with technological developments by developing effective measures for making information available to a broad range of users. They must both abide by the mandates of legislation and participate in the process of developing new legislation that is responsive to changes that affect their domain. In managing confidentiality and data access functions, agencies have two basic tools: techniques for disclosure limitation through restricted data and administrative procedures through restricted access. The technical procedures for disclosure limitation involve a range of mathematical and statistical tools. The administrative procedures can be implemented through a variety of institutional mechanisms, ranging from privacy advocates, through internal privacy review boards, to a data and access protection commission.


Obtaining Information while Preserving Privacy: A Markov Perturbation Method for Tabular Data

Preserving privacy appears to conflict with providing information. Statistical information can, however, be provided while preserving a specified level of confidentiality protection. The general approach is to provide disclosure-limited data that maximize statistical utility subject to confidentiality constraints. Disclosure limitation based on Markov chain methods that respect the underlying uncertainty in real data is examined. For use with categorical data tables, a method called Markov perturbation is proposed as an extension of the PRAM method of Kooiman, Willenborg, and Gouweleeuw (1997). Markov perturbation allows cross-classified marginal totals to be maintained and promises to provide more information than the commonly used cell suppression technique.
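
The PRAM mechanism that Markov perturbation extends can be sketched as follows: each categorical value is independently recoded according to a row of a Markov transition matrix. The matrix and frequencies below are invented for illustration; maintaining marginal totals in expectation further requires that the category distribution be a stationary distribution of the matrix, which this toy matrix does not enforce.

```python
import numpy as np

rng = np.random.default_rng(7)
p = np.array([0.5, 0.3, 0.2])                 # hypothetical category frequencies
data = rng.choice(3, size=10_000, p=p)        # one categorical column

# Transition matrix: row i gives the probabilities of releasing category i
# as each category. A heavy diagonal means most records are left unchanged.
P = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

perturbed = np.array([rng.choice(3, p=P[v]) for v in data])

# Expected released frequencies are the observed frequencies pushed through P.
emp = np.bincount(data, minlength=3) / data.size
expected = emp @ P
```

Because each record is perturbed independently, a data snooper cannot tell whether any particular released value is genuine, yet the aggregate frequencies remain predictable from P.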

Optimal Disclosure Limitation Strategy in Statistical Databases: Deterring Tracker Attacks Through Additive Noise

Disclosure limitation methods transform statistical databases to protect confidentiality. A statistical database responds to queries with aggregate statistics. The database administrator should maximize legitimate data access while keeping the risk of disclosure below an acceptable level. Legitimate users seek statistical information, generally in aggregate form; malicious users, the data snoopers, attempt to infer confidential information about an individual data subject. Tracker attacks are of special concern for databases accessed online. This article derives optimal disclosure limitation strategies under tracker attacks for the important case of data masking through additive noise. Operational measures of the utility of data access and of disclosure risk are developed. The utility of data access is expressed so that tradeoffs can be made between the quantity and the quality of data to be released. The article shows that an attack by a data snooper is better thwarted by a combination of query restriction and data masking than by either disclosure limitation method separately. Data masking by independent noise addition and data perturbation are considered as extreme cases in the continuum of data masking using positively correlated additive noise. Optimal strategies are established for the data snooper. Circumstances are determined under which adding autocorrelated noise is preferable to using existing methods of either independent noise addition or data perturbation. Both moving average and autoregressive noise addition are considered.
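
The intuition for positively correlated noise can be seen in a small simulation (entirely synthetic, not the paper's model). With independent noise, a snooper who repeats the same query can average the answers and wash the noise out; if the noise for a repeated query is perfectly correlated (the same draw reused each time), averaging gains nothing:

```python
import numpy as np

rng = np.random.default_rng(3)
salaries = rng.normal(60_000.0, 10_000.0, size=200)   # confidential records
mask = np.zeros(200, dtype=bool)
mask[:5] = True                                       # subset a tracker has isolated
target = salaries[mask].sum()

def answer(noise):
    """Masked response to the SUM query over the isolated subset."""
    return salaries[mask].sum() + noise

# Independent noise: 400 repeats of the query, then average.
indep = np.array([answer(rng.normal(0, 5_000)) for _ in range(400)])

# Perfectly correlated noise: the same draw is reused for the repeated query.
fixed = rng.normal(0, 5_000)
corr = np.array([answer(fixed) for _ in range(400)])
```

Under independent noise, the standard error of the averaged answer shrinks with the square root of the number of repeats; under perfectly correlated noise, every repeat returns the same value, so the snooper learns nothing beyond the first answer.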


Comparative Study of Cross Sectional Methods for Time Series with Structural Changes



Presentations and Proceedings