Considering data science as a life cycle enables a natural consideration of crucial overarching factors such as reproducibility, documentation and metadata, ethics, and archiving of research artefacts such as data and code. The Data Science Life Cycle provides guidance on the multi-faceted set of skills and personnel needed for data science, for example “skills for dealing with organizational artefacts of large-scale cluster computing. The new skills cope with severe new constraints on algorithms posed by the multiprocessor/networked world.”7 Workforce development is therefore incorporated into the life cycle approach, which is especially germane to data science as “enthusiasm feeds on the notable successes scored in the last decade by brand-name global information technology (IT) enterprises, such as Google and Amazon.”7
The Data Science Life Cycle engages relevant stakeholders in the larger research community in a systematic way, including not only data science researchers but also others such as archivists, libraries and librarians, legal experts, publishers, funding agencies, and scientific societies. It provides a framework for clarifying how these different contributions knit together to advance data science.
A life cycle approach encourages and enables a unification of views regarding data science and gives us a footing from which to adapt and evolve the practice and teaching of data science to research projects and to institutional strengths. There are commonalities across nearly all data science efforts, for example, data wrangling, data inference, code writing, and artefact creation and sharing. A common intellectual framework can facilitate knowledge sharing about data science as a discipline across the different fields and domains using data science methods in their research.
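To make these commonalities concrete, the following is a minimal, hypothetical sketch of one pass through them in Python, assuming pandas and scikit-learn are available; the file names, the column name "label", and the model choice are illustrative assumptions rather than part of any particular project described here.

```python
# A hypothetical illustration of the common life cycle steps: wrangle data,
# perform inference, and create shareable artefacts. File and column names
# ("measurements.csv", "label") are assumptions for the sketch.
import json

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data wrangling: load the raw table and remove missing or duplicated rows.
raw = pd.read_csv("measurements.csv")
clean = raw.dropna().drop_duplicates()

# Data inference: fit a simple classifier and estimate its out-of-sample accuracy.
X = clean.drop(columns=["label"])
y = clean["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Artefact creation and sharing: persist the cleaned data and a results record
# so that others can inspect, reuse, and verify them.
clean.to_csv("clean_measurements.csv", index=False)
with open("results.json", "w") as f:
    json.dump({"model": "LogisticRegression", "test_accuracy": float(accuracy)}, f, indent=2)
```

However a given project fills in these steps, the same skeleton of wrangling, inference, and artefact creation recurs, which is what a common intellectual framework can capture.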
A data science curriculum. Conceptualizing data science as a life cycle also gives a way to position classes and sequences to teach core and elective data science skills, indicating where existing courses may fit and where new courses may need to be developed. It helps define a curriculum by using the steps of the Data Science Life Cycle as a pedagogical sequence and provides for the inclusion of overarching topics such as data science ethics, intellectual property, reproducibility, and data governance.24 Perhaps most importantly, the Data Science Life Cycle can indicate courses that may be out of scope and new course topics essential to data science.
The accompanying table shows how several commonly offered courses could be matched to the steps of the Data Science Life Cycle described in Figure 2. Although not included in the table, each step can be augmented by the creation of new targeted classes if needed, such as Data Policy, Reproducibility in Data Science, Data Science Ethics, Circuit Design for Deep Learning, Software Engineering Principles for Data Science, Mathematics for Data Science, Interoperability and Integration of Different Data Sources, Data Science with Streaming Data, Software Preservation and Archiving, Workflow Tools for Data Science, and Intellectual Property for Scientific Code and Data. The list goes on. The addition of domain-specific optional courses could define tracks or specializations within a data science curriculum (for example, Earth sciences, bioinformatics, sociology, or cyberinfrastructure for data science) to create a potential DS+X degree in the spirit of the CS+X degrees discussed previously.
Table. An example mapping from some routinely offered courses to the steps of the Data Science Life Cycle.
The emergence of a discipline of data science is necessary to advance data science as well as to encourage reliable and reproducible discoveries, elevating the endeavour to a branch of the scientific method. Data science may eventually develop as a set of discipline-adapted discovery techniques and practices, perhaps including a cross-disciplinary core. Data science is benefitting from close association with industry, as computer science did at its inception, for example, IBM’s creation of the Watson Scientific Computing Laboratory at Columbia University in 1945.14 Analysis of consumer data by Google, Facebook, and Amazon is generating prominent successes in image identification and voice transcription, among other areas. Opportunities for industry employment and workforce development make data science attractive at the institutional level.
Elevating the practice of data science to a science. The Data Science Life Cycle framework is an essential conceptualization in the development of data science as a science. A recent National Academies of Sciences, Engineering, and Medicine consensus report on “Reproducibility and Replicability in Science” spotlights the need to better develop scientific underpinnings for computationally and data-enabled research investigations,21 and a March 2019 National Academy of Sciences Colloquium entitled “The Science of Deep Learning” aimed to bring scientific foundations to the fore of the deep learning research agenda.19 The discussion regarding the scientific underpinnings of data analysis began in 1962, when John Tukey presented three criteria a discipline ought to meet in order to be considered a science:30
- Intellectual content.
- Organization into an understandable form.
- Reliance upon the test of experience as the ultimate standard of validity.
If one accepts these criteria, the Data Science Life Cycle can be leveraged to demonstrate intellectual content, promote its organization (see Figure 2), and incorporate external tests of the validity of findings. On this last point, the structure of the Data Science Life Cycle builds in reproducibility, reuse, and verification of results through its embedded notion that artefacts supporting the claims (such as data, code, and workflow information) be made available as part of the publication (life cycle) process. Research on platforms and infrastructure for data science facilitates Tukey’s second criterion by advancing organizational topics such as artefact metadata; containerization, packaging, and dissemination standards; and community expectations regarding FAIR (findability, accessibility, interoperability, and reusability), archiving, and persistence of the artefacts produced by data science. These efforts also help enable comparisons of data science pipelines to increase understanding of any differences in outcomes of “tests of experience.”29 The Data Science Life Cycle exposes these topics as areas for research within the discipline of data science.2 Several conferences and journals have begun to require artefact availability, and infrastructure projects are emerging to support reproducibility across the data science discovery pipeline.3 Considering these issues through a Data Science Life Cycle gives a frame for their inclusion as research areas integral to the discipline of data science. Data science without a unifying framework risks being a set of disparate computational activities in various scientific domains, rather than a coherent field of inquiry producing reliable, reproducible knowledge.
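As one illustration of how artefact availability might be built into the life cycle, the sketch below records a simple metadata manifest (content hashes, sizes, and basic environment information) for the data, code, and results of an analysis; the file names and manifest fields are assumptions for illustration, not a community standard or a specific infrastructure project mentioned above.

```python
# A hypothetical sketch of artefact metadata supporting reproducibility and
# verification: hash each artefact, record its size, capture basic environment
# information, and save the manifest alongside the artefacts.
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path

def sha256(path: Path) -> str:
    """Content hash so readers can verify they hold the exact artefact used."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Assumed artefacts of the analysis: cleaned data, analysis code, and results.
artefacts = [Path("clean_measurements.csv"), Path("analysis.py"), Path("results.json")]

manifest = {
    "created": datetime.now(timezone.utc).isoformat(),
    "python_version": sys.version,
    "platform": platform.platform(),
    "artefacts": [
        {"file": str(p), "sha256": sha256(p), "bytes": p.stat().st_size}
        for p in artefacts if p.exists()
    ],
}

# The manifest itself becomes one more artefact to archive and share.
Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```

A manifest of this kind is one small piece of the larger picture of metadata, packaging, and archiving standards that research on data science infrastructure aims to develop.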
Without a flexible yet unified overarching framework, we risk missing opportunities for discovering and addressing research issues within data science and for training students in effective scientific methodologies for reliable and transparent data-enabled discovery. Data science brings new research topics, for example, computational reproducibility, ethics in data science, and cyberinfrastructure and tools for data science. Without the Data Science Life Cycle approach, we risk an implementation of data science that hews too closely to the perspective of a particular discipline and could miss opportunities to share knowledge about data science research and teaching broadly across disciplines. In addition, a Data Science Life Cycle approach can give university leadership a framework to leverage existing resources on campus as they strategize support for a cross-disciplinary data science curriculum and research agenda. The life cycle approach allows data science research and curriculum efforts to support the development of a scientific discipline, enabling progress toward fulfilling Tukey’s three criteria for a science.
1. Berman, F. et al. Realizing the potential of data science. Commun. ACM 61, 4 (Apr. 2018), 67–72; https://cacm.acm.org/magazines/2018/4/226372-realizing-the-potential-of-data-science/fulltext
2. Bernau, C. et al. Cross-study validation for the assessment of prediction algorithms. Bioinformatics 30, 12 (2014), i105–i112; https://academic.oup.com/bioinformatics/article/30/12/i105/388164
3. Brinckman, A. et al. Computing environments for reproducibility: Capturing the ‘whole tale.’ Future Generation Computer Systems 94 (2019), 854–867.
4. Collberg, C. and Proebsting, T.A. Repeatability in computer systems research. Commun. ACM 59, 3 (Mar. 2016), 62–69; https://doi.org/10.1145/2812803
5. Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conf. Computer Vision and Pattern Recognition, 2009; https://ieeexplore.ieee.org/document/5206848
6. Dhar, V. Data science and prediction. Commun. ACM 56, 12 (Dec. 2013), 64–73; https://doi.org/10.1145/2500499
7. Donoho, D.L. 50 years of data science. J. Computational and Graphical Statistics 26, 4 (2017); https://www.tandfonline.com/doi/abs/10.1080/10618600.2017.1384734
8. Donoho, D.L., Maleki, A., Ur Rahman, I., Shahram, M. and Stodden, V. Reproducible research in computational harmonic analysis. Computing in Science & Engineering 11, 1 (Jan.–Feb. 2009).
9. Donoho, D.L. and Stodden, V. Reproducible research in the mathematical sciences. The Princeton Companion to Applied Mathematics. N.J. Higham, ed., 2015.
10. Golub, T.R. et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 5439 (1999), 531–537.
11. Guyon, I. et al. Gene selection for cancer classification using support vector machines. Machine Learning 46 (Jan. 2002); https://doi.org/10.1023/A:1012487302797
12. Hales, T. Mathematics in the age of the Turing machine. Turing’s Legacy: Developments from Turing’s Ideas in Logic. R. Downey, ed., 2014; https://www.cambridge.org/core/books/turings-legacy/mathematics-in-the-age-of-the-turing-machine/376464C81D16F9323EEFB2A2A924D2F4
13. Hoover, H. Quantitative analysis and literary studies. A Companion to Digital Literary Studies. S. Schreibman and R. Siemens, eds. Blackwell, Oxford, U.K., 2008.
14. IBM. The Origins of Computer Science; https://www.ibm.com/ibm/history/ibm100/us/en/icons/compsci/
15. Ivie, P. and Thain, D. Reproducibility in scientific computing. ACM Comput. Surv. 51, 3 (2018), Art. 63; https://doi.org/10.1145/3186266
16. Krizhevsky, A., Sutskever, I. and Hinton, G.E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, 2012. F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, eds.; http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
17. Lazer, D. et al. Computational social science. Science 323, 5915 (2009); http://science.sciencemag.org/content/323/5915/721
18. Manyika, J. et al. Big Data: The Next Frontier for Innovation, Competition and Productivity. McKinsey Global Institute, 2011; http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
19. NAS Sackler Colloquium. The Science of Deep Learning, 2019; http://www.cvent.com/events/the-science-of-deep-learning/event-summary-a96a8734ffa841ea8d5439e081b50f54.aspx
20. National Academies of Sciences, Engineering, and Medicine. Data Science for Undergraduates: Opportunities and Options. The National Academies Press, Washington, D.C., 2018; https://doi.org/10.17226/25104
21. National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. The National Academies Press, Washington, D.C., 2019; https://doi.org/10.17226/25303
22. Steering Committee on Computational Physics. Computation as a Tool for Discovery in Physics. Report to the National Science Foundation, 2002; https://www.nsf.gov/pubs/2002/nsf02176/nsf02176.pdf
23. Ouzounis, C.A. Rise and demise of bioinformatics? Promise and progress. PLoS Comput Biol 8, 4 (2012), e1002487; https://doi.org/10.1371/journal.pcbi.1002487
24. Saltz, J.S., Dewar, N.I., and Heckman, R. Key concepts for a data science ethics curriculum. In Proceedings of the 49th ACM Technical Symp. Computer Science Education. ACM, New York, NY, 2018, 952–957; https://doi.org/10.1145/3159450.3159483
25. Siewert, S. Big data in the cloud: Data velocity, volume, variety, veracity. IBM Developer, July 9, 2013; https://www.ibm.com/developerworks/library/bd-bigdatacloud/index.html
26. Stodden, V. The legal framework for reproducible research in the sciences: Licensing and copyright. Computing in Science and Engineering 11, 1 (2009), 35–40.
27. Stodden, V., Guo, P. and Ma, Z. Toward reproducible computational research: An empirical analysis of data and code policy adoption by journals. PLoS ONE 8, 6 (2013), e67111; https://doi.org/10.1371/journal.pone.0067111
28. Stodden, V., McNutt, M., Bailey, D.H., Deelman, E., Gil, Y., Hanson, B., Heroux, M.A., Ioannidis, J.P.A., and Taufer, M. Enhancing reproducibility for computational methods. Science 354, 6317 (Dec. 9, 2016).
29. Stodden, V., Wu, X. and Sochat, V. AIM: An abstraction for improving machine learning prediction. In Proceedings of the IEEE Data Science Workshop (Lausanne, Switzerland, 2018), 1–5.
30. Tukey, J.W. The future of data analysis. Ann. Math. Statist. 33, 1 (1962), 1–67.