History of Big Data: From Analog to Digital
This piece was originally written in May of 2020 for "History of Big Data" at New York University.
The history of data collection, analysis, and storage is an interdisciplinary one, involving thinkers from biologists like Linnaeus to engineers like Tim Berners-Lee, respected figures in seemingly unrelated academic realms. Yet their greatest contributions to our modern world are arguably identical in kind: a new way to see the world and subsequently organize information. As the science of data moves into the 20th and 21st centuries, the term “revolution” appears more and more, usually in the context of the “computer revolution” or “information revolution,” in prediction of a new socio-cultural trend. These terms likely inherit their names from the “scientific revolution” of the 1500s onward, the last time our ways of understanding the world experienced a pivotal change. But that does not necessarily mean the two periods are equatable in principle; in fact, the smooth and seamless integration of computers into our society raises the question of whether they were a revolutionary development or an evolutionary one. The goal of this paper is to argue that, to a large extent, this significant integration of computers into government, corporate, and finally personal life was evolutionary rather than revolutionary, because it imitated analog practices rather than upending them. In other words, the principles of organizing and understanding data in our world remain the same; it is the practices that have been altered.
To achieve this goal, the paper takes a comparative historical approach to draw similarities between past and present. This means approaching different technological developments and their supposed predecessors by category and purpose rather than chronology, which makes the comparisons more apparent. It also takes into account a wide range of texts across the relevant historical periods, albeit with varying degrees of correlation with the present, some more self-evident than others. The paper finishes with two case studies, on the Census and on the railroad, to ground the discussion.
Coping with Information Overload
A pattern visible at every stage of this “evolution” is the coping with an excess of information, which initially led to an evolution of note-taking and ends with the invention of the computer. One of the earliest documented attempts to solve information excess is Müller-Wille and Charmantier’s account of Linnaeus and his note-taking techniques. The study concluded that Linnaeus “had to move on from simple tables and diagrams to more complex and flexible ways of organizing his data,” which meant that “the tools he created kept evolving.” This refers to his experimentation with presenting large amounts of information about animals and plants, a task which in and of itself contributed to the information overload of his time. The study followed his “paper technologies,” which spanned from initial documentation of his thoughts on loose-leaf manuscripts, to organized notebooks, then index cards, and beyond, adapting to the amount of information and the flexibility he needed in his ever-expanding database of genera. The information overload of today follows an extremely similar discourse. As Siva Vaidhyanathan explains in The Googlization of Everything, Google’s “first brilliant innovation was… its search algorithm,” which found a way to index everything on the internet and make recommendations to users based on their search queries. However, it accomplished what we celebrate today as a technological feat through the same process of trial and error as Linnaeus, tweaking old ways to suit the new. In fact, Google and Linnaeus probably started with a similar intent of sharing valuable information and knowledge with the public without quite knowing how to achieve it. The early search engine industry began with a strong roster of competitors to Google, including HotBot and AltaVista, names scarcely heard today, all serving the similar purpose of indexing data on the internet in an organized manner.
Google finally received the recognition it deserved when technology reviewers saw value in its calculation of page authority, which determined the “importance, measured in popularity, of the sites that are linking to the page” and which remains one of the best indicators of credibility to this day.
The point here is that Google has not always been the standard search engine since the invention of the internet; rather, it was the strongest one that prevailed, consistently tweaking its methods to find the sweet spot for its users. This is comparable to how Linnaeus slowly changed his note-taking techniques over the years to cope with an ever-expanding database while still retaining the same quality of information as his previous notebooks. Although his note-taking techniques grew more effective over the years with new features, no single advancement Linnaeus made was “revolutionary.” In essence, data writing and storing has remained fundamentally the same as it moved from analog to digital; arguably, the only changes are the context of our time and the medium by which we store information. The transition resembles an imitation of existing practices more than a revolution or upheaval that created something entirely new.
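The link-authority idea described above can be sketched in a few lines. What follows is a minimal, hypothetical illustration of ranking pages by the importance of the sites linking to them, in the spirit of Google’s early PageRank algorithm; the page names, link graph, and parameter values are invented for the example and do not reflect Google’s actual implementation.

```python
# Toy sketch of link-based ranking: a page is "important" if
# important pages link to it. All names and numbers are invented.

def rank_pages(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}  # start with equal authority
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue
            # Each page shares its authority among the pages it links to.
            share = damping * scores[page] / len(outgoing)
            for target in outgoing:
                new[target] += share
        scores = new
    return scores

# A hypothetical three-page link graph.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}
scores = rank_pages(links)
```

In this invented graph, "home" ends up ranked highest because both other pages link to it, illustrating the point that authority flows from the popularity of the linking sites rather than from the page’s own content.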
Moving beyond note-taking, data visualization is arguably something individuals and small businesses have benefited from greatly through owning personal computers, which made complex charting available in the comfort of one’s home. Historical evidence suggests that, overwhelmingly, computers did not innovate here; they improved convenience and ease of access to existing ideas rather than inventing new modes of charting, which would categorize this too as an evolution of the analog. Specifically, Joseph Priestley documents in A Description of a Chart of Biography his fascination with the timeline, which visually represents the abstract flow of time, a means of showing important events in relation to time that is still commonly used today. In formal communication, no method of visualizing time introduced since the advent of computers has provided the same level of credibility as the traditional timeline. This is hardly even an “evolution”: computers merely made it somewhat more convenient to create the same timelines without the constraints of paper and its borders. The real evolution came with charting, as invented by William Playfair in The Commercial and Political Atlas, which he reportedly presented to King Louis XVI and which contained 44 charts displaying important economic data about the country, and in the later Statistical Breviary. Playfair is the inventor of the line chart, bar chart, and pie chart, some of the most common forms of data visualization used today to represent data in series or the relationships between different figures.
Today, Microsoft dominates the productivity market for professionals, and its Excel product offers graphing options that, in essence, seem not to have changed since Playfair’s era. A few more trivial options are available to instantaneously produce visually pleasing graphics or illuminate patterns without manual input, but on the topic of evolution, the core of the technology remains the same Cartesian plane on which Playfair produced graphs by hand, charting relationships on two axes. The primary area of improvement with the advancement of personal computers is, again, convenience and ease of access for the common person or small business owner who previously could not afford a large workforce or a mainframe computer.
Perhaps one may argue that Excel should more accurately be compared with the ledger, which is a fair observation and hardly changes the conclusion. Nearly three decades ago, Steven Levy wrote in “A Spreadsheet Way of Knowledge” about the innovation of the electronic spreadsheet, a digital version of the ledger sheet that accountants would use. The program ran from a floppy disk on Apple computers, and it allowed people to gain “a more accurate picture of their businesses and let them see … how they might grow.” The difference is that this digital ledger can complete automatic calculations, making previously unrealistic recalculations of variable changes far more convenient. One would be correct to say this technology was life-changing for such individuals, who “before spreadsheets… would have taken a guess. [But] now they feel obligated to run the numbers.” Without intending to discredit the contribution of spreadsheets to society, it is still worth pointing out that the row-and-column system, in which each intersection stores a single cell of data, is in fact a replication of existing analog technology. The improved convenience, shown by how intuitively the digital ledger was understood when the product was first introduced, only goes further to show that computer technology builds upon previously established foundations.
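The automatic recalculation described above is what separated the electronic ledger from its paper predecessor. The following toy model sketches that mechanism; the `Sheet` class, cell names, and figures are all hypothetical, invented purely to illustrate how dependent cells reflect a changed input.

```python
# Toy model of a spreadsheet: the row-and-column ledger survives,
# but formulas recompute automatically when an input cell changes.

class Sheet:
    def __init__(self):
        self.values = {}    # cell name -> literal number
        self.formulas = {}  # cell name -> function of the sheet

    def set_value(self, cell, number):
        self.values[cell] = number

    def set_formula(self, cell, func):
        self.formulas[cell] = func

    def get(self, cell):
        # Formulas are evaluated on demand, so any change to an input
        # cell is reflected the next time a dependent cell is read.
        if cell in self.formulas:
            return self.formulas[cell](self)
        return self.values[cell]

sheet = Sheet()
sheet.set_value("A1", 100)  # revenue (invented figure)
sheet.set_value("A2", 60)   # costs (invented figure)
sheet.set_formula("A3", lambda s: s.get("A1") - s.get("A2"))  # profit

print(sheet.get("A3"))  # 40
sheet.set_value("A2", 75)   # a "what if" change to one variable
print(sheet.get("A3"))  # 25, recomputed without manual arithmetic
```

On paper, the second answer would require erasing and redoing the arithmetic by hand; here the dependent cell simply reads the new input, which is exactly the "obligated to run the numbers" convenience Levy describes.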
Levy would further observe that “computer programs are said to use different ‘metaphors’ to organize their tasks; a program might use the metaphor of a Rolodex, or a file cabinet.” That is to say, the design of digital infrastructure relies on existing analog systems or inventions, and its primary function is to streamline them: computers were introduced to break limits precisely where manpower approached a bottleneck.
Railroads and People-Programming
To ground the difference between “evolutionary” and “revolutionary” in a historical example of larger scale, the railroad provides an interesting case study. To some extent, people were already tasked with assignments that computers were later invented to do. For the railroad companies, adopting computers therefore entailed a simple substitution to improve efficiency, hence an “evolutionary” process rather than a “revolutionary” one.
The 1840s marked a time when there was a “crisis of safety on the railroads,” and the construction of larger systems was actually delayed “because [the companies] lacked the means to control them.” This led to a “stream of innovations in information processing, bureaucratic control, and communications.” A series of collisions during this period were blamed on failures of communication and precise programming: there was no way of knowing where trains were throughout the day, in addition to massive delays against the original schedules. The centralization of control began with conductors, who were seen as possibly “the first persons in history to be used as programmable, distributed decision makers in the control of fast-moving flows through a system.” The conductor’s role was essentially to respond to unforeseen events and manage operations between stops. He was hence “programmable” because, for the first time, a human being was used “not for their strength or agility, nor for their knowledge or intelligence, but for the more objective capacity for their brains to store and process information,” a central role that computers would later play in the development of the information age.
Perhaps this was precisely why, as soon as the telegraph was popularized and modern telecommunication infrastructures were available, computers were so quick to replace many of these functions on trains, provided that a centralized computer system or network was also in place. A separate example of the same pattern is the use of networked access to manage airline seat reservation systems and allow for faster bookings in the modern day.
Back on the subject of railroads, the administrative system underwent further reorganizations, which established a “solid line of authority and command” with a head office and regional offices, giving birth to the first “American business enterprise.” The catch with this new “network” of communication was that significantly more data was produced than the head office could reasonably process, so a new solution was necessary. This was when the concept of “pre-processing” emerged: each stage of reporting stripped the data down to the essential minimum, ultimately allowing the head office to make informed decisions based on less data. Coincidentally, or perhaps not, this is also an important function that computers serve: finding patterns in noise. With the possibility of collecting endless data in endless depth and detail, a solution was needed to make reasonable generalizations that would inform decision making rather than complicate it. This is once again an example of computer culture imitating pre-existing practices to enhance the process.
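The pre-processing idea can be sketched as a two-stage reduction: regional offices collapse raw records into short summaries, and only the summaries travel up to the head office. The record categories and figures below are invented for illustration, a toy model of the reporting hierarchy rather than any actual railroad ledger.

```python
# Toy sketch of hierarchical "pre-processing": each layer reduces
# raw records to totals so the head office sees less, not more.
from collections import defaultdict

def summarize_region(records):
    """Collapse a region's raw (category, amount) records into totals."""
    totals = defaultdict(int)
    for category, amount in records:
        totals[category] += amount
    return dict(totals)

def head_office_view(regional_summaries):
    """Combine the already-reduced summaries from every region."""
    combined = defaultdict(int)
    for summary in regional_summaries:
        for category, amount in summary.items():
            combined[category] += amount
    return dict(combined)

# Invented raw records from two regional offices.
east = summarize_region([("freight", 120), ("passengers", 300), ("freight", 80)])
west = summarize_region([("freight", 200), ("passengers", 150)])
report = head_office_view([east, west])
print(report)  # {'freight': 400, 'passengers': 450}
```

The head office never sees the five raw records, only two small summaries, which is the same generalize-before-reporting discipline the railroad bureaucracy invented long before computers automated it.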
The Census Before & After Computers
In roughly the same period, the US Census was established, and it provides a similarly interesting perspective because it existed before the introduction of computing but adapted well to the technology afterward. Although computer technology drastically increased the effectiveness of the census and allowed more valuable information to be collected for the country, its purpose yet again remains the same: to understand the country’s demographics better, perhaps in more depth with the aid of technology. The Census was crucial because it was needed to ensure fair and even representation in Congress, and an accurate knowledge of the population is essential to the processes of democracy. In a quickly developing and expanding nation like the US of the 1790s, some form of the Census was inevitable.
As alluded to, the first US Census took place in 1790 and included only five broad categories: free White males of 16 years and upward, free White males under 16 years, free White females, all other free persons, and slaves. As the years progressed, more age categories and economic data were gradually introduced based on the need for such information in policy making. As Margo J. Anderson explains in The American Census: A Social History, the Census staff worked hard to develop “new classification systems, [increase] the amount of detail in the published data, and [codify] rules for data analysis” so that a more accurate picture of the country could be delivered to Congress to assist it in making policies. But precisely for this reason, the slow processing of information led to the census office eventually being disbanded and reorganized. A turning point came in 1950, when the UNIVAC I was installed in place of the older Hollerith tabulating machines. This modern computer was used immediately to tally the 1950 population census and the 1954 economic census. According to the Census Bureau, the UNIVAC “excelled at working with the repetitive but intricate mathematics involved in weighting and sampling for these surveys.” This drastically cut the processing time it took to fully account for the ever-increasing volume of data collected from Americans in service of the processes of democracy. It is also worth noting that the UNIVAC slotted into just the processing phase of the Census, hardly impacting the methodology other than pushing the limits of what was possible.
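The weighting arithmetic the Bureau alludes to can be illustrated with a toy calculation, in which each sampled household stands in for many like it; the figures below are invented and greatly simplified relative to any actual census methodology.

```python
# Toy illustration of sample weighting: repetitive multiply-and-sum
# arithmetic of exactly the kind the UNIVAC excelled at.

def weighted_total(sample):
    """sample: list of (households_represented, persons_in_household)."""
    return sum(weight * persons for weight, persons in sample)

# Three invented sampled households, each standing in for 1,000 like it.
sample = [(1000, 4), (1000, 2), (1000, 5)]
estimate = weighted_total(sample)
print(estimate)  # 11000 people estimated from three records
```

The arithmetic itself is trivial; the point is its repetition across millions of records, which is precisely the phase of the census the UNIVAC slotted into without changing the surrounding methodology.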
The point of this case study is to show that even in these extreme circumstances, where state-of-the-art technology was implemented for complex data processing, the technology merely simplified and expedited a process that involved, and could still involve, human components. Computers have been used most effectively in the history of technology when they integrate seamlessly into analog and human systems and processes rather than reinventing them. Just like the railway system, the Census supports the conclusion that the integration of computers into government, corporate, and personal life is more evolutionary.
Ultimately, the purpose of this essay was not to undermine the work that computers have done for society, nor the efforts that countless computer engineers have put into shaping the information-dependent world we live in today. Rather, it was to show that while it may sound better to label a new movement a “revolution,” perhaps to sensationalize the phenomenon, there are fundamental distinctions to recognize when discussing the history of data in an academic context.
All things considered, the use of computers in our society would have to reshape and reinvent existing systems and practices in order to classify as a “revolution,” which implies pivotal change. Although new inventions have broadened our horizons as to what may be possible in the future, with improved 5G networks, artificial intelligence capabilities, and blockchain technology, to name a few, they have yet to pivotally reshape society. At least thus far, in 2020 as this paper is being written, computers have mostly supplemented existing systems to make processes more efficient and effective. The term “evolution” suggests not merely a peaceful co-existence built upon existing foundations but also a process that results in a better version of the original, which is certainly true in the case of the railroad companies, where lives have been saved, and of the census, which has improved our country’s democracy. Computers achieved this result by imitating analog practices and integrating themselves into our pre-existing systems, which could still function, at a lesser capacity, without them. Computers merely automate tasks that would have been tedious or greatly time-consuming for a human.
That said, there is no reason why the computer “evolution” could not become a “revolution” in the near future. As Marshall McLuhan put it, “the medium is the message.” New technologies coupled with new methods of statistical analysis truly have the capacity to create a positive feedback loop in which computers play a monumental role in constructing a new America, or a new way of living for its people.