‘Raw’ Data Does Not Exist

Data Base. Thomas Hawk / Flickr

Our times are often called the era of big data, and being data-driven has become a requirement in business management, public healthcare, public administration and other fields. Yet anthropological research points to a paradox: data is never simply ‘given’. In other words, ‘raw’ information does not exist.

The Art of Persuasion

The history of the notion of ‘data’ suggests that it is usually invoked for rhetorical purposes. On the one hand, data is what precedes an argument and forms the rhetorical framework for it. On the other hand, the mere availability of certain information can itself lead to a particular conclusion. Unlike facts, data can be ‘good’ or ‘bad’, ‘incomplete’ or ‘excessive’. Nevertheless, simply appealing to data is often enough to claim unquestioned objectivity. In other words, data can replace the argument, and today this happens more and more often.

The rhetorical force of appeals to ‘data’ is also supported by the idea that data is abstract, which allows pieces of information to be used out of context and independently of the equipment used to record and store them. In the words of the media researcher Lev Manovich, data does not just ‘exist’ waiting to be analysed; it first has to be ‘generated’. To do this, one has to ‘sort it out’ and present some aspect of reality as information. That takes more than the creative imagination of a talented researcher or administrator: it requires institutional and physical infrastructure.

The power of modern states rested on ‘big data’ long before big data became mainstream. The first step that European states of the modern age took towards governing their national economies and populations was to survey the natural resources in their territories and the wealth of their inhabitants. To do so, they had to create a unified system of weights and measures, map mineral deposits and woodlands, and introduce permanent family names for their subjects; in other words, to convert them into ‘data’.

How Standardization Creates Information

The metaphor of data as ‘raw material’, which is used to describe the business models of ‘platform’ companies such as Facebook, Amazon and Uber, does not hold even for natural resources, as the case of Prussian forestry in the late 18th century shows. To optimize timber production, officials invented forestry as a discipline for analysing and standardizing the economic characteristics of trees. Forest guards gathered data on the size, volume and age of trees, taking into account the normal cycle of their growth and maturation; the data was then aggregated in tables used to calculate income from future harvests. However, these tables left out information on the diversity of tree species and on the symbiotic relationships between trees, insects and animals. Thus, before the yield could be increased, the forest first had to be represented as a set of information. In this sense, the data on the forest was already not ‘raw’: it was generated through standardization using the tools of financial mathematics.

Another example is futures contracts, which provide a means of insurance against unforeseen changes in demand or fluctuations in the price of a product. According to historians, futures trading requires, first of all, extremely intensive work on standardizing the assets to be exchanged and the units in which they are measured. In 19th-century America, for example, much of the futures trading was connected with grain supplies, and because the deals concerned crops that had not yet been harvested, it was impossible to assess the quality of the crops directly; that is exactly why standards played a key role. When grain was transported in wagons and stored in grain elevators, it was mixed until fully homogeneous, which ‘averaged’ its quality. In this way, grain measured in ‘bags’, which had been grown by a particular farmer in particular environmental conditions, lost its link to the conditions of its production. From that moment on, players in the market could deal with aggregated categories of grain, to which unified quality standards could be applied. Responsibility for these standards was assigned to a committee of inspectors authorized by the state government, who also checked the scales on which the ‘units’ of grain (bushels) were measured.

Although a futures contract does not necessarily imply the physical delivery of goods, the standardization of grain was a condition of its execution. Buying grain from a single farmer means accepting the risks relating to the quality of the goods (deviation from the standard makes different kinds of grain incommensurable), manipulation of the measurement units (different farmers have bags of different sizes), and the need to enter into negotiations to cancel or change the conditions of supply. If there are standard categories of quality and standard units of measurement, contracts can be concluded even if the physical grain stays where it is, that is, on the basis of ‘data’ about its quality and quantity.

Risks Posed by Standardization

In short, data cannot be ‘raw’, since obtaining it requires standardization: some aspects of reality are selected (and transformed into data) while others are excluded. It is therefore reasonable to ask for what purpose the data was collected and what was left out.

Sometimes standardization has negative consequences. Prussian foresters, for example, developed the concept of a standardized tree, the ‘Normalbaum’, defined by the volume of saleable wood of a given species. When the time came to replant deforested areas, this idea guided the planning of the new plantings. The resulting ‘normal forest’ turned into an ecological catastrophe because it undermined the natural functioning of the woodland ecosystem, which requires trees of various species, deadfall and dead wood. A ‘normal forest’ is simpler to control because it is easier to gather information about it. In the longer term, however, changes in the ecosystem and in the complex balance among trees, animals and insects reduced growth rates.

The source of risk here is not only the uncertainty and incompleteness of data, but also the feedback between data and the reality from which it is ‘extracted’, whether that is Prussian forests, American grain or information about the pages that users of a social network like. The desire to control on the basis of data can end with nothing but the data itself being controlled. Precisely because no ‘raw’ data exists, it is important to take into account the work that goes into creating and extracting it: radically simplifying the object of analysis or control can often transform it beyond recognition.