State statistics in Ukraine operate under a law adopted in 1992. Most people who deal with the State Statistics Service (Derzhstat) know that the agency receives a large number of reports from respondents. However, data are available only in aggregate form, i.e. for the country as a whole, a region and, sometimes, a district or settlement. With rare exceptions, Derzhstat does not provide the detailed record-level data sets obtained from observations (microdata).
The main reason for refusals to provide data is the protection of respondents’ personal data. At the same time, microdata is illegally “sold” in book markets or via the Internet without any protection of confidentiality. VoxUkraine took a look at how this issue is handled by EU statistical offices and what needs to be done to make official statistics available for research.
Researchers’ lack of access to official statistics microdata has gone unresolved for years. Until 2012, Derzhstat did provide microdata at users’ request. However, those procedures did not meet the requirements of personal data protection legislation, so access to data has been limited since 2013. In 2017, Derzhstat approved a methodology for ensuring confidentiality that effectively prohibits the agency from providing microdata to researchers until it has developed all the necessary rules and terms. Those rules have not been implemented yet. In 2019, Derzhstat published microdata files from its labor market and household living conditions surveys on its website. The list of indicators in these files is very limited, and much information that is important for research remains unavailable.
The main task of statistical bodies in disseminating microdata is to protect personal information. Anyone participating in surveys by Derzhstat needs to be sure that the agency will not disclose their personal information. Similarly, Derzhstat should not disclose information about companies reporting to it.
To maintain confidentiality, the data provided by Derzhstat has to be depersonalized. That is, data sets should not contain the names of people or companies. Without such information, it is not possible to identify an individual or an entity directly. However, data sets may contain other indicators that help identify a person indirectly, for example the person’s address. Similarly, it is sometimes enough to know a company’s type of activity and its location to identify it. Therefore, when Derzhstat disseminates microdata, it must ensure that respondents cannot be identified either directly or indirectly.
There is currently no single worldwide approach to disseminating microdata: each country decides at the national level what kind of information it can disseminate, under what terms and to what extent. However, in most developed countries, statistical bodies provide researchers with data in one form or another. In the EU, the rules for disseminating microdata are based on Commission Regulation (EU) No 557/2013 of 17 June 2013, which implements Regulation (EC) No 223/2009 of the European Parliament and of the Council as regards access to confidential data for scientific purposes.
According to the regulations, EU statistical offices use the following approaches:
- Statistical offices process data sets in such a way that it is not possible to identify respondents. The processed data set is then placed on a website, and all users have access to it. In countries with personal data protection legislation similar to that of Ukraine, these data sets usually contain a small number of indicators.
Slovenia, Estonia, Latvia, and Italy publish the results of household surveys on income and expenses, living conditions, health, labor market research, etc. Information on legal entities is not disseminated under this approach.
- Statistical offices provide microdata at the request of a research entity. They process the data set in such a way that it is not possible to identify the respondent. Users pay extra for such processing.
Countries may impose additional requirements on organizations. For instance, in the Netherlands, information is provided to domestic organizations, and to foreign entities only through cooperation with Dutch organizations.
- Researchers from accredited research institutions are allowed to work with data in a specially equipped room of the statistical office. Providing access points on the premises of a statistical service makes it possible to give researchers the most detailed and “sensitive” data with minimal risk, since researchers work under surveillance. Researchers can only analyze the data: saving it to external media, sending it by e-mail, printing it, etc. is not possible. Before researchers can obtain the results of their research and permission to publish them, a statistical office employee checks the results for confidentiality.
Small countries such as Slovenia and Latvia have one access center each. Estonia provides this service in three cities. The fee is charged only if users request preliminary preparation of data.
- Researchers are given the opportunity to process data remotely. They receive a detailed description of the data set, write an analysis script and send it to the statistical office. The employees there process microdata according to the script and send the results back to the scholar. For convenience, sometimes the statistical office provides researchers with a “fake” data set retaining the original structure so that they can test their algorithm.
- Researchers are given remote access so that they can work directly with microdata from their work computers. As with a physical access point, researchers cannot save the microdata in any way, and they receive the results of their own work only after verification by the statistical office. Remote access is possible thanks to VPN (virtual private network) technology, which allows for a secure connection between the data server and the user’s computer.
Data on enterprises are provided only in this way in Slovenia, Estonia and Latvia. In the Netherlands, you have to pay for remote access to data, and Sweden provides remote access even from a mobile phone.
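The “fake” data set mentioned for remote processing can be illustrated with a short sketch. The Python example below generates random rows that keep the structure of a survey file (column names, types, value ranges) without containing any real observations, and then runs a sample analysis script against them. The schema, the column names and the analysis function are hypothetical and chosen only for illustration; they do not describe any statistical office’s actual files or tools.

```python
import random

# Hypothetical schema for illustration: column name -> (kind, range or allowed values).
SCHEMA = {
    "age": ("int", (15, 90)),
    "region": ("category", ["North", "South", "East", "West"]),
    "monthly_income": ("float", (0.0, 50000.0)),
}


def make_dummy_rows(schema, n_rows, seed=0):
    """Generate random rows that follow the schema but contain no real data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, (kind, spec) in schema.items():
            if kind == "int":
                row[name] = rng.randint(spec[0], spec[1])
            elif kind == "float":
                row[name] = round(rng.uniform(spec[0], spec[1]), 2)
            elif kind == "category":
                row[name] = rng.choice(spec)
            else:
                raise ValueError(f"unknown column kind: {kind}")
        rows.append(row)
    return rows


def mean_income_by_region(rows):
    """Example of an analysis script a researcher might test before submitting it."""
    totals = {}
    for row in rows:
        total, count = totals.get(row["region"], (0.0, 0))
        totals[row["region"]] = (total + row["monthly_income"], count + 1)
    return {region: total / count for region, (total, count) in totals.items()}


dummy = make_dummy_rows(SCHEMA, n_rows=100)
result = mean_income_by_region(dummy)
```

The numbers produced on the dummy file are meaningless by design. The point is only that the script runs without errors on data with the same structure before the statistical office executes it on the real microdata.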
To implement such approaches, statistical offices need to put a number of procedures in place.
Anonymization of microdata. Direct identifiers – anything that directly identifies a legal entity or an individual, such as the name, address or taxpayer number – are removed from the data set. The following procedures can be used to prevent the user from indirectly identifying the respondent:
- aggregating data into interval series. For example, an age group can be indicated instead of specifying a person’s age;
- replacing values that are significantly larger or smaller than the others with the average value of the variable for the group;
- removing individual values from the data array.
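The three procedures above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any statistical office: the records are made up, the outlier rule (a value more than three times above or below the group median) is an arbitrary choice, the replacement value is the mean of the group’s remaining values, and the suppression rule (hide a category that occurs fewer than two times) is likewise hypothetical.

```python
from collections import Counter
from statistics import mean, median

# Made-up records for illustration only.
records = [
    {"age": 23, "sector": "retail", "income": 900},
    {"age": 37, "sector": "retail", "income": 1100},
    {"age": 41, "sector": "retail", "income": 25000},  # unusually large value
    {"age": 58, "sector": "farming", "income": 700},
    {"age": 64, "sector": "farming", "income": 800},
    {"age": 52, "sector": "mining", "income": 1500},   # only record in its sector
]


def age_group(age, width=10):
    """Replace an exact age with an interval such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def replace_outliers(rows, key, group_key, factor=3.0):
    """Replace values far from the group median with the mean of the other values.

    The 'factor' threshold is an assumption made for this sketch.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[key])
    result = []
    for row in rows:
        values = groups[row[group_key]]
        mid = median(values)
        low, high = mid / factor, mid * factor
        inliers = [v for v in values if low <= v <= high] or values
        new_row = dict(row)
        if not low <= row[key] <= high:
            new_row[key] = mean(inliers)
        result.append(new_row)
    return result


def suppress_rare(rows, key, min_count=2):
    """Remove (set to None) values of 'key' that occur fewer than min_count times."""
    counts = Counter(row[key] for row in rows)
    return [
        {**row, key: row[key] if counts[row[key]] >= min_count else None}
        for row in rows
    ]


step1 = replace_outliers(records, "income", "sector")
step2 = [{**row, "age": age_group(row["age"])} for row in step1]
anonymized = suppress_rare(step2, "sector")
```

In this toy example the income of 25,000 is replaced by the mean of the other retail incomes, exact ages become ten-year intervals, and the single “mining” value is removed because it would point to one respondent.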
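To formally estimate the risk of identifying a respondent, statistical offices often use metrics based on k-anonymity (a data set is k-anonymous if every combination of indirectly identifying attributes is shared by at least k records) and calculate them using the R package sdcMicro.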
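To make the idea of k-anonymity concrete, here is a small Python sketch. It is not sdcMicro and does not reproduce that package’s interface; the function names, the sample rows and the choice of age group and region as indirectly identifying attributes are assumptions made for illustration.

```python
from collections import Counter


def class_sizes(rows, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of indistinguishable records."""
    sizes = class_sizes(rows, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def records_violating(rows, quasi_identifiers, k):
    """Return the records whose combination of values is shared by fewer than k records."""
    sizes = class_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[q] for q in quasi_identifiers)] < k
    ]


# Made-up rows for illustration only.
rows = [
    {"age_group": "20-29", "region": "North", "income": 900},
    {"age_group": "20-29", "region": "North", "income": 1100},
    {"age_group": "30-39", "region": "North", "income": 1000},
    {"age_group": "30-39", "region": "North", "income": 1300},
    {"age_group": "60-69", "region": "South", "income": 800},
]

quasi = ["age_group", "region"]
k = k_anonymity(rows, quasi)
risky = records_violating(rows, quasi, k=2)
```

Here the single record from the South in the 60-69 age group is unique, so the file as a whole is only 1-anonymous. Such a record would need further aggregation or removal before the file could be released.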
Accrediting research entities and projects. To find out whether a particular organization can be trusted with confidential data and whether it will actually use the data for scientific purposes, EU states conduct a special accreditation of research organizations. The accreditation takes into account the organization’s purpose, reputation, structure and independence, as well as the infrastructure it has available to ensure data security.
Typically, EU countries simultaneously use several microdata dissemination channels to meet the interests of different audiences.
For instance, in Estonia, depending on the level of confidentiality, there are the following ways of disseminating microdata:
- public access files available on Statistics Estonia’s website;
- files with a low level of confidentiality offered by remote access or in special access points;
- files with a high level of confidentiality available only in special access points.
Statistics Estonia can also provide information processed specifically according to the requirements of the research organization. Such data is available at an additional cost: 100 euros for the first request and 50 euros for each subsequent request within one year. Breach of data confidentiality is punishable by a fine of 800 euros if committed by individuals and 3,200 euros if committed by legal persons (Statistics Estonia’s experience in providing national and trans-border access to micro-data, 2013, p.5).
Estonia’s experience has shown that providing data generates demand for this service. On average, Statistics Estonia receives about 25 requests annually, more than half of which are requests for information through remote access (as of 2013).
In Italy, the range of microdata products is currently wide, from free data sets available on request to specialized data processing. The network of access points allows researchers to analyze any microdata owned by ISTAT in any of the 18 regional offices throughout Italy. An additional network of servers allows researchers to access the original confidential data through a secure channel.
ISTAT also publishes public, mainly educational, files containing a small number of variables, basic information, few observations and a simplified data structure. If we want to increase the statistical literacy of the population in general and of students in particular, college and school students must learn how to use real survey data, process “raw” data and, most importantly, obtain knowledge based on data.
Statistics Netherlands publishes files for general use and offers numerous other data sets for research. Researchers access the data through a secure Internet connection. Although the data itself is free, access is provided through paid services. In total, creating a safe environment for researchers costs about 2 million euros, of which only 700,000 euros are funded from the state budget. The remaining 1.3 million euros or so, i.e. most of the cost, is paid by the users of the service. One study costs an average of 2,000 euros.
In its policy on disseminating microdata, Derzhstat must decide which approach to use. VoxUkraine conducted 28 interviews with users and employees of the State Statistics Service to assess the agency’s needs and real capacity. The results suggest that it would be reasonable to organize the dissemination of microdata as a combination of several approaches:
- publishing on the official website publicly available files that make it impossible to identify the respondent directly or indirectly;
- providing microdata on the basis of a written request to organizations conducting data analysis as part of research projects and covering Derzhstat’s data anonymization costs;
- ensuring access for accredited entities to depersonalized data in Derzhstat’s premises;
- providing remote access in the medium term.
This publication was prepared with support from the UNDP project “Civil Society for the Development of Democracy and Human Rights in Ukraine” implemented with the financial support of the Ministry of Foreign Affairs of Denmark.
Any opinions, conclusions or recommendations expressed are those of the authors or editors of the publication and do not necessarily reflect the views of the Ministry of Foreign Affairs of Denmark, the United Nations Development Programme or other UN agencies. The materials of the publication are protected by copyright.
However, the United Nations Development Programme in Ukraine encourages the dissemination of this information for non-commercial purposes.
The author doesn’t work for, consult to, own shares in or receive funding from any company or organization that would benefit from this article, and has no relevant affiliations.