An overview of methods for treating selectivity in big data sources

Maciej Beresewicz, Risto Tapio Lehtonen, Fernando Reis, Loredana Di Consiglio, Martin Karlberg

Research output: Book/Report - Research report

Description

Official statistics is now seriously considering big data as a significant data source for producing statistics. It holds the potential for providing faster, cheaper, more detailed and completely new types of statistics. However, the use of big data also brings several challenges. One of them is the non-probabilistic character of most big data sources, as very often they were not designed to produce statistics. The resulting selectivity bias is therefore a major concern when using big data. This paper presents a statistical approach to big data, searching for a definition that is meaningful from the statistical point of view and identifying its main statistical characteristics. It then argues that big data sources share many characteristics with Internet opt-in panel surveys and proposes this as a reference for addressing selectivity and coverage problems in big data. Coverage and the self-selection process are briefly discussed for mobile network data, Twitter, Google Trends and Wikipedia page view data. An overview of methods which can be used to address selectivity and eliminate, or mitigate, bias is then presented, covering both methods applied at individual level, i.e. at the level of the statistical unit, and at domain level, i.e. at the level of the produced statistics. Finally, the applicability of the methods to the several big data sources is briefly discussed and a framework for adjusting for selectivity in big data is proposed.
Original language: English
Place of publication: Luxembourg
Publisher: European Commission
Number of pages: 111
ISBN (electronic): 978-92-79-88769-7
DOI (permanent link): https://doi.org/10.2785/312232
Status: Published - 5 July 2018
Ministry of Education and Culture (OKM) publication type: D4 Published development or research report or study

Publication series

Name: Statistical working papers
Publisher: Eurostat
ISSN (electronic): 2315-0807

Fields of science

  • 112 Statistics and probability

Cite this

Beresewicz, M., Lehtonen, R. T., Reis, F., Di Consiglio, L., & Karlberg, M. (2018). An overview of methods for treating selectivity in big data sources. (Statistical working papers). Luxembourg: European Commission. https://doi.org/10.2785/312232
Beresewicz, Maciej ; Lehtonen, Risto Tapio ; Reis, Fernando ; Di Consiglio, Loredana ; Karlberg, Martin. / An overview of methods for treating selectivity in big data sources. Luxembourg : European Commission, 2018. 111 p. (Statistical working papers).
@book{9630c77e41ab46ec99316da3a6ab6a3b,
title = "An overview of methods for treating selectivity in big data sources",
abstract = "Official statistics is now considering seriously big data as a significant data source for producing statistics. It holds the potential for providing faster, cheaper, more detailed and completely new types of statistics. However, the use of big data brings also several challenges. One of them is the non-probabilistic character of most sources of big data, as very often they were not designed to produce statistics. The resulting selectivity bias is therefore a major concern when using big data. This paper presents a statistical approach to big data, searching for a definition meaningful from the statistical point of view and identifying its main statistical characteristics. It then argues that big data sources share many characteristics with Internet opt-in panel surveys and proposes this as a reference to address selectivity and coverage problems in big data. Coverage and the self-selection process are briefly discussed in mobile network data, Twitter, Google Trends and Wikipedia page views data. An overview of methods which can be used to address selectivity and eliminate, or mitigate, bias is then presented, covering both methods applied at individual level, i.e. at the level of the statistical unit, and at domain level, i.e. at the level of the produced statistics. Finally, the applicability of the methods to the several big data sources is briefly discussed and a framework for adjusting selectivity in big data is proposed.",
keywords = "112 Statistics and probability",
author = "Maciej Beresewicz and Lehtonen, {Risto Tapio} and Fernando Reis and {Di Consiglio}, Loredana and Martin Karlberg",
year = "2018",
month = "7",
day = "5",
doi = "10.2785/312232",
language = "English",
series = "Statistical working papers",
publisher = "European Commission",
address = "Luxembourg",

}

Beresewicz, M, Lehtonen, RT, Reis, F, Di Consiglio, L & Karlberg, M 2018, An overview of methods for treating selectivity in big data sources. Statistical working papers, European Commission, Luxembourg. https://doi.org/10.2785/312232

An overview of methods for treating selectivity in big data sources. / Beresewicz, Maciej; Lehtonen, Risto Tapio; Reis, Fernando; Di Consiglio, Loredana; Karlberg, Martin.

Luxembourg : European Commission, 2018. 111 p. (Statistical working papers).

TY - BOOK

T1 - An overview of methods for treating selectivity in big data sources

AU - Beresewicz, Maciej

AU - Lehtonen, Risto Tapio

AU - Reis, Fernando

AU - Di Consiglio, Loredana

AU - Karlberg, Martin

PY - 2018/7/5

Y1 - 2018/7/5

N2 - Official statistics is now considering seriously big data as a significant data source for producing statistics. It holds the potential for providing faster, cheaper, more detailed and completely new types of statistics. However, the use of big data brings also several challenges. One of them is the non-probabilistic character of most sources of big data, as very often they were not designed to produce statistics. The resulting selectivity bias is therefore a major concern when using big data. This paper presents a statistical approach to big data, searching for a definition meaningful from the statistical point of view and identifying its main statistical characteristics. It then argues that big data sources share many characteristics with Internet opt-in panel surveys and proposes this as a reference to address selectivity and coverage problems in big data. Coverage and the self-selection process are briefly discussed in mobile network data, Twitter, Google Trends and Wikipedia page views data. An overview of methods which can be used to address selectivity and eliminate, or mitigate, bias is then presented, covering both methods applied at individual level, i.e. at the level of the statistical unit, and at domain level, i.e. at the level of the produced statistics. Finally, the applicability of the methods to the several big data sources is briefly discussed and a framework for adjusting selectivity in big data is proposed.

KW - 112 Statistics and probability

U2 - 10.2785/312232

DO - 10.2785/312232

M3 - Commissioned report

T3 - Statistical working papers

BT - An overview of methods for treating selectivity in big data sources

PB - European Commission

CY - Luxembourg

ER -

Beresewicz M, Lehtonen RT, Reis F, Di Consiglio L, Karlberg M. An overview of methods for treating selectivity in big data sources. Luxembourg: European Commission, 2018. 111 p. (Statistical working papers). https://doi.org/10.2785/312232