[Figure: V-model of Big Data. Labels: terabytes (TB) to petabytes (PB) of data; 900 gibibit/year (GiB/a); various data sources and types; MB 0.99]

Big Data

The landscape of vascular research and quality improvement has changed remarkably over the decades. For generations of physicians and researchers, randomized controlled trials (RCTs) remained the only way to validly investigate causal relationships between interventions and outcomes. In 2008, a special issue of Nature titled Big Data – Science in the Petabyte Era introduced this new catchphrase to the bioscientific audience, heralding the start of a new age in medical research. Use of this inconsistently defined concept has increased considerably since then, but to date there is no consensus on a definition of Big Data. Several authors have developed so-called “V-models” as individual concepts to define the characteristics of Big Data in general. First, Big Data is characterized by its increasing Volume, usually starting at terabytes to petabytes. Second, the Velocity of data harvesting and processing is increasing. Third, the Variety of data types ranges from unstructured data (e.g. free text or imaging) to (semi-)structured data (e.g. tables, relational databases). Other V-concepts also include Variability (contextual quality), Veracity (trustworthiness), and Validity of the data. The last-mentioned characteristics in particular are controversial when it comes to real-world data research.

“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” (Gartner IT Glossary)

Data Privacy

In light of Big Data applications in modern medicine, the various forms of registries are becoming increasingly important. Since Sweeney (2002) introduced the term “k-anonymity” as a model for protecting privacy in real-world data systems, the importance of this aspect has increased rapidly. Linking growing data sources potentially allows single individuals to be re-identified. To meet the changing requirements in the field of digital healthcare, the European Commission proposed a comprehensive reform of data protection rules in the European Union (EU). After a transition phase, the new regulations will apply from 25 May 2018 and then replace the existing Federal Data Protection Act. Dealing conscientiously with this subject is of utmost importance before implementing registry-based projects in medical research or quality improvement. To protect patient rights in times of technical progress and increasing amounts of data, appropriate data protection strategies are needed. The reform of the data protection legal framework aims to address these aspects and to harmonize data privacy across the EU through a total of 99 articles and 173 recitals. Local Data Protection Authorities will monitor compliance. Fines of up to 20 million euros or 4% of global annual turnover, whichever is higher, mean a significant increase in the cost of non-compliance.
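To make the k-anonymity idea concrete, the following is a minimal sketch in Python. It assumes a data set is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The records, the field names (`birth_year`, `zip3`, `sex`, `diagnosis`) and the helper functions are hypothetical and invented for illustration only; they do not come from any real registry.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest quasi-identifier group (0 for an empty data set)."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def violating_groups(records, quasi_identifiers, k):
    """Return the quasi-identifier combinations shared by fewer than k records."""
    sizes = group_sizes(records, quasi_identifiers)
    return {combo: n for combo, n in sizes.items() if n < k}


# Hypothetical, made-up rows (not real patient data).
records = [
    {"birth_year": 1950, "zip3": "201", "sex": "F", "diagnosis": "PAD"},
    {"birth_year": 1950, "zip3": "201", "sex": "F", "diagnosis": "AAA"},
    {"birth_year": 1962, "zip3": "223", "sex": "M", "diagnosis": "PAD"},
    {"birth_year": 1962, "zip3": "223", "sex": "M", "diagnosis": "stroke"},
    {"birth_year": 1971, "zip3": "201", "sex": "M", "diagnosis": "PAD"},
]

quasi = ["birth_year", "zip3", "sex"]
print(k_anonymity(records, quasi))          # smallest group size
print(violating_groups(records, quasi, 2))  # groups that break 2-anonymity
```

In this toy example the last record is the only one with its combination of birth year, postcode prefix and sex, so the data set is only 1-anonymous. Linking it with a second source that contains the same three attributes could single that person out. Generalizing or suppressing quasi-identifiers until every group reaches the chosen k is one common response.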

Registry and Claims Data

In addition to registry-based research, another data source is receiving increasing attention. Health insurance claims data and other administrative data sources could serve as a valuable supplement to primary research. The importance of administrative data for research and quality improvement will continue to increase in the future. When discussing the internal and external validity of this data source, one has to distinguish not only between its intended uses (research vs. quality improvement), but also between the included diseases and/or treatment procedures. Data validity depends largely on the clinical relevance of the diagnosis: major complications such as myocardial infarction and stroke demonstrate a high level of validity. The lack of standards for objectively assessing the validity of administrative data further complicates the matter. Nevertheless, when used with conscientious consideration of the above-mentioned limitations and after fulfilling certain predefined requirements, such data sets can serve as valuable information sources.

  • Pseudonymized personal data: estimates of incidence and
    prevalence, long-term accounts of disease progress and care
    history, degree of health care utilization
  • Population-oriented: comparison of the general population with
    the insured population, including socio-demographic data
  • Precise extrapolation possible due to the representativeness of
    the sample cohort
  • Non-biased involvement of all insured persons in the sample: no
    self-selection by researchers, inclusion of less accessible groups
  • Completeness from routine acquisition (external validity)
  • Cross-sector detection (intersectorality)
  • Low cost of collection and use
  • Long periods of observation
  • Large observation groups
  • Mass data: detectability of rare events or risks
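
As a rough illustration of the first point in the list above, the sketch below shows how prevalence and an incidence rate might be estimated from pseudonymized claims counts. It uses the standard epidemiological definitions (prevalence = existing cases / insured population; incidence rate = new cases / person-years at risk). All numbers and function names are hypothetical and chosen only for demonstration.

```python
def prevalence(existing_cases, insured_population):
    """Proportion of the insured population with the condition in the period."""
    return existing_cases / insured_population


def incidence_rate(new_cases, person_years_at_risk):
    """New cases per person-year of observation among those at risk."""
    return new_cases / person_years_at_risk


# Hypothetical counts, not taken from any real claims data set.
insured_population = 200_000
existing_cases = 6_000
new_cases = 450
person_years_at_risk = 180_000

print(f"Prevalence: {prevalence(existing_cases, insured_population):.1%}")
print(f"Incidence: {incidence_rate(new_cases, person_years_at_risk) * 1000:.1f} per 1,000 person-years")
```

Whether such estimates can be extrapolated to the general population depends on how representative the insured cohort is, as the list notes.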