Now that the General Data Protection Regulation (GDPR) is in full effect and impacting businesses in the European Union and beyond, knowing where your data streams come from is very important. Scraping data and content from across the web has become commonplace, but in a world where you need the consent of users before you acquire any of their personal data, buying or selling scraped data becomes a costly affair, going well beyond the direct costs and potentially incurring fines of up to 4% of annual global revenue (or €20 million, whichever is higher) when caught by regulators.
We come across a number of opportunities to scrape or obtain data by alternative means, but we have worked hard to identify the original source of the data we work with, and always make sure we develop relationships with the API providers we are profiling and targeting as part of the Streamdata.io API Gallery. This extra effort will pay off down the road when it comes to empowering our customers to subscribe to the data and content streams we’ve identified. Streamdata.io customers will know that we’ve done our homework when it comes to the provenance of the data streams we are providing them and that they can count on the quality of the topical streams we are helping them discover.
One aspect of this data stream provenance we are working on is making sure there is a machine-readable record of it. It will take some work to achieve this goal, but we’d like to see a data provenance collection added to the APIs.json for each of the entities we are profiling. This would extend the machine-readable index for each set of APIs we add to the gallery, allowing provenance to be evaluated at run-time. It would also allow providers to avoid or filter out APIs and entities that have not supplied the required background on the source of their data. This will be critical to doing all of this at scale, when streaming not just a handful of APIs, but potentially hundreds or thousands of APIs into data lakes.
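To make the idea of evaluating provenance at run-time a little more concrete, here is a minimal sketch in Python. It assumes a provenance record is attached to each API as an entry in the APIs.json `properties` list. The `x-provenance` property type, the example API names, and the URLs are all hypothetical, invented for illustration; they are not part of the APIs.json specification or of anything we have published.

```python
import json

# Hypothetical property type for a provenance record (an assumption,
# not something defined by the APIs.json specification).
PROVENANCE_TYPE = "x-provenance"

# Hypothetical APIs.json index with two made-up entries.
INDEX_JSON = """
{
  "name": "Example Gallery",
  "apis": [
    {
      "name": "Example Weather API",
      "baseURL": "https://api.weather.example.com",
      "properties": [
        {"type": "x-openapi", "url": "https://weather.example.com/openapi.json"},
        {"type": "x-provenance", "url": "https://weather.example.com/provenance.json"}
      ]
    },
    {
      "name": "Example Scraped Listings API",
      "baseURL": "https://api.listings.example.com",
      "properties": [
        {"type": "x-openapi", "url": "https://listings.example.com/openapi.json"}
      ]
    }
  ]
}
"""


def has_provenance(api):
    """Return True if the API entry carries a provenance property with a URL."""
    return any(
        prop.get("type") == PROVENANCE_TYPE and prop.get("url")
        for prop in api.get("properties", [])
    )


def filter_by_provenance(index):
    """Keep only the APIs in the index that declare where their data comes from."""
    return [api for api in index.get("apis", []) if has_provenance(api)]


index = json.loads(INDEX_JSON)
trusted = filter_by_provenance(index)
trusted_names = [api["name"] for api in trusted]
print(trusted_names)
```

A check like this could run before any stream is subscribed to, so that entries without a provenance record never make it into a data lake. What the provenance document itself should contain is a separate question that this sketch does not try to answer.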
While all of this might seem like a hassle now, it will only improve the quality and the value of data in the long run. The data stewards who can provide provenance on the data they are publishing with APIs will be able to establish trust with other API consumers, and with API service providers like Streamdata.io. Data consumers will learn to stay away from streams of data that lack provenance and a history that can be trusted. They will tap into only the highest quality data streams, while also ensuring they receive only the most relevant information by leveraging the event-driven opportunities that Streamdata.io brings to the table. We think this will help us stand out from the competition in this new era of doing business on a GDPR-regulated web.