Streaming 3rd Party APIs Into An Amazon S3 Data Lake

Amazon S3 provides a proven, high-quality, high-volume data storage solution that makes for a great data lake. You can operate Amazon S3 in different geographic regions, and scale it up in both storage size and levels of access. Because Amazon S3 is API-driven, it is easy to use the storage platform as a destination for streams of data coming from a variety of existing 3rd party APIs. This allows you to build data lakes, ponds, and swimming pools from a mix of valuable data stores, using data extracted from the wealth of publicly available APIs across the online landscape.

We’ve been exploring different ways to make streams of data delivered on top of existing 3rd party APIs as plug and play as possible, something Amazon Web Services makes pretty easy using Lambda functions. We recently created a prototype market data API stream using Lambda, and next we are working on making it deliver data into an Amazon S3 data lake. The idea is to turn AWS Lambda functions into little serverless streams of data, using common 3rd party APIs as the source and Amazon S3 data lakes as the destination. This allows companies, organizations, institutions, and government agencies to turn valuable streams of data on and off from any existing API out there.
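To make the idea concrete, here is a minimal sketch of what such a connector could look like in Python. It is not our prototype, just an illustration under stated assumptions: the source URL, the bucket name, the `quotes` prefix, and the helper names (`fetch_json`, `build_key`, `put_to_s3`) are all hypothetical placeholders. The only AWS call used is boto3's `put_object` on an S3 client.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical placeholders, not taken from any real connector.
SOURCE_URL = "https://api.example.com/v1/quotes"  # assumed 3rd party API endpoint
BUCKET = "example-data-lake"                      # assumed S3 bucket name


def fetch_json(url):
    """Fetch one batch of records from the source API as parsed JSON."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def build_key(prefix, now):
    """Build a date-partitioned object key so each run lands in its own object."""
    return f"{prefix}/{now:%Y/%m/%d}/{now:%H%M%S}.json"


def put_to_s3(bucket, key, body):
    """Write one object to S3. boto3 is imported lazily so the rest of the
    module can be exercised locally without AWS credentials."""
    import boto3
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=body, ContentType="application/json"
    )


def handler(event, context, fetch=fetch_json, put=put_to_s3, now=None):
    """Lambda entry point: pull one batch from the API and store it in S3.
    The fetch/put/now parameters exist only to make local testing easy."""
    now = now or datetime.now(timezone.utc)
    records = fetch(SOURCE_URL)
    key = build_key("quotes", now)
    put(BUCKET, key, json.dumps(records).encode("utf-8"))
    return {"bucket": BUCKET, "key": key, "count": len(records)}
```

Run on a schedule, a function like this behaves as a small stream you can switch on or off by enabling or disabling its trigger. Passing the fetch and storage steps in as parameters is a design choice for the sketch, so the handler can be tried out with stand-ins before it touches a real API or bucket.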

We are developing a suite of serverless streaming connectors that connect to existing APIs like Twitter, Stack Exchange, Reddit, and GitHub, and deliver valuable data and content into Amazon S3 buckets for use in other applications. We have a growing number of customers asking us where the best data is, as well as how they can deliver it as real-time streams into their existing infrastructure, which is often Amazon Web Services. This is pushing us to become an API discovery service as well as an API streaming delivery service, and to turn the valuable API definitions we’ve been publishing as part of the API Gallery into AWS platform connectors using Lambda.

We are almost done with the first set of serverless streaming connectors, which take data from Stack Exchange and Reddit and publish it to Amazon S3. Once ready, we will be publishing them to GitHub, as well as the AWS Serverless Application Repository. If there are specific APIs or data sources you’d like to see an AWS Lambda connector developed for, feel free to let us know, and we’ll be happy to prioritize our roadmap based upon the real-time streams you need. If an API is already in the API Gallery, it is pretty straightforward for us to generate a Lambda connector using the OpenAPI definitions we use to drive the gallery, speeding up the process by which you can begin delivering 3rd party data into your Amazon S3 data lake for access across your existing AWS infrastructure.
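As a rough illustration of why an OpenAPI definition makes this straightforward, the sketch below walks a definition and turns each parameter-free GET operation into a small connector config. This is an assumption about how such generation might work, not a description of our actual tooling: the function name `connector_configs`, the config fields, and the sample definition are all hypothetical. It relies only on the standard OpenAPI 3 fields `servers`, `paths`, and `operationId`.

```python
def connector_configs(spec):
    """Return one connector config per GET operation in an OpenAPI 3 definition.
    Paths with template parameters (e.g. /users/{id}) are skipped, because a
    scheduled connector would need input values to call them."""
    base = spec["servers"][0]["url"].rstrip("/")
    configs = []
    for path, item in spec.get("paths", {}).items():
        get_op = item.get("get")
        if get_op is None or "{" in path:
            continue
        # Fall back to a name derived from the path when operationId is absent.
        name = get_op.get("operationId") or path.strip("/").replace("/", "_")
        configs.append({"name": name, "url": base + path, "s3_prefix": name.lower()})
    return configs


# Hypothetical definition, trimmed to the fields the function reads.
sample_spec = {
    "openapi": "3.0.0",
    "servers": [{"url": "https://api.example.com/v1/"}],
    "paths": {
        "/questions": {"get": {"operationId": "listQuestions"}},
        "/answers": {"post": {"operationId": "createAnswer"}},
        "/posts/new": {"get": {}},
        "/users/{id}": {"get": {"operationId": "getUser"}},
    },
}

for config in connector_configs(sample_spec):
    print(config["name"], "->", config["url"])
```

Each resulting config carries enough information (a source URL and an S3 prefix) to drive a generic Lambda handler, which is the part that makes a definition-driven approach appealing.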


Original source: blog