We are beginning to roll out a number of AWS Lambda functions that connect to a variety of APIs and stream the results into your existing AWS infrastructure. Next on our list of serverless streaming connectors is one for streaming Reddit searches into an AWS data lake, so you can train machine learning models on any Reddit search or use the results in web, mobile, and other applications. It provides a plug-and-play, scalable way to stream data available via existing APIs into your organization’s data lake.
The Streamdata.io AWS Lambda streaming connector for Reddit searches is available in the AWS Serverless Application Repository. You can deploy the serverless function within your existing AWS infrastructure, proxy Reddit API searches using Streamdata.io, and publish the results in real time to an S3 bucket. This turns on a scalable faucet of data from as few or as many Reddit searches as you’d like to conduct. You can orchestrate the streams based on events or a schedule using AWS CloudWatch Events, which lets you manage each function and pay only while the streams are running.
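As a rough illustration of the scheduling piece, the sketch below shows one way a CloudWatch Events rule could be pointed at the deployed function using boto3's `put_rule` and `put_targets` calls. The rule name, function ARN, and `rate(5 minutes)` schedule are placeholders we made up for the example, and `schedule_stream` is a hypothetical helper, not part of the connector.

```python
def schedule_stream(events_client, rule_name, function_arn,
                    schedule_expression="rate(5 minutes)"):
    """Create (or update) a scheduled rule and attach the Lambda as its target.

    `events_client` is expected to behave like boto3.client("events").
    All names and the default schedule are illustrative assumptions.
    """
    # put_rule creates the rule, or updates it if the name already exists.
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule_expression,
        State="ENABLED",
    )
    # Attach the Lambda function so the rule invokes it on each tick.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": f"{rule_name}-target", "Arn": function_arn}],
    )
    return rule["RuleArn"]
```

In practice you would call this with `boto3.client("events")`, and the function would also need a resource-based permission allowing `events.amazonaws.com` to invoke it. Disabling the rule is then one way to stop paying for a stream you no longer need.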
To run the function you’ll need a Streamdata.io account and application key, which takes about a minute to set up. You’ll also need an AWS account to deploy the Lambda function into, with S3 storage activated to establish your data lake. Finally, you need a Reddit account and a token to make ongoing calls to the Reddit API. Once set up, all you do is add your Streamdata.io key and Reddit token to the Lambda function and execute it, and it begins streaming into your S3 bucket. Streamdata.io proxies and caches the Reddit API, sending only updates to your AWS Lambda function, which then publishes the incremental updates to your designated S3 bucket on the schedule you set up using AWS CloudWatch Events.
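To make that flow concrete, here is a minimal sketch of what a connector like this might do; it is not the published function's actual code. It assumes Streamdata.io's proxy is reached by prefixing the target API URL with `https://streamdata.motwin.net/` and passing the application key as an `X-Sd-Token` query parameter, and that updates arrive as server-sent events named `data` (the initial snapshot) and `patch` (an incremental update). The helper names, the `https://oauth.reddit.com/search` URL, and the S3 key layout are all illustrative.

```python
import json
from urllib.parse import quote

# Assumption: Streamdata.io's public proxy endpoint; confirm against your account.
STREAMDATA_PROXY = "https://streamdata.motwin.net/"


def build_proxy_url(reddit_search_url, streamdata_key):
    """Prefix the Reddit search URL with the proxy and append the app key."""
    separator = "&" if "?" in reddit_search_url else "?"
    return (f"{STREAMDATA_PROXY}{reddit_search_url}"
            f"{separator}X-Sd-Token={quote(streamdata_key)}")


def parse_sse(lines):
    """Yield (event_type, payload) pairs from an iterable of SSE text lines."""
    event_type, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                yield event_type, json.loads("\n".join(data_lines))
            event_type, data_lines = "message", []
        elif line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())


def publish_events(events, put_object, bucket, prefix="reddit-search"):
    """Write each snapshot or patch as its own JSON object; return the keys.

    `put_object` should accept Bucket/Key/Body keyword arguments, as
    boto3's S3 client method of the same name does.
    """
    keys = []
    for index, (event_type, payload) in enumerate(events):
        key = f"{prefix}/{index:06d}-{event_type}.json"
        put_object(Bucket=bucket, Key=key,
                   Body=json.dumps(payload).encode("utf-8"))
        keys.append(key)
    return keys
```

Inside a Lambda handler you would likely pass `boto3.client("s3").put_object` as `put_object`, and feed `parse_sse` the lines of the HTTP response opened against the proxied URL, with your Reddit token sent in an `Authorization: Bearer` header.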
The next version of our functions will abstract away the accounts and keys needed for Streamdata.io and Reddit, making your AWS account the only thing you need. In the meantime, this function should get you started streaming Reddit searches into your data lake. It lets you monitor conversations that occur on the “front page of the internet”, across any topic you choose, and enrich your data lake with relevant signals that can be used to train machine learning models and drive dashboards, web, mobile, and any other applications you need. It is an efficient way to tap into valuable third-party data sources like Reddit, find the signals that matter, and make them available across your existing infrastructure in the AWS cloud.