We are beginning to roll out a number of AWS Lambda functions that connect to a variety of APIs and stream the results into your existing AWS infrastructure. Next on our list of serverless streaming connectors is one for streaming GitHub repository search into an AWS data lake, allowing you to train machine learning models on any GitHub repository search, or to use the results in web, mobile, and other applications. It provides a plug-and-play, scalable way to stream data available via existing APIs into your organization’s data lake.
The Streamdata.io AWS Lambda streaming connector for GitHub repository search is available in the AWS Serverless Application Repository. It allows you to deploy the serverless function within your existing AWS infrastructure, proxy GitHub repository search using Streamdata.io, and publish the results in real time to an AWS S3 bucket. You can turn on a scalable faucet of data from as few or as many GitHub repository searches as you’d like to conduct, and orchestrate the streams based upon events or a schedule using AWS CloudWatch Events, which lets you manage each function and pay only for the time the streams are running.
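As a rough illustration of the proxying step, the sketch below builds a GitHub repository search URL and wraps it in a proxy URL. The search endpoint is GitHub's documented one; the proxy prefix and the name of the token parameter are placeholders assumed for this example, not values taken from the connector, so check your Streamdata.io account for the real ones.

```python
from urllib.parse import urlencode

# GitHub's documented repository search endpoint.
GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"

# Hypothetical placeholders: the actual proxy host and token parameter
# come from your Streamdata.io account and are assumptions here.
PROXY_PREFIX = "https://streamdata-proxy.example/"
TOKEN_PARAM = "X-Sd-Token"


def github_search_url(query, sort="updated", order="desc"):
    """Build a GitHub repository search URL for a free-text query."""
    params = {"q": query, "sort": sort, "order": order}
    return GITHUB_SEARCH_URL + "?" + urlencode(params)


def proxied_url(target_url, app_key):
    """Prefix the target URL with the proxy and append the application key."""
    separator = "&" if "?" in target_url else "?"
    return PROXY_PREFIX + target_url + separator + urlencode({TOKEN_PARAM: app_key})


if __name__ == "__main__":
    search = github_search_url("machine learning language:python")
    print(proxied_url(search, "YOUR_APP_KEY"))
```

Each search you want to stream becomes one such URL, so running several searches is a matter of deploying the function with different query strings.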
To run the function you’ll need a Streamdata.io account and application key, which takes about a minute to set up. You’ll also need an AWS account to deploy the Lambda function into, with S3 storage activated to serve as your data lake. Finally, you need a GitHub account and a token to make ongoing calls to the GitHub API; you can generate a token in your GitHub account settings, under Developer settings, Personal access tokens. Once set up, all you do is add your Streamdata.io key and GitHub token to the Lambda function and execute it, and it begins streaming into your S3 bucket. Streamdata.io proxies and caches the GitHub API, sending only updates to your AWS Lambda function, which then publishes the incremental updates to your designated S3 bucket on the schedule you set up using AWS CloudWatch Events.
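The incremental update flow described above can be sketched as follows. This is a minimal illustration, not the connector's actual code: it assumes the updates arrive as JSON Patch operations (RFC 6902), handles only the `add`, `replace`, and `remove` operations on non-root paths, and uses a hypothetical `s3_key` helper to name each published object.

```python
import json
from datetime import datetime, timezone


def _walk(doc, tokens):
    """Follow a list of path tokens down into nested dicts and lists."""
    for token in tokens:
        doc = doc[int(token)] if isinstance(doc, list) else doc[token]
    return doc


def apply_patch(doc, operations):
    """Apply add/replace/remove operations to doc in place and return it."""
    for op in operations:
        # Split the JSON Pointer and undo its two escape sequences.
        tokens = [t.replace("~1", "/").replace("~0", "~")
                  for t in op["path"].split("/")[1:]]
        parent, last, kind = _walk(doc, tokens[:-1]), tokens[-1], op["op"]
        if kind not in ("add", "replace", "remove"):
            raise ValueError("unsupported operation: " + kind)
        if isinstance(parent, list):
            if kind == "add":
                # "-" means append to the end of the array.
                index = len(parent) if last == "-" else int(last)
                parent.insert(index, op["value"])
            elif kind == "replace":
                parent[int(last)] = op["value"]
            else:
                del parent[int(last)]
        elif kind == "remove":
            del parent[last]
        else:
            parent[last] = op["value"]
    return doc


def s3_key(prefix, when):
    """Hypothetical naming scheme: one timestamped object per update."""
    return f"{prefix}/{when:%Y/%m/%d/%H%M%S}.json"


if __name__ == "__main__":
    snapshot = {"total_count": 1,
                "items": [{"full_name": "a/b", "stargazers_count": 10}]}
    update = [{"op": "replace", "path": "/items/0/stargazers_count", "value": 11}]
    body = json.dumps(apply_patch(snapshot, update))
    # In a Lambda handler, body would be written to the bucket, for
    # example with boto3's S3 client put_object(Bucket=..., Key=..., Body=...).
    print(s3_key("github-search", datetime.now(timezone.utc)), body)
```

Because only the changed fields travel over the wire, the function does little work when a search result is stable, which suits the pay-per-execution model.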
The next version of our functions will abstract away the accounts and keys needed for Streamdata.io and GitHub, making your AWS account the only thing you need. In the meantime, this function should get you started streaming GitHub repository search into your data lake, allowing you to monitor activity on the social coding platform. It enriches your data lake with relevant signals, which can be used to train machine learning models and to drive dashboards, web, mobile, and any other applications you need. The result is an efficient way to tap into valuable third-party data sources like GitHub, find the signals that matter, and make them available for use across your existing infrastructure in the AWS cloud.