We’ve been developing more serverless applications that run on AWS lately, which let you stream data from a variety of API providers into your Amazon S3 data lake(s). Continuing with the same theme of building data lakes from 3rd party APIs, we wanted to see whether the same kind of streaming could be delivered as a simple JavaScript package that runs entirely on GitHub, pushing the boundaries of what is possible with APIs, GitHub, and data lakes. As part of this work, we’ve published our first streaming topic subscriptions, from Reddit to GitHub. This JavaScript micro application runs on GitHub, is hosted by GitHub Pages, and leverages Jekyll, HTML, CSS, and JavaScript to deliver the application interface. The micro application connects to the Reddit API and offers five separate searches, each defined using OpenAPI. Each search is proxied through Streamdata.io, which delivers a Server-Sent Events (SSE) stream for the topic, and the stream is displayed at the bottom of the page. The result is five separate topical streams from Reddit that can be triggered and monitored in the browser on the GitHub Pages hosted web application.
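To make the flow above concrete, here is a minimal sketch of how a browser script might subscribe to one topic. It is not the prototype's actual code. The proxy prefix, the `X-Sd-Token` query parameter, and the `data` and `patch` event names are assumptions about how Streamdata.io is typically used, and the function names are hypothetical. The Reddit search URL is built by hand here rather than read from the OpenAPI definitions.

```javascript
// Assumed proxy prefix; confirm against your Streamdata.io account settings.
const PROXY_PREFIX = "https://streamdata.motwin.net/";

// Build a Reddit search URL for a topic (newest results first).
function buildRedditSearchUrl(topic) {
  const params = new URLSearchParams({ q: topic, sort: "new", limit: "25" });
  return `https://www.reddit.com/search.json?${params.toString()}`;
}

// Prefix the Reddit URL with the proxy and append the (assumed) token parameter.
function buildStreamUrl(topic, token) {
  const target = buildRedditSearchUrl(topic);
  return `${PROXY_PREFIX}${target}&X-Sd-Token=${encodeURIComponent(token)}`;
}

// Open an SSE connection and hand each parsed event to the caller.
// EventSourceImpl defaults to the browser's EventSource; it is injectable
// so the function can be exercised outside a browser.
function subscribeToTopic(topic, token, onMessage, EventSourceImpl = globalThis.EventSource) {
  const source = new EventSourceImpl(buildStreamUrl(topic, token));
  // Assumed event names: "data" carries a full snapshot, "patch" carries changes.
  for (const type of ["data", "patch"]) {
    source.addEventListener(type, (event) => onMessage(type, JSON.parse(event.data)));
  }
  // Close on error so a failing prototype stream does not reconnect forever.
  source.addEventListener("error", () => source.close());
  return source;
}
```

In a page, something like `subscribeToTopic("machine learning", token, (type, payload) => render(payload))` would start the stream, where `render` is whatever hypothetical function writes results to the bottom of the page. Calling `close()` on the returned object stops it.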
Reddit allows API searches without authentication, but if you authenticate using OAuth, they’ll let you make more API calls, which allows for longer running streams. Additionally, if you authenticate with GitHub, you can save the topical stream to a repository within your GitHub account. This feature requires repository level access to your GitHub account so that data can be saved to the repository, but it opens up an entirely new way of developing data lakes, which can then be put to work using Git or the GitHub API. You can stream in data from a variety of 3rd party API sources and store it within private GitHub repositories, for use in a variety of applications and for training machine learning models.
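As an illustration of the save step, here is one way streamed records could be written to a repository, using the GitHub REST API endpoint for creating or updating file contents. This is a sketch and not necessarily how the prototype does it. The function names, the record shape, and the commit message are hypothetical, and the token is assumed to be an OAuth token with repository access.

```javascript
// Base64-encode UTF-8 text, as the GitHub contents endpoint expects.
function toBase64(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary);
}

// Build the PUT request that creates (or, when sha is given, updates) a file.
function buildSaveRequest({ owner, repo, path, token, records, sha }) {
  const body = {
    message: `Add ${records.length} streamed records to ${path}`,
    content: toBase64(JSON.stringify(records, null, 2)),
  };
  if (sha) body.sha = sha; // required by GitHub when the file already exists
  return {
    url: `https://api.github.com/repos/${owner}/${repo}/contents/${path}`,
    options: {
      method: "PUT",
      headers: {
        Authorization: `Bearer ${token}`,
        Accept: "application/vnd.github+json",
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    },
  };
}

// Send the request; fetchImpl is injectable so this can be tested offline.
async function saveRecords(request, fetchImpl = globalThis.fetch) {
  const { url, options } = buildSaveRequest(request);
  const response = await fetchImpl(url, options);
  if (!response.ok) throw new Error(`GitHub save failed: ${response.status}`);
  return response.json();
}
```

Writing each batch to its own path, for example one file per topic and timestamp, avoids needing the existing file's `sha` on every save. That layout is a design suggestion, not something the prototype is known to do.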
Right now the prototype will only stream one topic at a time, but it should demonstrate the potential for streaming from Reddit using their API, Streamdata.io, and GitHub. If you have any questions about the prototype, feel free to submit an issue on the GitHub repository. We are working on deploying Stack Exchange, Twitter, and GitHub search versions of the same prototype. For now we’ll keep publishing collections of topics that stream data within intended areas, and we’ll eventually open this up to wider search capabilities in future versions. This streaming Reddit topic subscription prototype is not production ready and is just meant to demonstrate the potential of streaming APIs on GitHub. If you are looking for a specific implementation or would like to obtain a more stable version of this micro application, please let us know.