We’ve been turning on multiple streams of data from similar, yet separate API providers lately, lighting up streams of 311, 511, 911, and other municipal data for use in city-focused applications. Instead of routing the streams of data to interactive real-time dashboards, websites, or mobile applications, we are building data lakes. Each stream we turn on becomes a small creek that feeds into a larger data lake. For this particular effort we are writing the real-time results from each data source to Amazon S3, dumping the raw JSON as it arrives into a series of buckets that help us organize the different types of data we are gathering.
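To make that write path concrete, here is a minimal sketch of what dumping raw JSON into organized buckets can look like, assuming boto3 (the AWS SDK for Python); the bucket names and key layout are hypothetical choices for illustration, not our production setup.

```python
import json
import time

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Hypothetical bucket names, one per type of data we are gathering.
BUCKETS = {
    "311": "city-data-311",
    "511": "city-data-511",
    "911": "city-data-911",
}


def write_raw_json(feed_type, payload):
    """Dump one raw JSON payload, exactly as it came in, into the bucket
    for its feed type, under a date-partitioned, timestamped key."""
    key = f"raw/{feed_type}/{time.strftime('%Y/%m/%d')}/{time.time_ns()}.json"
    s3.put_object(
        Bucket=BUCKETS[feed_type],
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
```

Partitioning keys by date keeps each bucket easy to browse, and makes it simple to replay or expire a single day of a feed later on.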
To handle the hundreds of individual streams we are turning on, we have set up a framework for managing many long-running server-side scripts, each subscribed to one of the streams we’ve initiated. Our framework handles kicking off the long-running jobs, streaming data into the buckets they are directed to use, and running regular checks to make sure each job stays running. All each job needs to know is the Streamdata.io proxy URL it is supposed to subscribe to, and the bucket it is supposed to stream any data it receives into. That is all. Depending on the API being proxied, we might turn streams on or off, sometimes based upon the time of day, or possibly for the weekend. When working with data from city government, sometimes the feeds dry up at night or on the weekends, and there really is no reason to have the infrastructure up and running 24/7.
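A single job can be sketched in a few lines, since Streamdata.io delivers updates over Server-Sent Events (SSE). The version below assumes the `sseclient` package and reuses the `write_raw_json` helper from the earlier sketch (assumed saved as `s3_writer.py`); the proxy URL, token, and feed name are placeholders, not real endpoints.

```python
import json

from sseclient import SSEClient  # pip install sseclient

from s3_writer import write_raw_json  # the helper sketched above


def run_job(proxy_url, feed_type):
    """One long-running job: subscribe to a single Streamdata.io proxy URL
    over SSE and stream everything it receives into the job's bucket."""
    for event in SSEClient(proxy_url):
        if not event.data:
            continue  # keep-alive heartbeat, nothing to store
        # Streamdata.io sends a full snapshot first, then incremental
        # JSON Patch events; the lake keeps both, raw.
        write_raw_json(feed_type, json.loads(event.data))


if __name__ == "__main__":
    # Placeholder proxy URL and token; real jobs get these from the framework.
    run_job(
        "https://streamdata.motwin.net/https://api.example-city.gov/311"
        "?X-Sd-Token=YOUR_TOKEN",
        "311",
    )
```

Because a job is nothing more than a URL and a bucket, the framework can start, stop, and restart hundreds of them on a schedule without any per-feed logic.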
If you are interested in learning more about streaming data from existing web APIs into data lakes on AWS, let us know. It doesn’t have to be city data. It could be market data, news, healthcare, education, or other sources of data. We are happy to help you find the sources of data, understand how you can proxy existing web APIs using Streamdata.io, and even help you think through what is possible using your AWS infrastructure. There is a lot of untapped potential out there in the growing number of web APIs, especially when the providers don’t realize the potential of streaming their real-time data and routing it into streams that can feed different lakes of data, becoming a fresh source of innovation on the machine learning and artificial intelligence front.