We’ve been turning on multiple streams of data from similar, yet separate, API providers lately, lighting up streams of 311, 511, 911, and other municipal data for use in city-focused applications. Instead of routing the streams of data to interactive real-time dashboards, websites, or mobile applications, we are building data lakes. Each of the streams we turn on becomes a small creek that feeds into a larger data lake. For this particular effort we are writing the real-time results from each data source to Amazon S3, dumping the raw JSON as it arrives into a series of buckets that help us organize the different types of data we are gathering.
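To make the "raw JSON into buckets" step concrete, here is a minimal sketch in Python. It is not our production code: the bucket name, the `source` prefix, and the date-partitioned key layout are assumptions made for illustration, and the only AWS call used is boto3's `put_object`.

```python
import json
import uuid
from datetime import datetime, timezone


def object_key(source: str, received_at: datetime) -> str:
    """Build a date-partitioned key, e.g. 311/2018/05/01/<timestamp>-<id>.json.

    The layout is a hypothetical convention, not a requirement of S3.
    """
    stamp = received_at.strftime("%Y%m%dT%H%M%S%fZ")
    return f"{source}/{received_at:%Y/%m/%d}/{stamp}-{uuid.uuid4().hex[:8]}.json"


def write_raw(bucket: str, source: str, payload, s3_client=None) -> str:
    """Store one payload, unmodified, as its own JSON object and return its key."""
    if s3_client is None:
        # boto3 is the third-party AWS SDK; imported lazily so the helper
        # above can be used (and tested) without it installed.
        import boto3
        s3_client = boto3.client("s3")
    key = object_key(source, datetime.now(timezone.utc))
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

Writing one small object per message keeps the raw data untouched and easy to replay later. Whether to batch messages into larger objects instead is a cost and query-performance decision this sketch does not try to make.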
To handle the hundreds of individual streams we are turning on, we have set up a framework for managing many long-running server-side scripts that subscribe to the streams we’ve initiated. Our framework handles initiating the long-running jobs, streaming data into the buckets they are directed to use, and running regular checks to make sure each job stays running. All each job needs to know is the Streamdata.io proxy URL it is supposed to subscribe to and the bucket it is supposed to stream any data it receives into. That is all. Depending on the API being proxied, we might turn streams on or off, sometimes based on the time of day, or possibly for the weekend. When working with data from city government, the feeds sometimes dry up at night or on the weekends, and there really is no reason to have the infrastructure up and running 24/7.
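A job like the ones described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our actual framework: the `StreamJob` fields for the schedule, the `sink` callback, and the helper names are all hypothetical, and the proxy is assumed to deliver Server-Sent Events whose `data:` payloads are JSON.

```python
import json
import urllib.request
from dataclasses import dataclass
from datetime import time
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class StreamJob:
    proxy_url: str                        # Streamdata.io proxy URL to subscribe to
    bucket: str                           # bucket that receives everything emitted
    active_from: time = time(0, 0)        # hypothetical schedule fields
    active_until: time = time(23, 59, 59)
    weekdays_only: bool = False

    def is_active(self, now) -> bool:
        """Return True if the job should be running at datetime `now`."""
        if self.weekdays_only and now.weekday() >= 5:  # 5, 6 = Saturday, Sunday
            return False
        return self.active_from <= now.time() <= self.active_until


def sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_type, data) pairs from Server-Sent Events text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                    # a blank line ends the current event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):        # comment line, often a keep-alive
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)


def open_stream(proxy_url: str) -> Iterator[str]:
    """Open the proxy URL as a long-lived stream of decoded text lines."""
    request = urllib.request.Request(
        proxy_url, headers={"Accept": "text/event-stream"}
    )
    response = urllib.request.urlopen(request)
    return (raw.decode("utf-8") for raw in response)


def run_job(job: StreamJob, lines: Iterable[str],
            sink: Callable[[str, str, object], None]) -> int:
    """Forward each event's JSON payload to sink(bucket, event_type, payload)."""
    count = 0
    for event_type, data in sse_events(lines):
        sink(job.bucket, event_type, json.loads(data))
        count += 1
    return count
```

In use, a supervisor would check `job.is_active(datetime.now())` on a timer, start `run_job(job, open_stream(job.proxy_url), sink)` when a job should be running, and restart it if the connection drops. Passing the lines and the sink in as arguments keeps the job itself ignorant of everything except its URL and its bucket.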
Our goal is to have hundreds of 311, 511, and 911 streams flowing into our Amazon S3 data lake, increasing its volume with data from hundreds of cities. Then we can use the data lake as a source of data for a variety of machine learning projects we have on the roadmap. As part of our profiling of APIs for inclusion in the Streamdata.io API Gallery, we have been targeting machine learning APIs from Amazon, Google, and Azure, from marketplaces like Algorithmia, and from other providers. We want to see what is possible when we take the city data we’ve been gathering and begin applying indexing, sentiment analysis, and other common ML approaches to understanding large amounts of data. We aren’t sure what all the data will tell us, but it makes for some interesting prototypes, and stories to tell here on the blog.
If you are interested in learning more about streaming data from existing web APIs into data lakes on AWS, let us know. It doesn’t have to be city data. It could be market data, news, healthcare, education, or other sources of data. We are happy to help you find the sources of data, understand how you can proxy existing web APIs using Streamdata.io, and even think through what is possible using your AWS infrastructure. There is a lot of untapped potential out there in the growing number of web APIs, especially when providers don’t realize the potential of streaming their real-time data and routing it into streams that can feed different lakes of data, becoming a fresh source of innovation on the machine learning and artificial intelligence front.