We recently showcased how you can proxy the Stack Exchange API using Streamdata.io. The Stack Exchange API provides programmatic access to a variety of Q&A websites, including the wildly popular Stack Overflow, where developers ask questions and share answers across a variety of topics. The API provides access to a wealth of data about what is happening each day within the tech sector, and it is something we wanted to explore further when it comes to building machine learning models.
Using the Stack Overflow platform, we’d like to better understand which companies people are talking about and, eventually, train a set of machine learning models that can be used to make some predictions. This project will roll out in several stages, but to start working towards this objective, we wanted to begin by processing all URLs that are referenced within each question and answer. We already showed how to proxy the questions API from the Stack Exchange API, establishing a stream of questions and answers from Stack Overflow. Now we’d like to add some code that pulls any URLs present in the streaming responses and stores them separately for aggregating and counting by domain.
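To make the URL step a little more concrete, here is a minimal sketch in Python of what pulling URLs out of each item and counting them by domain could look like. It is not the code we are running, just an illustration under a few assumptions: each streamed item is treated as a dictionary with a "body" field holding the question or answer HTML (the Stack Exchange API only returns the body when the request filter includes it), the names extract_urls, domain_of, and count_domains are ones we made up for this example, and the regular expression is a rough approximation rather than a complete URL parser.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Rough pattern: match http(s) URLs up to whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(text):
    """Return every http(s) URL found in a block of text or HTML."""
    # Trim punctuation that commonly trails a URL in prose.
    # Note: this also trims a closing parenthesis that belongs to the URL.
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(text)]


def domain_of(url):
    """Return the lowercased host name, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_domains(items, counts=None):
    """Add the domains referenced in each item's body to a running Counter."""
    counts = Counter() if counts is None else counts
    for item in items:
        # "body" is an assumed field name; items without it are skipped.
        for url in extract_urls(item.get("body", "")):
            domain = domain_of(url)
            if domain:
                counts[domain] += 1
    return counts


# Hypothetical sample items standing in for streamed questions and answers.
sample_items = [
    {"body": '<p>See <a href="https://www.example.com/docs">the docs</a>.</p>'},
    {"body": "I followed https://example.com/guide, then http://api.example.org/v1."},
]

domain_counts = count_domains(sample_items)
print(domain_counts.most_common())
```

Passing the same Counter back into count_domains for each new batch keeps a running total as the stream delivers updates. In a real pipeline, the extracted URLs would likely also be written to their own store so the domain counts can be rebuilt later.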
Once we have a year’s worth of historical data and are pulling URLs in real time via the Stack Overflow streams we are setting up, we’ll start looking at other data points we can aggregate via the platform. Beyond URLs, we’ll look for specific company and product names, as well as variations on those, so that we can get more specific about what we track than the domain alone. The Stack Exchange API provides us with the ability to look at tags and their popularity, but we are guessing there is a lot more unstructured data present in the questions and answers that isn’t represented in the tags. Across the URL, company, product, and tagging aggregation, we think we’ll be able to train some interesting ML prediction models that can take some guesses at which trends will keep growing and which are heading downward, providing some interesting data that can drive technological investment and adoption.