Counting URLs Present Within A Stack Overflow Questions And Answer Stream

Kin Lane

7 years ago

We recently showcased how you can proxy the Stack Exchange API using Streamdata.io. The Stack Exchange API provides programmatic access to a variety of Q&A websites, including the wildly popular Stack Overflow, where developers ask questions and share answers across a variety of topics. The API provides access to a wealth of data about what is happening each day within the tech sector, and it is something we wanted to explore further as a source for building machine learning models.

Using the Stack Overflow platform, we'd like to better understand which companies people are talking about and, eventually, train a set of machine learning models that can be used to make some predictions. This project will roll out in several stages, and the first step towards this objective is processing all URLs that are referenced within each question and answer. We already showed how to proxy the questions API from the Stack Exchange API, establishing a stream of questions and answers from Stack Overflow. Now we'd like to add some code that pulls any URLs present in the streaming responses and stores them separately for aggregating and counting by domain.
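As a rough idea of what that extraction step could look like, here is a minimal Python sketch. It is not the code we run in production. It assumes each item arriving from the stream is a dictionary with a `body` field holding the question or answer text (an assumption about how the response is shaped), and the function names `extract_urls`, `domain_of`, and `count_domains` are ours, made up for illustration. The regular expression is deliberately simple and will miss some edge cases.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simple pattern: http(s) URLs up to whitespace, quotes, angle brackets or ")".
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def extract_urls(text):
    """Return every http(s) URL in the text, trimming trailing punctuation."""
    return [url.rstrip(".,;:!?") for url in URL_PATTERN.findall(text)]


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_domains(items):
    """Count referenced domains across items that carry a 'body' field (assumed)."""
    counts = Counter()
    for item in items:
        for url in extract_urls(item.get("body", "")):
            domain = domain_of(url)
            if domain:
                counts[domain] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"body": '<p>See <a href="https://www.github.com/foo/bar">repo</a>.</p>'},
        {"body": "The docs are at https://docs.python.org/3/library/re.html."},
    ]
    for domain, total in count_domains(sample).most_common():
        print(domain, total)
```

Keeping the extracted URLs separate from the domain counts means the raw URLs can be stored as-is and re-aggregated later if we decide to group them differently.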

Our goal is to establish real-time counts of the URLs being referenced by programmers on Stack Overflow. We'll be aggregating them and counting them up by day, week, and month to see which domains are the most referenced on the site. We want to have enough historical data to train some machine learning (ML) models that we can keep up to date in real time, and also use to help predict future trends a week, a month, or several months out. We'd like to understand the potential for using Stack Overflow as a way of keeping our finger on the pulse of which technological trends are growing in strength, and to identify newer trends early on. By aggregating the URLs being referenced, we should be able to better understand which domains are getting the most mindshare amongst developers today.
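The day, week, and month rollups can be sketched in a few lines of Python. This is only an illustration under some assumptions: each observation is a pair of a Unix timestamp in seconds (such as the `creation_date` of the question or answer the URL came from) and a domain, weeks are ISO weeks in UTC, and the helper names `period_keys` and `aggregate` are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def period_keys(epoch_seconds):
    """Map a Unix timestamp to its day, ISO week and month keys (UTC)."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    iso_year, iso_week, _ = moment.isocalendar()
    return {
        "day": moment.strftime("%Y-%m-%d"),
        "week": f"{iso_year}-W{iso_week:02d}",
        "month": moment.strftime("%Y-%m"),
    }


def aggregate(observations):
    """Count domains per day, week and month from (timestamp, domain) pairs."""
    totals = {
        "day": defaultdict(Counter),
        "week": defaultdict(Counter),
        "month": defaultdict(Counter),
    }
    for epoch_seconds, domain in observations:
        for granularity, key in period_keys(epoch_seconds).items():
            totals[granularity][key][domain] += 1
    return totals


if __name__ == "__main__":
    observations = [
        (1514764800, "github.com"),   # 2018-01-01 00:00 UTC
        (1514768400, "github.com"),   # one hour later
        (1515369600, "example.com"),  # 2018-01-08 00:00 UTC
    ]
    totals = aggregate(observations)
    for month, counts in totals["month"].items():
        print(month, counts.most_common(3))
```

Because every observation updates all three granularities as it arrives, the same loop works for backfilling historical data and for keeping the counts current from the live stream.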

Once we have a year's worth of historical data and are pulling URLs in real time via the Stack Overflow streams we are setting up, we'll start looking at other data points we can aggregate via the platform. Beyond URLs, we'll look for specific company and product names, as well as variations on those, so that we can get more specific than the domain alone about what we track. The Stack Exchange API provides us with the ability to look at tags and their popularity, but we are guessing there is a lot more unstructured data present in the questions and answers that isn't represented in the tags. Across the URL, company, product, and tag aggregation, we think we'll be able to train some interesting ML prediction models that can take some guesses at which trends will keep growing and which are heading downward, providing some interesting data that can drive technological investment and adoption.

Original source: streamdata.io blog
