{"pageProps":{"page":1,"posts":[{"date":"2022-05-18T22:45:00Z","layout":"post","title":"Web Scraping static web pages in .Net Core","author":["Christopher Long"],"tags":["Web Scraping",".Net Core","C#"],"excerpt":"OverviewI was talking to a developer friend the other day about starting a little side project where you could keep track of the product prices on supermarket websites and trend them over time. This would allow a user to see what’s increasing in price at any given moment. Unfortu...","body":"\n## Overview\nI was talking to a developer friend the other day about starting a little side project where you could keep track of the product prices on supermarket websites and trend them over time. This would allow a user to see what’s increasing in price at any given moment. Unfortunately, none of the major supermarkets offer convenient APIs to easily get this information. Tesco’s used to offer a handy API, but it was decommissioned in 2016, sad face.\nThis left me with a slightly less elegant method of getting the data I wanted: web scraping. This is something I have very little experience in, so it’s time to start with the basics.\n\n## What is Web Scraping?\nSimply put, web scraping is your program using a regular old HTTP request to go off to a website of your choosing and grab all the HTML/CSS from the page. Usually, you’d then want to parse the results, only saving the data you’re interested in.\nYou need to use different methods of scraping for different types of web pages. To start with, we’re going to look at how to get data from static web pages.\n\n## Prerequisites\nI’m going to be building this in a .Net Core Web Application using MVC. 
So, if you’re following along, go ahead and create a new project.\n\n \nI’ll also be making use of the “HTML Agility Pack”, which you can find via the NuGet package manager:\n\nThis package makes parsing the HTML content much more intuitive.\n\n## Retrieving the HTML\nFortunately for us, .Net Core includes native asynchronous HTTP request libraries. Let’s get us some HTML!\nThe first step is choosing the web page we want to scrape. As I mentioned, we’re starting with static pages, so I’ve chosen a Wikipedia page dear to all our hearts.\n\n \nAs you can see above, I’ve placed the URL variable in the HomeController Index() method, as this is the default call when you first open an MVC web application.\nWhen using the .NET HTTP library, a static asynchronous task is returned from the request, so we’ll need to build out the code to handle the request functionality in its own static method.\n\n \nI’ve added this to the HomeController class to keep things simple. I’ve also updated the Index() method to call the method on our URL.\n\n \nLet’s quickly test that we’ve set things up correctly and are receiving the HTML data. The easiest way to do this is to place a breakpoint on the return View(); line and run the program.\n\nLooks good! Now to parse the data.\n\n## Parsing the HTML\nThe first step in parsing the HTML is knowing what we’re looking for. For this exercise, I’m happy with just getting all the individual programming languages held on the page. \nInspecting the programming language names in the Chrome dev tools shows us that each name is contained in an `