A guide on how to do Web Scraping in .NET C#, with code samples.
What is Web Scraping
In English, the word Scraping has several dictionary definitions, but they all revolve around the same idea:
to remove (an outer layer, adhering matter, etc.) in this way: to scrape the paint and varnish from a table.
the act of removing the surface from something using a sharp edge or something rough.
What interests us here, however, is what Web Scraping means in software.
In software, Web Scraping is the process of extracting information from a web resource through its user interface rather than its official APIs. In other words, instead of calling a website's REST API to get data, you retrieve the page the way a browser does, parse the HTML, and then extract the data rendered into it.
Why Would We Need to Scrape a Website
Simply because we need the data presented on the website, and the website does not provide an official API for retrieving it.
Is Web Scraping Legal
It depends on the web resource itself. Some websites state explicitly whether scraping is allowed; others do not mention it at all.
Another factor is what you intend to do with the data you scrape. Therefore, always be cautious and keep yourself safe. Do your research before jumping into implementation.
How to do Web Scraping
There are different ways of doing it, but in most cases the same concept applies: you write some code to get the HTML using the website URL, you parse the HTML, and finally you extract the data you want.
However, if we only stick to this definition, we would be missing a lot of details.
In some cases, things are more complicated than that. It depends on the way the website is built.
For static websites, where the data is already rendered into the HTML served on the first request, you can follow the steps described above.
However, for dynamic websites, where the data is not present in the initial HTML and is instead loaded through JavaScript libraries and frameworks (like Angular, React, Vue, …), you need a different approach.
Basically, in this case you mimic what a web browser (like Chrome, Firefox, IE, Edge, …) does and then read the final HTML from the virtual browser you used. Once you have the full HTML with the data rendered, the rest is the same.
Should We Do This Ourselves from Scratch
No, we already have some libraries which we can use to achieve the expected results.
For example, here is a list of some libraries which we can use.
Performing Calls: HttpClient
Parsing HTML: HtmlAgilityPack
Virtual Browser: Selenium.WebDriver (with a browser-specific driver such as Selenium.WebDriver.ChromeDriver)
These are not the only libraries available to help with a Web Scraping project; if you search the internet, you will find many more.
Scraping a Static Website
First, let’s start by scraping some data from a static website. In this example, we are going to scrape my own GitHub profile: https://github.com/AhmedTarekHasan
We will try to get the list of pinned repositories on the profile. Each entry will consist of the repository name and its description.
Therefore, let’s start.
Observing the Data Structure in the HTML
At the moment of writing this article, this is how my GitHub profile looked.
When I checked the HTML, I found the following:
All my pinned repositories are found inside the main container with this path: div[@class='js-pinned-items-reorder-container'] > ol > li
Each pinned repository is contained inside a container with this path relative to the parent: div > div
Each pinned repository has its name inside div > div > span > a, and its description inside p
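Before wiring these paths into the scraper, it can help to sanity-check them in isolation. The sketch below runs the XPath expressions derived from the observations above against a hand-written HTML fragment that mimics the observed structure (the repository name and description in the sample are placeholders, and the class name reflects GitHub's markup at the time of writing, which may have changed since):

```csharp
using System;
using HtmlAgilityPack;

internal static class XPathSketch
{
    private static void Main()
    {
        // Hand-written fragment mimicking the structure observed on the profile page.
        const string sample =
            "<div class='js-pinned-items-reorder-container'><ol><li><div><div>" +
            "<div><div><span><a>SampleRepository</a></span></div></div>" +
            "<p>Sample description</p>" +
            "</div></div></li></ol></div>";

        var document = new HtmlDocument();
        document.LoadHtml(sample);

        // Container path: div[@class='js-pinned-items-reorder-container'] > ol > li > div > div
        var item = document.DocumentNode.SelectSingleNode(
            "//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");

        // Relative paths for the two pieces of data we want.
        Console.WriteLine(item.SelectSingleNode("div/div/span/a").InnerText); // repository name
        Console.WriteLine(item.SelectSingleNode("p").InnerText);              // description
    }
}
```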
Writing Code
Here are the steps I followed:
Created a Console Application
Solution: WebScraping
Project: WebScraper
Installed the NuGet package HtmlAgilityPack.
Added the using directive using HtmlAgilityPack;
Defined the method private static Task<string> GetHtml() to get the HTML.
Defined the method private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html) to parse the HTML.
Finally, the code should be as follows:
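The listing below is a minimal sketch of what those steps could look like, not the original code from the article. It assumes the XPath paths observed earlier; note that GitHub's actual class attributes may contain additional classes, so the exact-match @class comparison may need adjusting against the live page.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

namespace WebScraper
{
    internal static class Program
    {
        private static async Task Main()
        {
            var html = await GetHtml();
            var repositories = ParseHtmlUsingHtmlAgilityPack(html);

            foreach (var (name, description) in repositories)
            {
                Console.WriteLine($"{name}: {description}");
            }
        }

        // Downloads the raw HTML of the profile page, the same way a plain GET would.
        private static Task<string> GetHtml()
        {
            var client = new HttpClient();
            return client.GetStringAsync("https://github.com/AhmedTarekHasan");
        }

        // Extracts (name, description) pairs using the paths observed earlier.
        private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html)
        {
            var result = new List<(string RepositoryName, string Description)>();

            var document = new HtmlDocument();
            document.LoadHtml(html);

            // Container of each pinned repository.
            var nodes = document.DocumentNode.SelectNodes(
                "//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");

            if (nodes == null) return result;

            foreach (var node in nodes)
            {
                var name = node.SelectSingleNode("div/div/span/a")?.InnerText.Trim();
                var description = node.SelectSingleNode("p")?.InnerText.Trim();
                result.Add((name, description));
            }

            return result;
        }
    }
}
```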
Running this code, you get the names and descriptions of the pinned repositories printed to the console.
You might want to apply some cleanup to the extracted strings, but that is not a big deal.
As you can see, it is easy to use HttpClient and HtmlAgilityPack. All you need is to get used to their APIs, and then it becomes an easy job.
Keep in mind that some websites require more work on your side. Sometimes a website needs login details, authentication tokens, specific headers, and so on.
You can still handle all of this with HttpClient or with other libraries for performing calls.
Scraping a Dynamic Website
Now we will try to scrape some data from a dynamic website. However, since I should be cautious before scraping a website, I will reuse the same example as before, this time treating the website as if it were dynamic.
Therefore, again, in this example we are going to scrape my own GitHub profile: https://github.com/AhmedTarekHasan
Observing the Data Structure in the HTML
This would be the same as before.
Writing Code
Here are the steps I followed:
Created a Console Application
Solution: WebScraping
Project: WebScraper
Installed the NuGet package HtmlAgilityPack.
Installed the NuGet package Selenium.WebDriver.
Installed the NuGet package Selenium.WebDriver.ChromeDriver.
Added the using directive using HtmlAgilityPack;
Added the using directive using OpenQA.Selenium.Chrome;
Defined the method private static string GetHtml() to get the HTML.
Defined the method private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html) to parse the HTML.
Finally, the code should be as follows:
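Again, the listing below is a minimal sketch rather than the original code. The main change from the static example is that GetHtml now drives a real Chrome instance through Selenium so that any JavaScript on the page runs before the HTML is captured; the --headless flag is my own choice here, so Chrome runs without a visible window:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using OpenQA.Selenium.Chrome;

namespace WebScraper
{
    internal static class Program
    {
        private static void Main()
        {
            var html = GetHtml();
            var repositories = ParseHtmlUsingHtmlAgilityPack(html);

            foreach (var (name, description) in repositories)
            {
                Console.WriteLine($"{name}: {description}");
            }
        }

        // Drives Chrome so JavaScript executes, then captures the fully rendered HTML.
        private static string GetHtml()
        {
            var options = new ChromeOptions();
            options.AddArgument("--headless"); // run without a visible browser window

            using var driver = new ChromeDriver(options);
            driver.Navigate().GoToUrl("https://github.com/AhmedTarekHasan");

            return driver.PageSource;
        }

        // Identical parsing to the static example: once we have the final HTML,
        // the rest of the work is the same.
        private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html)
        {
            var result = new List<(string RepositoryName, string Description)>();

            var document = new HtmlDocument();
            document.LoadHtml(html);

            var nodes = document.DocumentNode.SelectNodes(
                "//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");

            if (nodes == null) return result;

            foreach (var node in nodes)
            {
                var name = node.SelectSingleNode("div/div/span/a")?.InnerText.Trim();
                var description = node.SelectSingleNode("p")?.InnerText.Trim();
                result.Add((name, description));
            }

            return result;
        }
    }
}
```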
Running this code, you get the same output as in the static example.
Again, you might want to apply some cleanup to the extracted strings, but that is not a big deal.
Again, as you can see, using Selenium.WebDriver and Selenium.WebDriver.ChromeDriver is easy.
Final Words
As you can see, Web Scraping is not that hard, but it does depend on the website you are trying to scrape. Sometimes you will come across a website that needs some tricks to get it to work.
That’s it. I hope you found reading this article as interesting as I found writing it.