Creating a Simple Web Scraping Function in Vanilla JS
Web scraping is an amazing skill to have because the possible uses are endless. Today I’ll be showing you how to scrape elements from a website with just a few lines of code. All you will need installed is Node, npm, and one package called puppeteer.
To start things off, cd into the directory you want your web scraper to live in. Then, in your terminal, run the command: npm init -y
- npm init -y: npm init initializes a new npm project, and the -y flag accepts all of the default answers. This is a simple way to create a place for all your packages and dependencies to go.
After that is set up, install the only package we will be using, which is puppeteer. In your terminal, run the command: npm install puppeteer.
Simple enough! Now we are ready to begin.
First, we’ll need to import puppeteer and create a new async function with an argument for the URL of the site we want to scrape (it’s important we use an async function because we will be using the await operator, which can only be used inside async functions).
The first lines of code in our function are very important when using puppeteer:
Here, we are creating a new browser instance, then opening a new page in that browser, and finally loading our chosen URL into that page. The await operator ensures these run in order, the next action not being executed until the one before it is complete. Now we are set up to start grabbing our elements from the webpage.
First, go to the website you want to scrape. Then, right-click on the element you want, and click Inspect to open the developer tools. With your desired element highlighted, right-click on that HTML and choose Copy -> Copy XPath.
Great! We’ve located the element we want. Now, we have to save that element and extract the property from the HTML.
For this example, my goal is to extract the item’s image URL. We still need to use await for each of these steps, because the order matters.
- The $x method evaluates our chosen XPath on the page and returns an array of matching elements.
- Next, we extract the image URL by using getProperty to grab the src.
- Finally, we convert the retrieved handle into a readable format using jsonValue.
That’s all there is to it! You can run a console.log inside your function, and then call the function with your target URL as a string to see if you are retrieving the values you intend.
For example: I scraped for the image of this green shirt.
Which, when I run node scraper.js, returns this:
Say I wanted more than just the image: if I also wanted to scrape the name, link, and price, I would just repeat the above steps, adjusting for the different elements I am scraping for.
For example, if I wanted to scrape for the elements mentioned above, it would look like this:
At the end of your async function, you must add an await browser.close(). This simply closes the browser, so that when you run node scraper.js, the process doesn’t hang forever.