Scraping Your First Site
Now that everything is set up, let's get into the actual scraping! Don't feel bad if it takes a couple of attempts; your first scrape is always the hardest.
1.
Selecting your first element!
To select an element to scrape, hover over it in the display of the website. It should be highlighted in blue. Then, click the element. In the case of this Amazon page, you want to scrape the name of the product, so click a product name on the page.
2.
Selecting the others
Once clicked, your first element should now have a green outline. You might notice that other similar elements on the page are also highlighted in yellow. Because you want the name of every product on the page and not just the single element (the name of the HERSHEY’s product in this case), you'll also click on one of the yellow elements.
3.
Counting elements
After you click, every element that was highlighted in yellow is selected to be scraped, which is indicated by the green highlight. You might also notice that in the panel labeled “Select candy_name” on the left side of the screen, the number of elements selected for scraping is shown in parentheses (45).
4.
Viewing data types
Now, you need to choose the specific type of data to extract from the selections. This information is shown under Select candy_name(45). By default, the name and URL of the selected elements are extracted.
5.
Removing data types
There's no need to scrape the URL of the elements, so you can remove it by clicking on the trash can icon next to the Extract url command.
6.
Adding data types
Right now, ParseHub is only extracting the names of the candy products, but you also want the prices. To keep each price matched to the right product name, you'll need to use a command called relative select.
7.
Relative product select
Relative select attaches newly selected elements to already selected elements. In this case, it would attach each price to a candy name, which is what you want. In order to use it, you first need to click on the already selected element. After doing so, an arrow will appear from the already selected element prompting you to click on the new element you want to attach to it. It should look like this:
8.
I'm still missing some!
You might notice that some of the prices were not selected. To select them, just repeat step 7 for those elements.
9.
Extraction time!
Let’s rename selection1 to more accurately reflect what it selects. You can name it candy_price. Now, to extract data from the relative selections, click the plus sign next to the command. Then, under advanced, click on “Extract”.
10.
Select a data type
Now, you need to specify the type of data that is extracted. By default, it's on the option “Text”, which extracts the text within the selected element. This is what you want since the price of each product is text data.
11.
Test the program!
To test if your program correctly extracts the product name and price, click the “Get Data” button. Then, click the Run button and wait for the data to be collected.
12.
Downloading the data
Once the run finishes, download the results as a JSON file. JSON is a text format for storing and transmitting data objects and their attributes, which suits this use case well. After downloading the file, open it in VSCode and take a look at the data. It should look like this:
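Once you have the file, you can also read it with a few lines of Python. The sketch below is only an illustration. It assumes the file has a top-level candy_name list whose entries carry name and candy_price fields (matching the names chosen in the steps above). The sample values are made up rather than taken from the real page, and parse_price and load_products are hypothetical helper names.

```python
import json

# Hypothetical sample standing in for the downloaded file. The product
# names and prices are invented for illustration; your real file will differ.
SAMPLE = """
{
  "candy_name": [
    {"name": "Example Chocolate Bar", "candy_price": "$1.99"},
    {"name": "Example Gummy Bears", "candy_price": "$12.50"},
    {"name": "Example Lollipop"}
  ]
}
"""


def parse_price(text):
    """Turn a price string such as "$1.99" into a float, or None if unusable."""
    if not text:
        return None
    try:
        return float(text.replace("$", "").replace(",", "").strip())
    except ValueError:
        return None


def load_products(raw_json, list_key="candy_name", price_key="candy_price"):
    """Return a list of {"name", "price"} dicts from the scraped JSON text.

    The key names are assumptions based on the selection names used in
    this tutorial; change them if your project uses different names.
    """
    data = json.loads(raw_json)
    products = []
    for item in data.get(list_key, []):
        products.append({
            "name": item.get("name"),
            # A product whose price was never attached (see step 8) has no
            # price field, so it ends up with price None.
            "price": parse_price(item.get(price_key)),
        })
    return products


products = load_products(SAMPLE)
for product in products:
    print(product["name"], product["price"])
```

To try this on your own data, read the downloaded file's contents and pass that text to load_products in place of SAMPLE. Any entry that comes back with a price of None is a sign that the relative select in step 8 missed that product.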