
Get web page content (scrape a URL) or page element in Sheets.


This guide is designed for non-coders and beginners who want to scrape URLs or extract data from web pages into Google Sheets by simply copying and pasting code into the Apps Script editor.


Important Note:
We recommend not scraping URLs unless it is explicitly allowed. Always check the website's Terms of Service (TOS) before attempting to extract data. Proceed at your own risk; violating a website's TOS may have legal consequences.


What Is the GETURL Function?

The GETURL function allows you to extract content from web pages directly into your Google Sheets. It can fetch the page's title, headings, meta descriptions, paragraphs, and more. You can specify what you want to extract, or let the function retrieve a default set of elements like the page title and main headings.


Step-by-Step Setup

1. Open Your Google Sheet

  • Go to Google Sheets.
  • Open an existing spreadsheet or create a new one.

2. Open the Apps Script Editor

  1. In the menu bar of your spreadsheet, click on Extensions.
  2. Select Apps Script. This will open the Apps Script editor in a new tab.

3. Paste the Code

  1. In the Apps Script editor, delete any placeholder code that appears there by default.
  2. Copy the entire code provided at this link and paste it into the Apps Script editor.

4. Save and Authorize

  1. Click the disk icon (or press Ctrl + S) to save the script.
  2. Give the script a name like WebScraper.
  3. When you save or run the function for the first time, it will ask for authorization.
    • Click Review Permissions.
    • Select your Google account.
    • Review the permissions and click Allow.
      This allows the script to access external URLs and update your spreadsheet.

How to Use the GETURL Function

Once the script is set up, you can use the GETURL function directly in your Google Sheets.

Function Format:

=GETURL(url, [component])

Parameters:

  1. url (required):
    The webpage URL you want to extract data from.
    Example: "https://example.com"

  2. component (optional):
    The specific part of the webpage to extract. If not provided, the function will return a default set of components (like the page title and headings).
    Example: "title" for the page title, "h1" for the main heading, or "meta:description" for the meta description.


Examples:

1. Extract the Title of a Page

In a cell, type:

=GETURL("https://example.com", "title")

This will fetch the <title> tag from the webpage and display it in the cell.

2. Extract All <h1> Headings

=GETURL("https://example.com", "h1")

This will fetch all the <h1> headings on the page.

3. Extract Meta Description

=GETURL("https://example.com", "meta:description")

This will fetch the meta description content from the page.

4. Use the Function Without a Component

=GETURL("https://example.com")

This will extract multiple elements (title, headings, and paragraphs) as defined in the script’s settings.


Available Settings in the Script

You can adjust the behavior of the function by modifying settings directly in the Apps Script editor. These settings are found at the top of the script.

  1. max_character_returned (default: 10000):
    Limits the number of characters returned by the function to prevent excessively long outputs. Raise this value if you want more content in your results, or lower it to keep outputs short.

    Example:

    var max_character_returned = 2000; // Allow up to 2000 characters

  2. follow_redirect (default: true):
    Determines whether the function follows redirects (e.g., when a URL redirects to another page).

    Example:

    var follow_redirect = false; // Do not follow redirects

  3. retry_max (default: 3):
    The maximum number of times the function retries fetching a URL if it encounters an error.

    Example:

    var retry_max = 5; // Retry up to 5 times

  4. user_agent (default: "Mozilla/5.0"):
    The User-Agent string sent with each request. Change this if you encounter issues with websites blocking certain types of requests.

    Example:

    var user_agent = "MyCustomAgent/1.0"; // Custom User-Agent

  5. defaultExtractionList:
    Defines the components to extract when no component parameter is provided. By default, it includes title, h1, h2, p, etc. You can add or remove components from this list.

    Example:

    var defaultExtractionList = ["title", "h1", "h2", "h3", "p"];
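
To see how these settings fit together, here is a simplified sketch of the fetch pattern they control. This is an illustration under the assumption that the script wraps UrlFetchApp, not the script's actual code:

    function fetchPage(url) {
      var options = {
        followRedirects: follow_redirect,      // honor the redirect setting
        headers: { "User-Agent": user_agent }, // note: Apps Script may override this header
        muteHttpExceptions: true               // return error responses instead of throwing
      };
      for (var attempt = 1; attempt <= retry_max; attempt++) {
        try {
          var html = UrlFetchApp.fetch(url, options).getContentText();
          return html.substring(0, max_character_returned); // enforce the length cap
        } catch (e) {
          Utilities.sleep(500 * attempt); // brief back-off before retrying
        }
      }
      return "Error: could not fetch " + url;
    }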

Common Issues and Troubleshooting

1. Authorization Issues

If you encounter an error when trying to use the function, ensure that you have authorized the script.

  • Go to Extensions > Apps Script.
  • In the Apps Script editor, select the GETURL function from the function dropdown and click Run to trigger the authorization process.

2. Malformed HTML

Some websites may have poorly formatted HTML, which can cause parsing errors. In such cases, the function uses fallback methods (like regex) to extract content. However, the results might not be perfect for heavily JavaScript-driven pages.
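
For illustration, a regex fallback for pulling a page title might look like the sketch below. This is a simplified stand-in for the idea, not the script's actual code, and extractTitleFallback is a hypothetical name:

    function extractTitleFallback(html) {
      // Grab whatever sits between <title> and </title>, tolerating attributes
      var match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
      return match ? match[1].trim() : ""; // empty string if no title is found
    }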

3. Blocking Websites

Some websites may block automated requests or requests that don’t appear to come from a regular browser. This could result in a timeout error or a response with no content.

4. JavaScript-Rendered Pages

If the webpage relies on JavaScript to load its content dynamically, the function may only fetch the initial HTML. This can lead to incomplete results. Unfortunately, Apps Script doesn’t support executing JavaScript on fetched pages.
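
One way to confirm this is to fetch the raw HTML from the editor and check whether the text you expect is present. Below is a small diagnostic sketch (checkRawHtml is a hypothetical helper, not part of the GETURL script):

    function checkRawHtml() {
      var html = UrlFetchApp.fetch("https://example.com", { muteHttpExceptions: true }).getContentText();
      // false here suggests the content is injected by JavaScript after page load
      Logger.log(html.indexOf("text you expected to see") !== -1);
    }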

5. Rate Limits

If you use the function frequently or scrape many URLs, some websites may block your IP temporarily. To avoid this:

  • Add a delay between requests (manually or via script logic; see the sketch below).
  • Use a custom user_agent to mimic a real browser.
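
A minimal sketch of adding a delay, assuming you batch-fetch a list of URLs from a regular script function rather than from individual GETURL formulas (fetchWithDelay is an illustrative name, not part of the script):

    function fetchWithDelay(urls) {
      var results = [];
      for (var i = 0; i < urls.length; i++) {
        // muteHttpExceptions keeps one failed URL from stopping the whole loop
        results.push(UrlFetchApp.fetch(urls[i], { muteHttpExceptions: true }).getContentText());
        Utilities.sleep(1000); // wait one second between requests
      }
      return results;
    }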

6. Blank Results

If the function returns blank:

  • Ensure the URL is correct and accessible (the sketch after this list shows one way to check).
  • Check if the requested component exists on the page.
  • Increase the retry attempts (retry_max) in the script if the page is slow to load.
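
To verify that a URL is reachable, you can run a quick check from the Apps Script editor. This is a debugging sketch with a hypothetical name (checkUrl), not part of the GETURL script:

    function checkUrl() {
      var response = UrlFetchApp.fetch("https://example.com", { muteHttpExceptions: true });
      Logger.log(response.getResponseCode()); // 200 means the page is reachable
    }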

Summary of Features

What It Can Extract:

  • Title: "title"
  • Headings: "h1", "h2", ..., "h6"
  • Text Content: "p", "div", "ul", "li", "table"
  • Images: "img" (all image URLs)
  • Meta Tags: "meta:description", "meta:keywords", "meta:title"

When to Use a Parameter:

  • If you know the specific element to extract (e.g., "h1"), provide it as the second parameter.
  • If you want multiple elements (title, headings, etc.), omit the second parameter.

Why Use This Script?

  • No Coding Needed: Once set up, use it like any other Google Sheets formula.
  • Highly Customizable: Adjust settings to fit your needs.
  • Reliable Multi-Step Parsing: Attempts XML parsing first, then uses fallback methods for problematic pages.

With this setup, you can easily scrape and analyze data from web pages without leaving your spreadsheet!
