How to Streamline Content Auditing with Automated Publish Date Extraction
January 5, 2025

Auditing content feels like a chore because it is a chore.

Sifting through endless pages, hunting for outdated articles, and trying to make sense of what needs attention are tasks no one wants to add to their to-do lists. It’s frustrating, time-consuming, and often turns into a productivity black hole.

By automating one of the most tedious aspects of this process—extracting publishing dates—you can turn hours of work into minutes. 

All it takes is a little automation know-how to quickly identify which content needs refreshing, uncover trends in your content calendar, and even analyze your competitors’ publishing strategies with ease.

In this guide, we’ll show you how to automate publish date extraction, helping you save time, work smarter, and take control of your content strategy. 

Whether you’re tackling a full-site audit or simply looking to boost your SEO efforts, this step-by-step process will give you the insights you need to stay ahead.

Why Automate Publish Date Extraction?

Content auditing is more than just housekeeping. 

It’s a way to gather insights that drive smarter decisions. And at the heart of this process lies one crucial detail: the publish date. 

Publish dates tell a story about your content’s lifecycle, revealing what’s fresh, what’s outdated, and where you have opportunities to improve. By automating the extraction of publish dates, you can unlock a wealth of actionable insights without the manual effort. 

Here’s how this data can transform your content strategy:

  • Content Auditing and Refresh: Easily pinpoint outdated articles that need updating, focusing your efforts on the pieces most likely to deliver results.
  • Competitor Insights: Benchmark your competitors’ publishing frequency to identify gaps in your strategy and opportunities to gain an edge.
  • Trend Analysis: Align your content calendar with historical performance by correlating publish dates with traffic spikes and seasonal trends.
  • SEO Strategy: Improve your crawl budget efficiency by identifying stale pages and strengthen internal linking between newer and older content.

Automating this process saves time and keeps your data accurate and consistent. Best of all, it frees you to focus on higher-value tasks like optimizing, refreshing, or creating more impactful future content.

Step 1: Extract URLs from Your Sitemap

Now that we’ve established how valuable publishing dates can be, the next step is to gather the URLs of your content. 

To do this, you’ll start with your website’s sitemap—a blueprint that maps out all the pages on your site. By extracting URLs from your sitemap, you’ll create a focused list of content ready for publish date analysis.

Go to your website’s sitemap, usually located at <yourdomain>/sitemap.xml. For example: https://www.hypelocal.com/sitemap.xml.

You’ll see something like this:

[Image: Hypelocal Sitemap]

Copy the content of the sitemap. Then, use ChatGPT to extract the URLs enclosed in this content.

Act as a professional data analyst. I have an XML sitemap containing URLs within <loc> tags.

Your task is to extract these URLs, ensuring that only the links are captured without any surrounding XML tags. Please compile the extracted URLs into a CSV file with a single column labeled 'URL'. Make sure the CSV file is properly formatted and ready for download, as it will be used for further analysis.

<Paste Sitemap Content Here>
    

Then, export the cleaned URLs into a CSV or spreadsheet for further processing.
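
If you’d rather skip the copy-and-paste step, the same extraction can be sketched in a few lines of Python. This is a minimal sketch, not the method used above: the sample sitemap, the example.com URLs, and the function names extract_urls and urls_to_csv are all illustrative assumptions. It relies only on Python’s standard library.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; in practice you would paste or load your own sitemap here.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/blog/first-post</loc></url>
  <url><loc>https://www.example.com/blog/second-post</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element, in document order."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    urls = []
    for element in root.iter():
        # Tags arrive as "{namespace}loc", so compare only the local name.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


def urls_to_csv(urls):
    """Build CSV text with a single 'URL' column, mirroring the prompt above."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["URL"])
    for url in urls:
        writer.writerow([url])
    return buffer.getvalue()


urls = extract_urls(SAMPLE_SITEMAP)
csv_text = urls_to_csv(urls)
print(csv_text)
```

Matching on the local tag name keeps the sketch working whether or not the sitemap declares the usual sitemaps.org namespace.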

Step 2: Clean and Filter the URLs

Now that you have your list of URLs, the next step is to separate the signal from the noise. 

Not all URLs in your sitemap are created equal—some may link to admin pages, categories, or other irrelevant sections.

  • Focus on blog content: Filter out URLs that don’t point to articles or key content pages.
  • Remove the header row: The script reads every row as a URL, so the sheet should contain only data.
  • Download the file: Once your list is refined, save it as a Tab-Separated Values (TSV) file named urls.tsv, the name the script in Step 3 expects.
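
For longer lists, the filtering can also be sketched in Python. This sketch assumes your articles live under a /blog/ path, which may not match your site; the BLOG_PREFIX value, the example.com URLs, and the helper names are hypothetical and only there for illustration.

```python
import csv
import io
from urllib.parse import urlparse

BLOG_PREFIX = "/blog/"  # Assumption: articles live under /blog/ on your site


def is_blog_url(url, prefix=BLOG_PREFIX):
    """True when the path starts with the prefix and has something after it."""
    path = urlparse(url).path
    return path.startswith(prefix) and len(path.rstrip("/")) > len(prefix.rstrip("/"))


def filter_urls(rows):
    """Drop the header row, blank rows, duplicates, and non-blog URLs."""
    seen = set()
    kept = []
    for row in rows:
        url = row.strip()
        if not url or url.lower() == "url":  # Skip blanks and the 'URL' header
            continue
        if is_blog_url(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


def to_tsv(urls):
    """Build TSV text with one URL per row and no header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for url in urls:
        writer.writerow([url])
    return buffer.getvalue()


# Hypothetical rows, as they might come out of Step 1
raw_rows = [
    "URL",
    "https://www.example.com/blog/first-post",
    "https://www.example.com/blog/",
    "https://www.example.com/wp-admin/",
    "https://www.example.com/blog/first-post",
    "",
    "https://www.example.com/blog/second-post",
]
blog_urls = filter_urls(raw_rows)
tsv_text = to_tsv(blog_urls)
print(tsv_text)
```

Writing tsv_text to a file named urls.tsv gives you the input the script in Step 3 expects. A prefix check like this still lets category pages under /blog/ through, so a quick manual scan of the result is worthwhile.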

Step 3: Set Up the Automation Script

Here’s where the magic of automation happens. 

You’ll use a Python script to extract publish dates for each URL in your file.

But first, you have to go to Google Colab and create a new notebook. Name your notebook something descriptive, like “Publish Date Extractor.”

[Image: Google Colab]

Then, add the script below to extract the publishing dates.

!pip install beautifulsoup4

import csv
import requests
from bs4 import BeautifulSoup

def fetch_published_date(url):
    """
    Fetches the published date of an article from its meta properties.

    Parameters:
    - url (str): The URL of the web page.

    Returns:
    - str: The published date in 'YYYY-MM-DD' format if available, else an empty string.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        # A timeout keeps one slow page from stalling the whole run
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            # Look for the Open Graph article:published_time meta tag
            meta_tag = soup.find('meta', property='article:published_time')
            if meta_tag and 'content' in meta_tag.attrs:
                published_time = meta_tag.attrs['content']
                published_date = published_time.split('T')[0]  # Extracts date part
                return published_date
        return ""
    except Exception:
        # Any failure (network error, timeout, bad URL) yields an empty date
        return ""

# Assuming 'urls.tsv' is the file where each line is a URL
urls_file = 'urls.tsv'
results = []

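with open(urls_file, 'r', newline='', encoding='utf-8') as file:
    tsv_reader = csv.reader(file, delimiter='\t')
    for row in tsv_reader:
        if not row or not row[0].strip():
            continue  # Skip blank lines, which would otherwise raise an IndexError
        url = row[0].strip()  # Each row holds a URL in the first column
        published_date = fetch_published_date(url)
        results.append((url, published_date))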

# Save the results to a new TSV file
output_file = 'published_dates.tsv'

with open(output_file, 'w', newline='', encoding='utf-8') as file:
    tsv_writer = csv.writer(file, delimiter='\t')
    tsv_writer.writerow(['URL', 'Published Date'])  # Header
    for url, date in results:
        tsv_writer.writerow([url, date])
    

Once you’ve added the script, rename the notebook so you know when it was last used. Then, click the file icon in Colab and upload your urls.tsv file.

[Image: URL TSV File]

Ensure the file name matches the reference in the script. Once verified, run the script by pressing the play button.

[Image: Run Code in Google Colab]

It will extract publish dates for each URL and save the results to a new file called published_dates.tsv.
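
To see what the script is looking for on each page, here is a small offline sketch of the same lookup. It swaps BeautifulSoup for Python’s built-in html.parser so it runs without any installs or network access; the sample HTML, the PublishedTimeParser class, and the published_date_from_html function are illustrative assumptions, not part of the script above.

```python
from html.parser import HTMLParser

# Hypothetical page source containing the meta tag the script searches for
SAMPLE_HTML = """<html><head>
<title>Sample post</title>
<meta property="og:title" content="Sample post">
<meta property="article:published_time" content="2024-03-15T09:30:00+00:00">
</head><body><p>Hello</p></body></html>"""


class PublishedTimeParser(HTMLParser):
    """Records the content of the first article:published_time meta tag."""

    def __init__(self):
        super().__init__()
        self.published_time = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.published_time is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "article:published_time":
            self.published_time = attributes.get("content")


def published_date_from_html(html):
    """Return the date part (YYYY-MM-DD) of the tag, or '' if it is absent."""
    parser = PublishedTimeParser()
    parser.feed(html)
    if not parser.published_time:
        return ""
    return parser.published_time.split("T")[0]


sample_date = published_date_from_html(SAMPLE_HTML)
missing_date = published_date_from_html("<html><head></head></html>")
print(repr(sample_date), repr(missing_date))
```

Pages that don’t include this meta tag come back as an empty string, which is why some rows in published_dates.tsv may have no date.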

Step 4: Analyze Your Results

Once the script finishes running, download the published_dates.tsv file and open it in your preferred tool. 

The file will include two columns: the URL and its corresponding publish date. The date is left blank for any page where the script couldn’t find a publish date.

If you’re analyzing content from your own site for SEO purposes, sort by publish date, review the oldest articles, and update the ones that need it so they keep performing and stay relevant.
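
One way to find those older articles is sketched below. It assumes the two-column layout of published_dates.tsv and treats anything older than 365 days as a refresh candidate; that threshold, the sample rows, and the find_stale function are assumptions for illustration, so adjust them to your own refresh policy.

```python
import csv
import io
from datetime import date

# Hypothetical rows in the same layout as published_dates.tsv
SAMPLE_TSV = "\n".join([
    "URL\tPublished Date",
    "https://www.example.com/blog/a\t2021-06-01",
    "https://www.example.com/blog/b\t2024-11-20",
    "https://www.example.com/blog/c\t",
    "https://www.example.com/blog/d\t2023-01-10",
])


def find_stale(tsv_text, today, max_age_days=365):
    """Return (url, age_in_days) pairs older than max_age_days, oldest first."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    stale = []
    for row in reader:
        try:
            published = date.fromisoformat(row.get("Published Date") or "")
        except ValueError:
            continue  # Blank or unparseable date, so nothing to compare
        age = (today - published).days
        if age > max_age_days:
            stale.append((row["URL"], age))
    stale.sort(key=lambda item: item[1], reverse=True)
    return stale


stale_articles = find_stale(SAMPLE_TSV, today=date(2025, 1, 5))
for url, age in stale_articles:
    print(f"{url} is {age} days old")
```

Rows without a date are skipped rather than flagged, so it’s worth checking those pages by hand.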

If you’re analyzing content from a competitor’s site, compare publish dates to identify how often they’re producing content.
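
Publishing frequency can be estimated by counting posts per month. The sketch below assumes dates in the YYYY-MM-DD format the script produces; the sample dates and the function names posts_per_month and average_per_month are made up for illustration.

```python
from collections import Counter

# Hypothetical publish dates, as they would appear in published_dates.tsv
sample_dates = [
    "2024-09-03",
    "2024-09-17",
    "2024-10-01",
    "",  # A page where no date was found
    "2024-12-10",
    "2024-12-12",
    "2024-12-30",
]


def posts_per_month(dates):
    """Count posts per 'YYYY-MM' month, ignoring blank dates."""
    counts = Counter(d[:7] for d in dates if d)
    return dict(sorted(counts.items()))


def average_per_month(dates):
    """Average posts per month across the span from first to last post."""
    known = sorted(d for d in dates if d)
    if not known:
        return 0.0
    first_year, first_month = int(known[0][:4]), int(known[0][5:7])
    last_year, last_month = int(known[-1][:4]), int(known[-1][5:7])
    # Count every month in the span, including months with no posts
    months = (last_year - first_year) * 12 + (last_month - first_month) + 1
    return len(known) / months


monthly = posts_per_month(sample_dates)
average = average_per_month(sample_dates)
print(monthly, average)
```

Including empty months in the average gives a more honest picture of cadence than averaging only the months that had posts.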

You can also use this data to align your publishing calendar with dates that have historically performed well for your audience, helping you maximize engagement and traffic.

Efficiency Through Automation

The insights from this data have several practical applications, and the process that delivered them doesn’t stop here.

You can integrate automation into other parts of your SEO and content workflows, from tracking updates to monitoring seasonal trends. 

The key is to let technology handle the heavy lifting so you can focus on high-level strategy and decision-making.

Publish dates might seem like a small detail, but examined carefully, they give you a practical way to optimize your content strategy.

With automation, you can extract, analyze, and act on this data at scale, saving time, improving accuracy, and driving better results.