Bot Deterrence

Once you have TollBit set up for your website, you can optionally configure bot deterrence on your existing cloud cybersecurity platform to forward known bot traffic to your new tollbit subdomain. At this time, we are only able to forward traffic from known user agents. For more advanced bot detection, see the documentation for your cybersecurity platform.

This section assumes that you are working with a clean environment for your cybersecurity platform. In reality, we understand that you likely already have additional infrastructure set up, with some measure of bot detection and blocking in place. In those cases, the solutions we provide here can be treated as purely additive: you can keep your current bot mitigation strategy and also implement what we have here. You can even keep your current solution and simply forward all bot traffic to your tollbit subdomain.

Instead of getting access to your page, bots that get forwarded will see something like the following:

{
    "message": "You are not authorized to access this content without a valid TollBit Token. Please follow this URL to find out more.",
    "url": "https://tollbit.com"
}

This will be the rate of the page that you have set for the page the bot was trying to access.

Setting up your Subdomain

The first step of getting bot deterrence set up is to create and route your tollbit subdomain. We will set up the tollbit subdomain in your registrar. This allows users to easily access the content mirrored on your main website. This will not affect your main website's SEO, load times, etc.

Navigate to your DNS provider and create new NS records for the subdomain. If your main website is www.example.com or example.com, the subdomain must be tollbit.example.com. Point the NS records at the following domains:

  • ns1.edge.tollbit.com
  • ns2.edge.tollbit.com
  • ns3.edge.tollbit.com
  • ns4.edge.tollbit.com

Once you have done that, please email team@tollbit.com and we will add the finishing touches on our end to fully route your subdomain to our platform.
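The naming rule for the subdomain can be sketched as a small helper. This is only an illustration: the function name tollbitSubdomain is hypothetical, and it assumes your main site is served from either the bare domain or its www host.

```javascript
// Hypothetical helper: derive the tollbit subdomain from a site's hostname.
// Assumes the site is served from either example.com or www.example.com.
function tollbitSubdomain(hostname) {
  const host = hostname.toLowerCase();
  const bare = host.startsWith("www.") ? host.slice(4) : host;
  return "tollbit." + bare;
}

console.log(tollbitSubdomain("www.example.com")); // tollbit.example.com
console.log(tollbitSubdomain("example.com"));     // tollbit.example.com
```

The forwarding examples later in this document apply the same rule when they build the redirect URL.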

Create and Activate Rates

This is the core of the TollBit product, and where you can set prices on your content. At the moment there are a few ways to set rates. Rates currently have the following hierarchy: bot -> page -> keyword -> time -> directory. This means that when we determine the price of a page for a particular request, we first check if that request is from a bot that matches any of your bot rates. If so, we return that rate. If there are no bot matches, we then check if the requested page matches any of your page rates. We keep going down the chain, trying to find a match, and if we find no matches at the end, the price is assumed to be 0.
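The lookup order described above can be sketched roughly as follows. This is not TollBit's implementation. The rate objects, their matches functions, and the prices are all hypothetical and exist only to show how the first match in the hierarchy wins.

```javascript
// Hierarchy from the text: bot -> page -> keyword -> time -> directory.
const RATE_ORDER = ["bot", "page", "keyword", "time", "directory"];

// Return the price of the first rate type (in hierarchy order) that matches.
function resolvePrice(request, rates) {
  for (const type of RATE_ORDER) {
    const match = (rates[type] || []).find((rate) => rate.matches(request));
    if (match) return match.price;
  }
  return 0; // no match anywhere in the chain
}

// Hypothetical rates, for illustration only.
const rates = {
  bot: [{ price: 0, matches: (r) => r.userAgent.includes("PartnerBot") }],
  page: [{ price: 0.01, matches: (r) => r.path === "/election-results" }],
  directory: [{ price: 0.001, matches: (r) => r.path.startsWith("/") }],
};

console.log(resolvePrice({ userAgent: "PartnerBot/1.0", path: "/election-results" }, rates)); // 0
console.log(resolvePrice({ userAgent: "OtherBot", path: "/election-results" }, rates));       // 0.01
console.log(resolvePrice({ userAgent: "OtherBot", path: "/weather" }, rates));                // 0.001
```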

Bot Rates

These rates allow you to set special rates for any specific bots that access your platform, and will override all other rates. You should set this type of rate if you have struck a licensing deal with a company that employs a particular user agent, and want to give them special rates to access your content (usually 0).

Page Rates

These rates allow you to set a rate for a specific page on your website. If you have any page that you know gets high bot traffic (e.g. sports or election results), or if you have a very high quality piece of original reporting, you can set a special rate for that page. This will override all other rates except bot rates.

Keyword Rates

These rates allow you to set a price for pages that may contain a particular keyword. If you know that there are some high profile sporting events coming up, you may want to set a higher price for pages that mention football or basketball. This rate is still in beta.

Time Rates

This rate allows you to define how the price of a page should change over time. You set a starting price for just published or updated content, and can define what the price of the content should be after a set amount of time passes from the last modified time. This rate allows you to automatically price content without needing to constantly manage the dashboard.
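As a rough illustration of how a time rate behaves, the sketch below computes a price from the age of the content. The data shape (startingPrice, steps, afterHours) and the prices are assumptions made for this example, not the dashboard's actual fields.

```javascript
// Hypothetical time rate: a starting price plus steps that apply once the
// content reaches a given age, measured from its last modified time.
function timeRatePrice(rate, lastModified, now) {
  const ageHours = (now - lastModified) / (60 * 60 * 1000);
  let price = rate.startingPrice;
  // Sort steps by age so the latest step that has been reached wins.
  const steps = [...rate.steps].sort((a, b) => a.afterHours - b.afterHours);
  for (const step of steps) {
    if (ageHours >= step.afterHours) price = step.price;
  }
  return price;
}

const newsRate = {
  startingPrice: 0.01,
  steps: [
    { afterHours: 24, price: 0.005 },  // after one day
    { afterHours: 168, price: 0.001 }, // after one week
  ],
};
const published = Date.parse("2024-01-01T00:00:00Z");

console.log(timeRatePrice(newsRate, published, Date.parse("2024-01-01T06:00:00Z"))); // 0.01
console.log(timeRatePrice(newsRate, published, Date.parse("2024-01-03T00:00:00Z"))); // 0.005
```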

Directory Rates

These rates let you set a flat fee for all the content within a page directory of your site. For a quick way to instantly price your content, you can set a price for your top level directory, and this will automatically apply to all pages. You can drill down into further subdirectories and set pricing there, and it will override any price in a higher directory. For example, you can set a base price of $0.001 at the root level, and then set a price of $0.005 for the /sports directory. Everything under /sports will now be $0.005 while something under /cooking will still be $0.001.
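The override behavior in the example above amounts to picking the deepest directory that contains the requested path. A minimal sketch, assuming directory rates are stored as a plain object keyed by directory path (a hypothetical shape):

```javascript
// Directory rates from the example above: root and /sports.
const directoryRates = { "/": 0.001, "/sports": 0.005 };

// Pick the rate of the deepest directory that contains the requested path.
function directoryPrice(path, ratesByDir) {
  let best = null;
  for (const dir of Object.keys(ratesByDir)) {
    const prefix = dir.endsWith("/") ? dir : dir + "/";
    const inside = path === dir || path.startsWith(prefix);
    if (inside && (best === null || dir.length > best.length)) best = dir;
  }
  return best === null ? 0 : ratesByDir[best];
}

console.log(directoryPrice("/sports/nba/finals", directoryRates)); // 0.005
console.log(directoryPrice("/cooking/pasta", directoryRates));     // 0.001
```

Matching on the directory plus a trailing slash keeps a path such as /sportsbook from being priced as part of /sports.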

CDN Bot Blocking and Forwarding

Once you have your bot paywall set up, you can forward bots to your tollbit subdomain to let them know that your content is priced. This is an optional and open-ended feature of TollBit, and the rest of the platform, such as analytics, will still work without this piece. The following examples are suggestions on how bot forwarding might be implemented, but you can certainly keep your existing bot detection methods or implement these differently.

AWS WAF + CloudFront

You can use a combination of AWS Web ACLs and CloudFront to detect and redirect bots. This example will use a Web ACL with a WAF rule to detect bots, and then have CloudFront redirect bot traffic.

First, go to the WAF & Shield console and create a new Web ACL. Ensure that the ACL being created is for CloudFront distributions. Add your existing CloudFront distribution to this ACL under the "Associated AWS resources" section of the page.

Once you've created the ACL, you can choose any rules you'd like to enable bot detection. AWS Marketplace has managed bot detection rules that you can add to your ACL. We will provide our own WAF rule as well. To use our WAF rule, select the option for using your own rules and rule groups, and use the JSON editor. Copy and paste the following rule:

{
  "Name": "cloudfront-agent-rule",
  "Priority": 0,
  "Statement": {
    "OrStatement": {
      "Statements": [
        {
          "ByteMatchStatement": {
            "SearchString": "ChatGPT-User",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "PerplexityBot",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "GPTBot",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "anthropic-ai",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "CCBot",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "Google-Extended",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "Amazonbot",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "FacebookBot",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "Claude-Web",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "cohere-ai",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "Omgilibot",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "omgili",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "YouBot",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "Bytespider",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "Diffbot",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        },
        {
          "ByteMatchStatement": {
            "SearchString": "Applebot-Extended",
            "FieldToMatch": {
              "SingleHeader": {
                "Name": "user-agent"
              }
            },
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ],
            "PositionalConstraint": "CONTAINS"
          }
        }
      ]
    }
  },
  "Action": {
    "Allow": {
      "CustomRequestHandling": {
        "InsertHeaders": [
          {
            "Name": "Bot",
            "Value": "true"
          }
        ]
      }
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "cloudfront-agent-rule"
  }
}

This rule detects the top known AI bots. For the action, be sure to choose "Allow" and to add a custom header. Ours is called Bot, but feel free to use any unique name.
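Because every statement in the rule differs only in its SearchString, you may prefer to generate the JSON from a list of user agents rather than edit it by hand. The script below is one possible sketch, intended to be run locally with Node.js. The helper names byteMatch and buildRule are hypothetical, and the output mirrors the rule shown above.

```javascript
// User agents taken from the rule above. Add or remove entries as needed.
const userAgents = [
  "ChatGPT-User", "PerplexityBot", "GPTBot", "anthropic-ai", "CCBot",
  "Google-Extended", "Amazonbot", "FacebookBot", "Claude-Web", "cohere-ai",
  "Omgilibot", "omgili", "YouBot", "Bytespider", "Diffbot", "Applebot-Extended",
];

// Build one ByteMatchStatement that checks the user-agent header.
function byteMatch(searchString) {
  return {
    ByteMatchStatement: {
      SearchString: searchString,
      FieldToMatch: { SingleHeader: { Name: "user-agent" } },
      TextTransformations: [{ Priority: 0, Type: "NONE" }],
      PositionalConstraint: "CONTAINS",
    },
  };
}

// Assemble the full rule. An OrStatement needs at least two statements.
function buildRule(name, agents) {
  return {
    Name: name,
    Priority: 0,
    Statement: { OrStatement: { Statements: agents.map(byteMatch) } },
    Action: {
      Allow: {
        CustomRequestHandling: {
          InsertHeaders: [{ Name: "Bot", Value: "true" }],
        },
      },
    },
    VisibilityConfig: {
      SampledRequestsEnabled: true,
      CloudWatchMetricsEnabled: true,
      MetricName: name,
    },
  };
}

// Print JSON that can be pasted into the WAF rule JSON editor.
console.log(JSON.stringify(buildRule("cloudfront-agent-rule", userAgents), null, 2));
```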

Next, navigate to the CloudFront product and to the "Functions" tab. Create a new function and paste in the following javascript:

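function handler(event) {
    // The WAF rule adds this header to requests that match a known bot user agent.
    if (event.request.headers['x-amzn-waf-bot'] !== undefined) {
        let host = event.request.headers.host.value;
        // Drop a leading "www." so the redirect targets tollbit.example.com.
        if (host.indexOf('www.') === 0) {
            host = host.slice(4);
        }
        const uri = event.request.uri;
        const newurl = `https://tollbit.${host}${uri}`;
        const response = {
            statusCode: 302,
            statusDescription: 'Found',
            headers:
                { "location": { "value": newurl } }
        };
        return response;
    }
    // Not a detected bot: pass the request through unchanged.
    return event.request;
}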

Earlier, our WAF rule set a header called Bot on the request if it matched the rule. AWS WAF automatically prepends x-amzn-waf- to the header name, so the actual header to look for is x-amzn-waf-bot. If this header exists, it means that our WAF rule detected that this request is a bot request, so we now want to forward it to our tollbit subdomain. Once you are ready, save the changes and publish this code. On the publish tab, you will then need to associate this function with your existing CloudFront distribution.

Cloudflare

There are several levels of bot detection and forwarding that you can configure for Cloudflare, depending on whether or not you are on their Enterprise plan.

Bot Deterrence on any Plan (Including Free)

Follow the steps described here up until you have created a new worker. Name this worker something that helps you keep track of its function (such as bot-forwarding-worker). Once you've created this worker, click into edit code and do the following to set up your forwarding worker.

If you have not set up log forwarding and just want to forward bot traffic, put this code in your worker.js file.

// this is a non-exhaustive list of agents that we recommend you get started with first
// Add any other agents you would like to forward into this list.
const botList = [
  "ChatGPT-User", "PerplexityBot", "GPTBot", "anthropic-ai", "CCBot", "Claude-Web", "ClaudeBot", "cohere-ai", "YouBot", "Diffbot"
]

export default {
  fetch (request) {
    const userAgent = request.headers.get('User-Agent') || ''
    // Parse the URL so the path and query string are kept for both http and https requests
    const url = new URL(request.url)
    const path = url.pathname + url.search
    let host = request.headers.get('host') || ''
    if (host.startsWith('www.')) {
      // remove www
      host = host.slice(4);
    }
    // Redirect the request if its user agent contains any entry in botList
    for (var i = 0; i < botList.length; i++) {
      if (userAgent.includes(botList[i])) {
        return Response.redirect('https://tollbit.' + host + path, 302)
      }
    }

    // Default behaviour
    return fetch(request);
  }
}

If you have set up log forwarding, copy and replace your worker.js file with this code instead. Make sure that you keep your Tollbit token copied over into the code.

// this is a non-exhaustive list of agents that we recommend you get started with first
// Add any other agents you would like to forward into this list.
const botList = [
  "ChatGPT-User", "PerplexityBot", "GPTBot", "anthropic-ai", "CCBot", "Claude-Web", "ClaudeBot", "cohere-ai", "YouBot", "Diffbot"
]

const CF_APP_VERSION = '1.0.0'

const tollbitLogEndpoint = "https://log.tollbit.com/log";
const tollbitToken = 'YOUR_SECRET_KEY_HERE'

const sleep = ms => {
  return new Promise(resolve => {
    setTimeout(resolve, ms)
  })
}

const makeid = length => {
  let text = ""
  const possible = "ABCDEFGHIJKLMNPQRSTUVWXYZ0123456789"
  for (let i = 0; i < length; i += 1) {
    text += possible.charAt(Math.floor(Math.random() * possible.length))
  }
  return text
}

const buildLogMessage = (request, response) => {
  const logObject = {
    timestamp: new Date().toISOString(),
    client_ip: '', // worker only is able to get cloudflare edge IP, leaving blank
    geo_country: request.cf["country"],
    geo_city: request.cf["city"],
    geo_postal_code: request.cf["postalCode"],
    geo_latitude: request.cf["latitude"],
    geo_longitude: request.cf["longitude"],
    host: request.headers.get("host"),
    url: request.url.replace("https://" + request.headers.get("host"), ""),
    request_method: request.method,
    request_protocol: request.cf["httpProtocol"],
    request_user_agent: request.headers.get("user-agent"),
    request_latency: null, // cloudflare does not have latency information
    response_state: null,
    response_status: response.status,
    response_reason: response.statusText,
    response_body_size: response.headers.get("content-length") // Response has no contentLength property
  }
  return logObject;
}

// Batching
const BATCH_INTERVAL_MS = 20000 // 20 seconds
const MAX_REQUESTS_PER_BATCH = 500 // 500 logs
const WORKER_ID = makeid(6)

let workerTimestamp

let batchTimeoutReached = true
let logEventsBatch = []

// Backoff
const BACKOFF_INTERVAL = 10000
let backoff = 0

async function addToBatch(body, event) {
  logEventsBatch.push(body)

  if (logEventsBatch.length >= MAX_REQUESTS_PER_BATCH) {
    event.waitUntil(postBatch(event))
  }

  return true
}

async function handleRequest(event) {
  const { request } = event

  const response = await fetch(request)
  const isBotRequest = checkIfBotRequest(request)

  const eventBody = buildLogMessage(request, response)
  event.waitUntil(
    addToBatch(eventBody, event),
  )

  if (isBotRequest) {
    // Parse the URL so the path and query string are kept for both http and https requests
    const url = new URL(request.url)
    const path = url.pathname + url.search
    let host = request.headers.get('host') || ''
    if (host.startsWith('www.')) {
      // remove www
      host = host.slice(4);
    }
    return Response.redirect('https://tollbit.' + host + path, 302)
  }
  return response
}

const fetchAndSetBackOff = async (lfRequest, event) => {
  if (backoff <= Date.now()) {
    const resp = await fetch(tollbitLogEndpoint, lfRequest)
    if (resp.status === 403 || resp.status === 429) {
      backoff = Date.now() + BACKOFF_INTERVAL
    }
  }

  event.waitUntil(scheduleBatch(event))

  return true
}

const postBatch = async event => {
  const batchInFlight = [...logEventsBatch.map((e) => JSON.stringify(e))]
  logEventsBatch = []
  const body = batchInFlight.join('\n')
  const request = {
    method: "POST",
    headers: {
      "TollbitKey": `${tollbitToken}`,
      "Content-Type": "application/json"
    },
    body,
  }
  event.waitUntil(fetchAndSetBackOff(request, event))
}

const scheduleBatch = async event => {
  if (batchTimeoutReached) {
    batchTimeoutReached = false
    await sleep(BATCH_INTERVAL_MS)
    if (logEventsBatch.length > 0) {
      event.waitUntil(postBatch(event))
    }
    batchTimeoutReached = true
  }
  return true
}

const checkIfBotRequest = (request) => {
  const userAgent = request.headers.get('User-Agent') || ''
  
  for (var i = 0; i < botList.length; i++) {
    if (userAgent.includes(botList[i])) {
      return true
    }
  }
  return false
}

addEventListener("fetch", event => {
  event.passThroughOnException()

  if (!workerTimestamp) {
    workerTimestamp = new Date().toISOString()
  }

  event.waitUntil(scheduleBatch(event))
  event.respondWith(handleRequest(event))
})

This code checks the user agent of every request against the list of known bot user agents. Requests that match are redirected to your tollbit subdomain, and all other requests are let through. We will periodically update the recommended list with known bad user agents.

Enterprise

If you have Cloudflare Enterprise, you should be able to use the Bot Management product to get a bot score for each request. You can add logic in the above code's checkIfBotRequest function to also return true if the bot score is lower than a certain threshold.
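One way to extend checkIfBotRequest is sketched below. It reads the score from request.cf.botManagement.score, which Cloudflare populates when Bot Management is enabled. The threshold of 30 is an assumption chosen for illustration (lower scores indicate traffic that is more likely automated), so tune it to your own traffic.

```javascript
const botList = [
  "ChatGPT-User", "PerplexityBot", "GPTBot", "anthropic-ai", "CCBot",
  "Claude-Web", "ClaudeBot", "cohere-ai", "YouBot", "Diffbot",
];

// Assumed threshold: scores below this value are treated as bot traffic.
const BOT_SCORE_THRESHOLD = 30;

const checkIfBotRequest = (request) => {
  const userAgent = request.headers.get("User-Agent") || "";
  // Known user agents are always treated as bots.
  if (botList.some((bot) => userAgent.includes(bot))) {
    return true;
  }
  // The score is only present when Bot Management is enabled for the zone.
  const score = request.cf && request.cf.botManagement
    ? request.cf.botManagement.score
    : undefined;
  return typeof score === "number" && score < BOT_SCORE_THRESHOLD;
};

// Minimal stand-in for a Workers request, for local experimentation only.
const mockRequest = (userAgent, score) => ({
  headers: { get: () => userAgent },
  cf: score === undefined ? {} : { botManagement: { score } },
});

console.log(checkIfBotRequest(mockRequest("Mozilla/5.0", 5)));  // true
console.log(checkIfBotRequest(mockRequest("Mozilla/5.0", 90))); // false
```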

Fastly

Fastly allows you to set up redirects using VCL snippets. In this document, we will go over forwarding requests from known bots to your tollbit subdomain.

Go to the Deliver tab and select the domain you wish to add bot forwarding to. On the right side of the screen, click the Edit configuration button and choose to clone your current active version.

On the left hand sidebar, click "VCL Snippets".

Create a snippet and name it something like tollbit-bot-forwarding-recv. This is the VCL code that will detect if a bot is using one of our known bad user agents, and will forward it to your subdomain. Put the following logic into the snippet. Make sure that the placement of the snippet is within the recv subroutine.

Copy and paste the following code block into the VCL input field and save. Don't worry, this VCL script will not actually apply until you activate the current Fastly version that you are editing.

if (req.http.user-agent ~ "(?i)chatgpt-user|perplexitybot|gptbot|anthropic-ai|ccbot|claude-web|claudebot|cohere-ai|youbot|diffbot") {
  if (std.prefixof(req.http.host, "www.")) {
    set req.http.host = std.replace_prefix(req.http.host, "www.", "tollbit.");
  } else {
    set req.http.host = "tollbit." + req.http.host;
  }
  error 600;
}

Next, create another VCL snippet and call it something like tollbit-bot-forwarding-error. This time, make sure that the placement is within the error subroutine.

Paste the following code in this snippet. This will set the correct headers and status code for the redirection done in the previous snippet.

if (obj.status == 600) {
  set obj.status = 307;
  set obj.response = "Temporary Redirect";
  set obj.http.Location = req.protocol + "://" + req.http.host + req.url;
  set obj.http.cache-control = "max-age=0";
  return (deliver);
}

This should now be all you need to forward known bot traffic to your tollbit subdomain! You can activate these changes by clicking "Apply".

Akamai

Akamai allows you to set up redirection rules at the edge using Cloudlets. Specifically, they provide Edge Redirector Cloudlets that help you manage redirection using certain matching rules.

We want to first start by creating an Edge Redirector policy. Follow the documentation here to do so in accordance with how your Akamai instance is set up.

Once you have set up your policy, follow the documentation here to set up rules for your Edge Redirector. Because we want to redirect based on the User-Agent header, we will need to create a redirector with advanced matching rules. You will want to create a match type based on the request header. The name of the header should be User-Agent, and the value should be a tab separated list of bad user agents. You can use the following list:

ChatGPT-User PerplexityBot GPTBot anthropic-ai CCBot Claude-Web ClaudeBot cohere-ai YouBot Diffbot

For the operator value, use "is one of" without case sensitivity. These settings should let you match our known bad user agents. In the redirection rule, you can set the redirect URL to your tollbit subdomain.

Click save rule to save your changes, and you should be ready to activate! Follow the steps here to do so.

Transactions

This page provides an audit trail where you can see all the requests that have been made to your website through Tollbit. For each request, you are able to see the user agent that made the request, the page they hit, and the price they paid for that page.

Asset Management

Control what data to include or exclude when developers request content from your website through TollBit. You can filter out certain types of assets from the HTML of your website, such as images, links or embedded content. Note that in order to properly filter these out, these assets need to be properly included in your website using well formatted HTML. For example, we won't be able to filter out a hyperlink if it's not within an <a> tag.

For more advanced use cases, you can filter out all elements with a specific HTML class.

For any partners that you have struck a deal with, you can upload a custom license for the user agents of that specific partner. Any requests made to TollBit with that partner's user agents will include the license that you uploaded in the transactions.

API Auth Settings

Most of our partners have data available on the open internet. However, if your content is behind an API that requires authentication, you can use this page to set up authentication so our agent can fetch your data on behalf of end users. We support both OAuth and header-based authentication. In the OAuth case, you set your OAuth endpoint and the payload to POST, which will include the user ID and secret key. Finally, in the Token Key field, you put the exact key in the JSON response whose value corresponds to the bearer token that we should use to make authenticated requests.
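To make the role of the Token Key field concrete, the sketch below pulls a bearer token out of an OAuth response by key and builds the header for an authenticated request. It is an illustration only: the helper name, the access_token key, and the response body are hypothetical, and the actual fetching is done by TollBit's agent.

```javascript
// Hypothetical helper: look up the bearer token under the configured key
// and build the Authorization header for subsequent requests.
function buildAuthHeader(tokenResponse, tokenKey) {
  const token = tokenResponse[tokenKey];
  if (typeof token !== "string" || token === "") {
    throw new Error(`No token found under key "${tokenKey}"`);
  }
  return { Authorization: "Bearer " + token };
}

// Example response body from an OAuth endpoint (made up for illustration).
const exampleTokenResponse = { access_token: "abc123", expires_in: 3600 };

// With "access_token" entered in the Token Key field:
console.log(buildAuthHeader(exampleTokenResponse, "access_token"));
// { Authorization: 'Bearer abc123' }
```

If the key does not match the response exactly, no token can be found, which is why the field needs the exact key name.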

Content Formatting

We format content to most effectively integrate within AI applications and into LLM contexts. This feature comes out of the box when using Tollbit. All sites onboarded with Tollbit will work with this functionality.

What does the formatted content look like?

This formatting process makes no changes to the original content. We simply clean the content so that it is ready for your data pipeline. Specifically, the data comes back as a markdown representation of the original web page. The main field of the content response will likely contain the actual content of the article without any clutter of navigational components or social media links. Should you want to use those components, you may get them from the header and footer fields if we were able to parse them out.

Finally, the metadata field may contain additional information that isn't part of the original content, but can provide additional context around the content. This could include raw data, follow-up links, or additional topics for the end user to explore.

The following is what a user who hits our FAQs page might see.

Example Content Formatting

https://tollbit.com/faqs
{
        "content": {
            "header": "",
            "main": "
            # TollBit - FAQs

            [Get started](https://signup.tollbit.com)

            # FAQs

            [Request a demo](https://signup.tollbit.com)

            -

            ## What is TollBit?

            TollBit is a first-of-its-kind platform to help websites ensure fair compensation for their content and data. The platform allows AI bots and data scrapers to pay websites directly, rewarding quality content creation and mitigating the legal uncertainty of scraping.

            -

            ## How does the platform work?

            On the supply side, TollBit’s clients are companies with openly accessible websites, whose data are vulnerable to scraping. They includes publishers, sites with user-generated content, and sites that allow end users to take action - such as e-commerce sites.

            Using TollBit, websites can sign up rates and rules can be set for autonomous (non-human) access to any specific URL. Tollbit also provides powerful analytics and visibility to companies about autonomous traffic.

            On the other side, companies doing the scraping today can use Tollbit to access content and data on websites for a fee in exchange for licensing and a cleaner more digestible version of the URL page.

            Tollbit enables websites to realize the true value of their data, which would otherwise be prone to payment-free scraping.

            -

            ## Are you onboarding publishers?

            Yes, we are onboarding publishers and partners.

            -

            ## A number of publishers are cutting their own content licensing deals with tech companies - how would this platform impact the platform?

            TollBit is an “and” product. We encourage our websites to pursue 1:1 licensing deals when they make sense. TollBit can also help provide critical missing infrastructure for licensing deals, including reporting, rate limits, and authentication.

            -

            ## How do you expect to set pricing/value of content?

            On-demand access rates are something that will depend on the unique needs and business model of our individual clients. Publishers set their own rates on TollBit. However, private licensing deal terms are never public.

            -

            ## Was this developed in concert with any publishers?

            The development of the TollBit platform was informed by conversations with dozens of publishers and the product will continue to be improved with their feedback.

            -

            ## Do publishers need to have a paywall in order to use TollBit to generate revenue? If information is already in the public domain then how can AI companies be expected to pay?

            Publishers do not have to have an existing paywall in place to generate revenue via TollBit.

            -

            ## Are there restrictions on how content can be used once licensed through TollBit?

            Yes, there are specific scopes and restrictions of the on-demand license.

            If you use the content/data in a form that is not covered by the license, then the license is not valid and you do not have protection or permission for use.

            Made with care in Nashua, Boston, and

            New York City

            © Novoscribe, Inc. 2024",
            "footer": ""
        },
        "metadata": "",
        "rate": {
            "priceMicros": 20000,
            "currency": "USD",
            "licenseType": "ON_DEMAND_LICENSE",
            "licensePath": "...",
            "error": ""
        }
    }

As you can see, none of the original content is affected in any way. Just some trimming down of the extra HTML!
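A consumer of this response might read it along these lines. This is a sketch under assumptions: the sample object is a trimmed-down copy of the response above, the summarize helper is hypothetical, and priceMicros is assumed to be expressed in millionths of the currency unit.

```javascript
// Trimmed-down response in the shape shown above.
const sampleResponse = {
  content: {
    header: "",
    main: "# TollBit - FAQs\n\n## What is TollBit?",
    footer: "",
  },
  metadata: "",
  rate: {
    priceMicros: 20000,
    currency: "USD",
    licenseType: "ON_DEMAND_LICENSE",
    licensePath: "...",
    error: "",
  },
};

// Hypothetical helper: collect the markdown text and the price of the page.
function summarize(resp) {
  // Assumption: priceMicros is in millionths of the currency unit.
  const price = resp.rate.priceMicros / 1e6;
  // Keep only the content fields that were actually parsed out.
  const parts = [resp.content.header, resp.content.main, resp.content.footer]
    .filter((part) => part !== "");
  return { text: parts.join("\n\n"), price, currency: resp.rate.currency };
}

const summary = summarize(sampleResponse);
console.log(summary.price, summary.currency); // 0.02 USD
```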

Why is this format good for AI?

This formatting maintains crucial context for the LLM, such as titles, paragraphs, and links. At the same time, it strips away the excessive HTML tags, scripts and other clutter that come back from scraping typical websites. This format should optimize the value of the content while keeping the number of tokens you pass to the LLM low.
