Written by Oscar
Building a no-code automation to monitor robots tags
I’d been looking for an excuse to try out apps that promise you can build complex automations using no-code or low-code interfaces, with a dash of AI. Obviously.
So what did I find? Well, this article is for those who are not developers but love a technical puzzle that needs solving. Developers will probably just wonder why I didn’t write a simple app in PHP or something similar!
The problem to solve
We use a few different apps to monitor client websites for which we have a maintenance agreement. However, none of them properly monitor the robots status (the instructions for web crawlers) for a site, section, or specific page.
This is important because a mistake here can either hide your website from search engines or reveal pages that you intended to be hidden, albeit publicly accessible.
After submitting multiple feature requests to existing app makers, and even hearing some positive responses, nothing has appeared that meets our needs. I will also add that the key point here is that we cannot use Google Search Console to monitor this… if Google is telling you, it’s too late!
The objective
Build a cost-effective automated app to check our websites and ensure the correct robots settings are in place. And if it finds a mismatch, alert us by telling us what the current status is, and what it should be.
For this to work, we need to tell the system what to check against; otherwise, we are up against multiple automated checks that could create false matches.
What are these ‘robot instructions’ you’re talking about? I hear you ask.
They’re the rules you give a web crawler about what it can crawl and index, and they’re really important in the world of SEO.
For simplicity, we are focusing on the most common instructions we usually use. This includes:
- all: do what you want with this content!
- noindex: do not include this page in any search results.
- nofollow: you can index it, but you can’t discover new pages that may be linked in the content.
- none: don’t use this content at all; the same as noindex, nofollow.
In the context of this project, I’m talking about three different ways that you tell crawlers what they’re allowed to do (which is what makes it complicated to automate):
Robots.txt
This is a simple file hosted at the domain root, e.g. example.com/robots.txt, within which you can provide instructions about what a crawler can do. This file is useful when you want to provide instructions for the whole site all at once. It also acts before the next two: if robots.txt blocks a crawler from a page, the crawler never even sees the instructions below…
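For example, a robots.txt that blocks every crawler from a (hypothetical) /drafts/ section might look like this:

```
User-agent: *
Disallow: /drafts/
```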
Meta robot tag
This is a simple piece of information that you can view in the HTML of the page if you view the source. Similar to the meta titles and descriptions that most web editors have come across. However, it lets us control, on a page-by-page basis, what search engines are allowed to index, and it’s generally the preferred method.
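For example, a page you want kept out of search results entirely might carry this tag in its HTML head:

```
<meta name="robots" content="noindex, nofollow">
```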
X-Robots-Tag
This is included in the HTTP headers - they’re a little harder to view, but in essence, provide additional information or instructions about the page requested by your browser. This tag is used in the same way as the meta robot tag.
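To illustrate, a response carrying the same instructions in its HTTP headers might look like this (abridged):

```
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
X-Robots-Tag: noindex, nofollow
```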
Technically, I should also consider the canonical reference you can include, but let’s take it one step at a time.
Now that we have that cleared up…
Where to start?
There are now numerous services that allow you to set up automations: Zapier, IFTTT, Make, among others.
I created accounts with several major providers and tested setting up some basic automations to get a feel for how intuitive they each were.
At this stage, I was still trying to figure out how to achieve my goal. Most of these tools rely on linking existing services, such as pulling data from Google Sheets or HubSpot, then performing an action before sending it elsewhere based on conditions. But I wanted to fetch a web page and check the value of the three different ways we set the robots instructions.
To get me going, I needed a way to scrape a web page. While researching this, I asked the AI chatbot in Make.com how I could use its tools to scrape data. It was the first to suggest a viable solution; as a result, I decided to start my build with Make.com. I liked the interface, and it appeared to be able to do what I needed.
I started to make progress by switching strategy: instead of giving detailed instructions that led to solutions that did not work, I used the AI chatbot to suggest smaller, incremental adjustments. The consequence was that I quickly used up my free allowance. And as I found, this was the first of many limits I passed while learning to use the systems. But I was off and moving forward!
Where did I want to get to?
I started to sketch a more detailed process to achieve my objective. I had managed to scrape some data, which I assumed was the most complex part, but building out the logic on either side threw up lots of smaller hurdles. I knew the automation flow needed to cover these broad steps:
- Fetch the HTML code for a given URL.
- Process the response and find the robot tags.
- Compare the tag value against what it should be.
- Send an alert if it does not match.
Sounds simple enough.
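The four steps can be sketched in plain JavaScript. This is only an illustration of the logic, not what runs in Make: the fetch and alert plumbing (and the `extractMetaRobots` and `sendAlert` helpers named in the comments) are hypothetical stand-ins for Make’s HTTP and email modules, while the comparison itself is a pure function.

```javascript
// Steps 3 and 4 as a pure function: compare the found robots value
// against the expected one and build the alert message on a mismatch.
function compareRobots(url, expected, found) {
  if (found === expected) return null; // all good, no alert needed
  return `Mismatch on ${url}: expected "${expected}", found "${found}"`;
}

// Steps 1 and 2, sketched as comments (hypothetical helpers):
// const html = await (await fetch(url)).text();    // fetch the page
// const found = extractMetaRobots(html) ?? "all";  // no tag = default
// const alert = compareRobots(url, expected, found);
// if (alert) await sendAlert(alert);               // e.g. via Mailgun
```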
However, as I started to construct the outline for this, I discovered a few challenges and questions when working with the Make.com tool:
- I needed to check multiple URLs - easy when using just one test URL!
- What happens if there is an error? One failed step stopped all subsequent checks, so I needed to cater for different website setups.
- I could not use the native HTML parser to check all three robots locations, so I needed a way to run some more complex queries.
- I kept hitting free limits when testing ideas, which caused delays. I wasn’t ready to commit since I wasn’t sure it would work in Make.
- It seemed there was no way to store data inside Make.
- I had to provide my own SMTP details.
In overcoming these challenges, I kept having to add more services. Natively, Make was able to take a list of URLs and sequentially process each value, running the full process per item (their ‘iterator’). I could also fetch the web page via their HTTP module and do some basic parsing of that content with their ‘Text parser’. The other steps required additional assistance from AI and other services.
Here are a few examples:
Store data
I created an Airtable account, which allowed me to build a spreadsheet of data that Make could integrate with. The iterator module could then grab a new URL and run the automation. Once complete, it started a new automation for the next URL. At the end of the process, I was able to save the robot values my app found in Airtable, along with the date it was last checked. On the upside, this allows me to view all the URLs and the findings in Airtable, acting as a report of the current statuses. Of course, I quickly moved into paid tier territory!
Process HTML
Due to the limitations of the built-in modules, I needed to use some custom JS to read the HTTP headers, find the X-Robots-Tag, and extract it. For this, I brought in yet another service called ‘Custom JS’, which allowed me to process JS and send a response back to Make. But since I’m not a developer, I turned to ChatGPT to help refine the JS I needed. It helped me write the logic. In this situation, my experience with code allowed me to brief ChatGPT clearly on what I needed; however, as I’ve learnt, you need to work incrementally to get the best results. While ChatGPT got the job done, I still needed to double-check the regex code it had written at https://regex101.com and make some adjustments (I think, like maths, it struggles with regex sometimes).
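As a flavour of what that Custom JS step does, here is a minimal sketch. The header shape (an array of name/value pairs) is an assumption about what the HTTP module hands over, and `extractXRobotsTag` is my own hypothetical name, not the exact script I used:

```javascript
// Minimal sketch: find the X-Robots-Tag header (case-insensitively) in
// an assumed array of { name, value } pairs and split its directives.
function extractXRobotsTag(headers) {
  const header = headers.find((h) => h.name.toLowerCase() === "x-robots-tag");
  if (!header) return null; // no header means crawlers fall back to "all"
  return header.value
    .toLowerCase()
    .split(",")
    .map((directive) => directive.trim());
}
```

So a header of `X-Robots-Tag: noindex, nofollow` comes back as a tidy list of directives, ready for the comparison step later on.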
I also had to create a separate custom JS script to process the robots.txt file. The main issue is that this adds yet another use of operations, as they call it, which incurs additional costs.
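In the same spirit, here is a simplified sketch of what the robots.txt script does. Again, this is not the exact code, and it only handles the common catch-all block (User-agent: *):

```javascript
// Simplified sketch: collect the Disallow rules that apply to all
// crawlers (the "User-agent: *" block) from a robots.txt body.
function parseRobotsTxt(text) {
  const disallow = [];
  let appliesToAll = false;
  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // strip comments, whitespace
    if (!line) continue;
    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    if (field.trim().toLowerCase() === "user-agent") {
      appliesToAll = value === "*";
    } else if (field.trim().toLowerCase() === "disallow" && appliesToAll && value) {
      disallow.push(value);
    }
  }
  return disallow;
}
```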
Send emails
This part was easy since we already use multiple email delivery services. I used https://www.mailgun.com in this setup.
The rest was then completed using the built-in conditional logic that Make excels at. The final setup ended up doing the following:
- Integrate with Airtable to retrieve the URLs to check.
- Run the automated process one URL at a time.
- Fetch the HTML with an HTTP request, handling any errors.
- Parse the HTML as text and extract the robots meta tag using a regular expression (regex), then output the result as data.
- Fetch the HTTP headers with a new HTTP request.
- Run the JavaScript with Custom JS to extract the tag and output it as an object for processing later.
- Lastly, fetch the robots.txt page with a final HTTP request and extract the settings.
- Fetch the correct settings from Airtable.
- Split the route into two steps:
- Compare the value in Airtable with what the app found. If a mismatch is found, trigger an email and include the URL and the correct and found values for each robot setting.
- Combine all the processed outputs and save them in the respective columns in Airtable.
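To give a flavour of the regex step above, here is a sketch of how the robots meta tag can be pulled out of raw HTML. The function name is hypothetical, and the pattern assumes the common attribute order (name before content), which is exactly the kind of edge case I had to double-check at regex101:

```javascript
// Sketch: pull the content of <meta name="robots" ...> out of raw HTML.
// Assumes the name attribute comes before content, as it usually does.
function extractMetaRobots(html) {
  const match = html.match(/<meta\s+name=["']robots["']\s+content=["']([^"']+)["']/i);
  return match ? match[1].toLowerCase() : null; // null = no tag found
}
```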
In summary, I used the following services:
- Make.com to manage the automation flow. Its AI assistant was useful when getting started, but became less helpful as the tasks became more complex.
- Airtable to store the list of URLs to be checked and record what the app found.
- Custom JS to execute JavaScript.
- ChatGPT to help write and refine the JavaScript.
- Regex101 to check that the text matches were correct.
- Mailgun to send the alerts.
Running all these services costs approximately $80 per month for a small number of operations. If I were to scale this to manage a larger database of URLs with team access, the cost would increase to over $200 per month.
Learnings
- It’s possible to create useful automations with existing tools, and they’re a good way to build proofs of concept.
- If you want to deviate from the existing integrations, you can quickly get stuck if you are not a developer. Be prepared for a lot of trial-and-error adjustments.
- AI can help, but you still need to know what you want from it. That said, I wouldn’t have been able to complete the project without its help (though of course, I could also have asked my development team!).
- While built-in functionality is helpful, I quickly discovered that I needed features that were not native, which led me to having to integrate with other third-party services.
- The cost quickly increased once I had to move from a free to a paid tier across multiple services, even for a small number of automations.
- Learning multiple tools to achieve my objective was enjoyable, but it will make it harder to pass on to someone else who has not been through the same learning curve.
- There’s hidden complexity and risk in maintaining multiple services: any one of them could break the whole setup if it changes its terms, rebrands, or shuts down.
Conclusion
I met my objective and built an automated tool that checks a list of websites to ensure the correct instructions for search engine crawlers are in place. However, it has its limitations, and it will not scale in a way that I would want or need.
What’s next?
For now, it fulfils a valuable role as a fallback to our usual checks, but if I wanted to build in more complex checks and automations, I would choose to rebuild it as a custom app. The up-front cost would be greater, but the ongoing cost would be lower as it scales. On the upside, planning and building a custom app would be simpler now that we have a useful MVP already collecting real-world data.