Dealing with unprompted requests causing high bandwidth and request count.
18 replies
Last updated: Mar 14, 2022
V
I've already gotten great, personal assistance from Sanity with regards to investigating bandwidth/request concerns, but has anyone out there in the community had experience dealing with what appears to be unprompted requests (or rather, requests unprompted by humans) running up the tally?
I am seeing a few hundred people causing gigs of bandwidth and tens of thousands of requests despite using the API CDN and caching the rendered content server-side through PHP. Today, with fewer visits across all pages, and with more aggressive caching, the same queries resulted in more requests than on days when there were more visits.
To be clear: Sanity is running fine, and honoring every request; it isn't a Sanity issue -- but there are more requests being asked of it than I can account for, despite the code doing nothing more than looping through $client->fetch queries.
Mar 2, 2022, 1:18 AM
A
Hi Vincent. It sounds like you're only accessing Sanity from the server (not the client). If that's the case, I'd recommend making your dataset private to prevent anybody without an auth token from making requests. Only your PHP script will be able to make requests on your behalf.
You can find out more about securing dataset access here.
One thing to note is that when using an auth token to access your dataset, draft documents will be included.
You'll need to update your GROQ queries to exclude draft documents.
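As a rough sketch of both points together (the project id, dataset, and token value below are placeholders; the drafts filter is the standard path("drafts.**") pattern):
```php
<?php
require 'vendor/autoload.php';

use Sanity\Client;

// Placeholder credentials -- substitute your own project id, dataset, and token.
$client = new Client([
    'projectId'  => 'your-project-id',
    'dataset'    => 'production',
    'apiVersion' => '2021-08-31',
    'useCdn'     => true,
    'token'      => getenv('SANITY_READ_TOKEN'),
]);

// Authenticated requests include drafts, so filter them out in the query itself.
$published = $client->fetch(
    '*[_type == "news" && !(_id in path("drafts.**"))]{ title, publishedAt }'
);
```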
Mar 2, 2022, 3:45 PM
V
user E
Thanks for the response. We have a setup of flat-file PHP pages that each start with define('SANITYCLIENT', true) and then require the Sanity PHP client from a separate folder off the root. In that file there's a check for the constant before it allows the query.
We wanted to prevent direct access, or malware / a guessed password, from calling the client willy-nilly, as they'd neither have nor expect that constant. Instead, I could just add it to the couple of pages that actually need it.
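Roughly this shape, in other words (a minimal sketch of the setup described above; file and folder names are placeholders):
```php
<?php
// page.php (placeholder name): each flat-file page defines the constant,
// then pulls in the client from the folder off the root.
define('SANITYCLIENT', true);
require __DIR__ . '/../sanityclient/client.php';

// --- sanityclient/client.php (placeholder name) ---
// The client file bails out unless the constant was defined by a legitimate page.
if (!defined('SANITYCLIENT')) {
    http_response_code(403);
    exit('Direct access not permitted.');
}
```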
If we are getting flooded, I can't afford not to use the API CDN (useCdn), but would a token help here to authenticate if the only things running the fetch are the pages that need to?
The fetches themselves seem to only come from pages that request them; it's just that the number of fetches doesn't match the number of page loads -- I saw gaps of minutes between people visiting where the server is logged as having made multiple requests a half second apart.
Mar 2, 2022, 4:21 PM
V
I am using awstats in cPanel and it's reporting the same bandwidth sitewide as Sanity itself does for the day.
This is a big, old, sprawling site with, I think, five WordPress sites counted toward the results, all littered with pictures -- versus six to eight flat-file pages looping through 8-12 documents each, and one slider with fewer than ten images (that I went out of my way to both convert to JPG and then append with the JPG formatting URL parameter). The idea that it got worse consumption-wise with more caching and fewer visits is boggling my mind.
It makes me afraid to try more aggressive caching (fifteen minutes, say, instead of three) because it's almost like "saving" the rendered content is just... randomly re-firing queries off to Sanity instead? I've not seen anything like it, so I am wondering what the huge whiff is on my part. Is there a way to rate limit from the querying itself or something?
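For reference, the server-side cache is roughly this shape (a minimal sketch; the cache path, TTL, and render file are placeholders):
```php
<?php
// Hypothetical render cache: serve a saved copy if it's fresher than the TTL,
// otherwise rebuild the page (the only path that calls $client->fetch).
$cacheFile = __DIR__ . '/cache/' . md5($_SERVER['REQUEST_URI']) . '.html';
$ttl = 3 * 60; // three minutes, per the current setting

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
    readfile($cacheFile);
    exit; // cache hit: no Sanity query at all
}

ob_start();
require __DIR__ . '/render-page.php'; // placeholder: runs the $client->fetch loop
file_put_contents($cacheFile, ob_get_flush(), LOCK_EX);
```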
Mar 2, 2022, 11:39 PM
V
There is an unholy amount of bot traffic -- would a crawler generate a query the way a normal visit would? If it can, my inference is that we can't control the crawl or block the bots outright without harming SEO or the analytics... does that sound right?
Mar 2, 2022, 11:58 PM
A
Thanks for providing those additional details, Vincent.
You make a very good point about using an authentication token with the API CDN. Our API CDN now supports this, but I can see the PHP client hasn't yet been updated to reflect this change. We should update this, but it seems like you've ruled out the possibility that anybody is sending requests to Sanity directly.
> The fetches themselves seem to only come from pages that request them, it's just that the number of fetches doesn't match the number of page loads -- I saw gaps of minutes between people visiting where the server's logged as having made multiple requests a half second apart.
Can I ask how you're hosting your PHP service? For example, whether you use PHP-FPM. This comment makes me wonder if something could be going on with request handling or process pooling.
The other thing I wondered is whether it's possible that a file that's querying Sanity is being required multiple times in your app? That would explain why a single request to your site spawns multiple requests to Sanity.
It's tricky to debug much further without seeing your source code. Is that something you're able to share?
> it's almost like "saving" the rendered content is just... randomly re-firing queries off to Sanity instead?
That's very odd behaviour indeed. Unless there's a bug in sanity-php, these queries should only happen when explicitly called. This is another thing that makes me wonder if your app is somehow accidentally calling the function multiple times for each request (e.g. if you've inadvertently required a file that makes requests multiple times).
I'd personally be tempted to remove the SANITYCLIENT constant mechanism you have in place. If your server is compromised, I'm not sure this really provides much protection, and it will probably make it trickier for you (or other devs) to reason about your own code. Removing it might make it easier for you to spot any bugs that would cause multiple requests to be sent.
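To illustrate that failure mode concretely (assuming a hypothetical fetch.php that calls $client->fetch() at its top level):
```php
<?php
// A plain require re-executes the file every time, so any top-level query re-fires.
require 'fetch.php';      // query #1
require 'fetch.php';      // query #2 -- duplicated work, duplicated request
require_once 'fetch.php'; // no-op: PHP already has fetch.php on its include list
```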
Mar 3, 2022, 11:43 AM
V
Thanks again for taking so much time to investigate.
PHPINFO() https://gist.github.com/vincentjflorio/88c015d818d5564932cd63d52d6cbc9a
Example page: https://gist.github.com/vincentjflorio/c07f63970dae14e64ff2e8ae6ab16198
Slightly modified client: https://gist.github.com/vincentjflorio/b6147f6309c189e0c9df5679920284dd
These are the only two errors in the error log (below); we're not bringing in anything novel as far as serializers go, and our use of links is pulled from the docs and prints out fine on the front end.
I don't think the block content bit is re-running queries on its own; I actually don't know why it has to be fed the same parameters over again... maybe it doesn't and I was being too literal in pulling the samples?
Might there be any benefit to skipping the fetching part of the client and just visiting the query addresses?
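By "visiting the query addresses" I mean something like this (a rough sketch; the endpoint shape matches the timeout log below, and the query is a simplified version of the one in it):
```php
<?php
// Hypothetical direct GET against the query endpoint, skipping the client wrapper.
// Format: https://<projectId>.apicdn.sanity.io/<apiVersion>/data/query/<dataset>?query=...
$query = '*[_type == "news"]{ publishedAt, title, subtitle, body } | order(publishedAt desc)';
$url   = 'https://oqceb8ti.apicdn.sanity.io/v2021-08-31/data/query/production?query='
       . rawurlencode($query);

$body      = file_get_contents($url); // one plain HTTPS request
$data      = json_decode($body, true);
$documents = $data['result'] ?? [];   // query responses wrap the documents in "result"
```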
[22-Feb-2022 20:25:18 UTC] PHP Fatal error: Uncaught GuzzleHttp\Exception\ConnectException: cURL error 28: Connection timed out after 30000 milliseconds (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for https://oqceb8ti.apicdn.sanity.io/v2021-08-31/data/query/production?query=%2A%5B_type%20%3D%3D%20%22news%22%20%26%26%20%22Front%20Page%22%20in%20categories%5B%5D-%3Etitle%20%20%5D%20%7B%20publishedAt%2C%20title%2C%20subtitle%2C%20body%20%7D%20%7C%20order%28publishedAt%20desc%29 in /home/nihbweb/public_html/sioc/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php:210 Stack trace: #0 /home/nihbweb/public_html/sioc/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(158): GuzzleHttp\Handler\CurlFactory::createRejection(Object(GuzzleHttp\Handler\EasyHandle), Array) #1 /home/nihbweb/public_html/sioc/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(110): GuzzleHttp\Handler\CurlFactory::finishError(Object(GuzzleHttp\Handler\CurlHandler), Object(GuzzleHttp\Handler\EasyHandle), Object(GuzzleHttp\Handler\CurlFactory)) #2 /home/nihbweb/public_html in /home/nihbweb/public_html/sioc/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php on line 210
[22-Feb-2022 21:11:51 UTC] PHP Notice: Undefined index: href in /home/bewbhin/public_html/sioc/vendor/sanity/sanity-php/lib/BlockContent/HtmlBuilder.php on line 120
Mar 3, 2022, 4:32 PM
A
I can't see any obvious issues in your code. If you search your project for $client->fetch, is it being called anywhere other than in pages?
Mar 8, 2022, 12:06 PM
V
user E
No. It isn't anywhere else. I placed them manually, and when I downloaded the root and ran a grep tool on it recursively -- just in case I brain-farted and made a mistake -- I couldn't find anything, even when searching with regular expressions.
P.S. I talked to LiteSpeed support and they can't identify the cache as the source of the overly repeated/regurgitated queries.
Mar 8, 2022, 6:11 PM
A
Hmm! That's very peculiar indeed. Have you tried writing to a log each time you make a query?
Mar 10, 2022, 4:38 PM
V
user E
That's a great idea, actually, thanks. If it's getting triggered somehow tens of thousands of times, I might be worried about all those disk writes slowing down their already-weird hosting, but if it gives more diagnostic utility it's worth a shot. I might try to briefly augment the function parameters so the logs are genuinely being fired at the same time as the query itself. I'll run something tonight and report back.
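Something along these lines (a hypothetical wrapper; the function name and log path are placeholders):
```php
<?php
// Write a log line at the exact moment each query fires,
// then delegate to the stock client untouched.
function loggedFetch(Sanity\Client $client, string $query, ?array $params = null)
{
    file_put_contents(
        __DIR__ . '/sanity-queries.log', // placeholder path
        sprintf("[%s] %s %s\n", date('c'), $query, json_encode($params)),
        FILE_APPEND | LOCK_EX
    );
    return $client->fetch($query, $params);
}
```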
Mar 10, 2022, 4:59 PM
V
user E
Just following up after a couple of nights of testing. I think the "mystery" is the poor or limited disclosure in the server logs. Check out the difference just from targeting the three biggest offenders (below). I'm no statistician, but that definitely looks like a downward trend; the variation before, even for nights and weekends, wasn't a notable difference. One offender in particular, called PetalBot, is apparently a crawler ramping up to build a reference for an upcoming (or up-and-coming) search engine. But I got four crawls on the same page in a minute. That's craziness. And two of the most popular bots are location-specific to engines used in countries where our site is totally irrelevant, so we don't much care how the absence of the content affects our ranking there.
One, Semrush, we can't block, because apparently they have their fingers all in the information it provides; but by the numbers it's spam-like levels of activity page by page, and it definitely outnumbers the human visits going by this raw generated log, which is supremely annoying for Sanity-related purposes. But I think I can control the costs better now until someone gets in front of all this on their IT side.
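The blocking itself is nothing fancy -- roughly this, run before any Sanity code (a sketch; the user-agent list is illustrative, and Semrush is deliberately left alone):
```php
<?php
// Hypothetical guard: turn away the worst crawlers by user agent
// before the page ever reaches $client->fetch.
$blockedBots = '/PetalBot/i'; // illustrative list only
$userAgent   = $_SERVER['HTTP_USER_AGENT'] ?? '';

if ($userAgent !== '' && preg_match($blockedBots, $userAgent)) {
    http_response_code(403);
    exit; // no render, no query, no bandwidth
}
```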
Thank you, and everyone else, for the sustained attention to detail that helped me arrive at a solution. I was getting real anxiety from it.
Mar 13, 2022, 5:02 AM
V
Astounding! There was one in particular called "uptimerobot" that, so far as I can tell, nobody asked to put there. Obviously it's content-agnostic, since it's just checking whether something is up. Anyway, after that one change from last night, look at the difference below. It promises to run itself at least once every five minutes, so that's nearly 500 triggers of each query.
You can see the running total taper off its curve. It's especially interesting that the bandwidth doesn't taper off as much, since the bots visit the homepage more than anything and that's where our only pictures are; but there are no file attachments there like there are in other places, so maybe the bots represented "emptier" traffic and people are picking PDFs off the other pages?
In either case, I am certain this was the big cheese, so I am marking this extra solved.
Mar 14, 2022, 4:56 AM
A
Thank you for the detailed follow up, Vincent. I'm sure this would be an interesting read for other folks in the same situation. I'm glad you were able to resolve this by blocking some of the bots!
Caching would probably be helpful for you, too. If you're still experiencing issues getting it in place, please let us know.
Mar 14, 2022, 9:09 AM
V
user E
Thanks. This thread is longer and older now, but in my first message the caching I mentioned was server-side. They aren't super tech-savvy, so I would worry about doing, say, an hour at a time, as they'd be confused about not seeing instant changes; I made it five minutes so that I had a fallback explanation, with the Sanity API taking a nonzero amount of time to flush anyway. That definitely also helped: a better actual-user experience in terms of response, and the bot activity is just receiving old things instead of running up the score on me.
Mar 14, 2022, 3:16 PM