Issue 2551288: Missing cache headers for REST collection and special endpoints

classification

Title:	Missing cache headers for REST collection and special endpoints
Type:	behavior	Severity:	normal
Components:	API	Versions:

process

Status:	new	Resolution:
Dependencies		Superseder:
Assigned To:		Nosy List:	rouilj
Priority:		Keywords:	rest

Created on 2023-08-02 21:22 by rouilj, last changed 2023-08-02 21:22 by rouilj.

Messages
msg7821	Author: [hidden] (rouilj)	Date: 2023-08-02 21:22

A GET operation on the REST API endpoints /rest/data/<class>/<id> and
/rest/data/<class>/<id>/<property> returns an ETag header for the object
referenced by the <id>.

However, /rest/data/<class>, /rest/summary and /rest/data do not send any cache headers
at all. This makes them unfriendly to caching servers as well as local browser caches.

For the /rest/data endpoint, only a schema change will invalidate cached
data.

Supplementary classes (status, user, keyword) change rarely
compared to the primary classes (issue, msg, file, query), so supplementary
classes would benefit from a longer cache time.

It makes sense that /rest/summary may not have a cache time as it reports the status of
the latest issues. But even here, a 5-minute window, or something based on the time since the
last change of any issue, would be reasonable to reduce hits on the database.

Because PUT/POST is not allowed on these endpoints, a long cache time will
not result in a lost update problem. However, for primary classes it could
result in an incomplete picture of the available data.

Would adding caching directives for these endpoints be useful for reducing database load?
I don't know how often they are queried, but I would expect an index page for issues to be
queried often.

Will caching these cause more issues with cache invalidation? If so, should we use a
must-revalidate or a no-cache/no-store directive on these endpoints?
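
To make the options in this question concrete, here is a minimal sketch of the Cache-Control values each choice would produce. The function name cache_control and the policy strings used as arguments are hypothetical and not part of Roundup; only the directive names themselves come from the HTTP caching specification.

```python
def cache_control(policy, max_age=0):
    """Return a Cache-Control header value for a (hypothetical) policy name."""
    if policy == "no-store":
        # Caches must not keep a copy of the response at all.
        return "no-store"
    if policy == "no-cache":
        # A copy may be stored, but it must be revalidated before each reuse.
        return "no-cache"
    if policy == "must-revalidate":
        # Reusable for max_age seconds; once stale it must be revalidated.
        return "max-age=%d, must-revalidate" % max_age
    raise ValueError("unknown policy: %r" % (policy,))


# Example: a 5 minute window such as the one suggested for /rest/summary.
header_value = cache_control("must-revalidate", 300)
```

Revalidation only saves work if the response also carries a validator such as an ETag, which, per the report above, these endpoints do not currently send.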

Assuming the data can be cached, how do we specify/determine a max-age time per
collection/special endpoint? Because roundup isn't always a long-lived
process, we need to store dynamic cache info somewhere.

Would (ab)using the session database to store cache time data work? For example:

API-CACHE-/rest/msg = [302400, 604800, 604800, 604800, 604800, 604800, 1691008630]

[ current max-age,
interval (in sec) for 5th last change,
interval for 4th last change,
...,
interval for last change,
timestamp (in sec) of last change ]

where the middle N (N<=5) numbers are intervals in seconds during which there was no change
in the underlying class. In this case, messages were added exactly one week apart.
The last number is the timestamp in seconds (UTC timezone) of the last change.
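
As a rough illustration of how such a record could be maintained, the sketch below shifts the interval list when a change arrives. The function name record_change and the constant MAX_INTERVALS are hypothetical; reading and writing the session database is deliberately left out, and the max-age slot is copied unchanged because deriving it is a separate step.

```python
import time

MAX_INTERVALS = 5  # the proposal keeps at most five intervals


def record_change(record, now=None):
    """Return a new record after a change at time `now` (seconds, UTC).

    Record layout: [max_age, interval_1, ..., interval_N, last_change_ts]
    """
    now = int(time.time()) if now is None else int(now)
    max_age, intervals, last_change = record[0], record[1:-1], record[-1]
    # Append the quiet period that just ended and drop the oldest
    # entries so that at most MAX_INTERVALS remain.
    intervals = (intervals + [now - last_change])[-MAX_INTERVALS:]
    return [max_age] + intervals + [now]


# The example record from above, followed by a change one day later.
old = [302400, 604800, 604800, 604800, 604800, 604800, 1691008630]
new = record_change(old, now=1691008630 + 86400)
```

With the example record, the oldest one-week interval is dropped, a one-day interval (86400) is appended, and the timestamp advances to 1691095030.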

The current max-age is calculated from this list using, for example, one of the
following (the factor 1/2 is an arbitrary magic number):

* 1/2 the smallest value
* 1/2 the median value

Because of the effect of large outliers on average values, I don't think 1/2 the average
is a good metric. I store the max-age in the record when the update is done so that it is not
calculated while a client is waiting. I assume this record will be read more often than it is written.
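
The two candidate calculations can be sketched as follows. The function name compute_max_age and its parameters are hypothetical, and the default factor of 0.5 is the arbitrary 1/2 mentioned above.

```python
from statistics import mean, median


def compute_max_age(intervals, factor=0.5, use_median=False):
    """Derive a max-age in seconds from the recorded change intervals."""
    if not intervals:
        # No history yet: do not let caches reuse the response.
        return 0
    base = median(intervals) if use_median else min(intervals)
    return int(base * factor)


# The example record: five intervals of exactly one week.
weekly = [604800] * 5
weekly_max_age = compute_max_age(weekly)

# One long quiet week among hourly changes: the outlier inflates the
# average but leaves the smallest and median values untouched.
bursty = [3600, 3600, 3600, 3600, 604800]
half_min = compute_max_age(bursty)
half_median = compute_max_age(bursty, use_median=True)
half_mean = int(mean(bursty) * 0.5)
```

For the weekly example this gives 302400, matching the first element of the sample record. For the bursty list, half the smallest value and half the median both give 1800 seconds, while half the average gives 61920 seconds, which is the outlier effect described above.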

For /rest/data, looking at the schema file might work, but if the schema is built
from imported files, this will fail to capture changes to the imported files that
would change the schema. Also, checking the file date on every request would be expensive.

For those using the REST interface, do you have suggestions on how important this is?
How does your client code cache this info?

History
Date	User	Action	Args
2023-08-02 21:22:46	rouilj	create
