Roundup Tracker - Issues

Issue 2551036

classification
Title: Support rate limiting in REST (and xmlrpc) interfaces
Type: resource usage Severity: normal
Components: Web interface Versions: devel
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: rouilj Nosy List: rouilj, schlatterbeck
Priority: normal Keywords: rest

Created on 2019-03-24 01:54 by rouilj, last changed 2019-05-26 17:43 by rouilj.

Messages
msg6422 Author: [hidden] (rouilj) Date: 2019-03-24 01:54
At some point we should add a mechanism to limit REST and xmlrpc api
use. This has no impact on the API returned from the call. The
additional headers will simply be ignored by the client.

Suggest implementing the following headers:

    X-Rate-Limit-Limit: number of allowed requests in the current
         period (integer)

    X-Rate-Limit-Remaining: number of remaining requests in the
        current period (integer)

    X-Rate-Limit-Reset: number of seconds left in the current period
        (integer)

See: https://developer.twitter.com/en/docs/basics/rate-limiting.html
for examples. If more than X-Rate-Limit-Limit requests are done in the
period, return HTTP 429 “Too Many Requests” and a json/xml error
document like:

  {
    "error": {
        "msg": "Rate limit exceeded. More than X requests in Y seconds.",
        "status": 429
    }
}

The data elements associated with this function are:

  period: say 10 minutes
  request limit: 100 requests
  current-request count: must be less than request limit 
  period start: now() - this value > period, reset current-request count

These should be associated with a user in roundup.

Initial thought, period and request limit are values added to the user
object/schema.  The user should not be able to edit them.

current-request count and period start could be associated with the
session database (I think it has timeouts on session which would be
ideal for this). Or it could be stored in fields in the user object.
msg6472 Author: [hidden] (rouilj) Date: 2019-05-12 21:20
I have a POC implementation to rate limit the rest interface using the GCRA 
implementation used for issue2550949.  It returns the standard headers:


  X-RateLimit-Remaining: 55
  X-RateLimit-Reset: 3580.955768
  X-RateLimit-Limit: 60

plus a non-standard header that seems to be useful:

  X-RateLimit-Limit-Period: 3600

to provide the period for the X-RateLimit-Limit.

If the user has X-RateLimit-Remaining of 0 (i.e. they tripped
the rate limiter and their request was discarded with a 429)
it also returns a:

 Retry-After: 60

header which is the number of seconds to wait before an api call will be
available to used. Not waiting that long will continue to result in a
429 "Too Many Requests" response.

Also in the 429 case, it returns the payload:

  {
    "error": {
        "source": "ApiRateLimiter",
        "status": 429,
        "msg": "Api rate limits exceeded. Please wait: 60 seconds."
    }
  }

There are two new config settings:

  api_calls_per_interval default 3600
  api_interval_in_sec default 3600 (1 hour)

The examples above were done with api_calls_per_interval set to 60.

The code is written so that a tracker admin can replace the method
that generates the rate limiter. Rather rather than using the config
parameters, values could be stored in the user object in the database
for a modified method to use as the source of the data.

This allows a commercial user of roundup to sell different tiers of
service differing in rate limits.

The POC is not checked in. There are performance issues under high load
when storing the data needed to implement the GCRA.
msg6485 Author: [hidden] (rouilj) Date: 2019-05-25 04:05
I have done some work on this but it still has some caveats.

I am going to commit this, disabled by default, to get it wider
circulation and others to eval it.

Changes:

 backends/sessions_dbm.py: The open method will try to open the
    session database 15 times with a .01 second delay between trials.
    If it runs out of trials, it gives up and returns an error to the
    client. This mitigates the exception:
       _gdbm.error: [Errno 11] Resource temporarily unavailable
    that was raised under high load.

 rest.py: added getRateLimit method to RestfulInstance which returns a
    RateLimit object or None. It uses two config parameters:
    WEB_API_CALLS_PER_INTERVAL and WEB_API_INTERVAL_IN_SEC.  If either
    is set to 0 rate limiting is disabled.  It was added as a method
    to allow a tracker admin to override the simple same rate limit
    for each user and create a replacement that can retreive rate
    limits on a per user basis from the user object.

    Added rate limiting code to the top of dispatch method.
    Data that must persist for the user using the rest
    interface is saved to the session dbm backed database.

 configuraton.py: Gets the new config vars and explanations.
    Default for api_calls_per_interval in the web section is 0,
    so this disables rate limiting.

 doc/rest.txt: describes the settings and how they affect rate limiting
    burst and recovery rates. Also discusses that there are issues with
    the current implementation.

I have run 50 parallel curl operations against a roundup-server
instance. In 5 trials I got a db error in 12 cases.  So the dbm retry
code cut that from 50% failure rate.

However for the rate limit I was testing with, I should have gotten
more 429 (rate limit exceeded) errors. It looks like I lost about
5-10% of the updates from the database. Hence it took more than the
expected number of successful connections before the rate limits
kicked in. I am not quite sure why this was happening. Locking failure
on the session db could explain it, but I can't prove it.

Logging did report that I was retreiving the same timestamp from the
session databases for multiple connections, but that could be some
other issue (failure to invalidate a cache) in the code as well.

Even in this state, I think it will limit bad actors although not with
the precision I had hoped.
msg6486 Author: [hidden] (rouilj) Date: 2019-05-26 17:43
Committed as: rev5732:0e6ed3d72f92 

Fixes to testing for the rest code are in rev5733:62bdcb874433 and
rev5734:5cd9ac3daed7.  These patches attempt to test the retry code that
wraps the opening of an existing database. All of the tests
for otk/session db create a new database during the test. The
session/otk db class is never reloaded. So the code path that runs
is always the one that creates and opens the database and never
the one that opens an existing database.

I haven't figured out how to cause the open call to fail to trigger the
exception handling routines. I could use a pointer from somebody with
better python fu than I have.
History
Date User Action Args
2019-05-26 17:43:13rouiljsetmessages: + msg6486
2019-05-25 04:05:18rouiljsetstatus: new -> open
assignee: rouilj
messages: + msg6485
2019-05-12 21:20:07rouiljsetmessages: + msg6472
2019-03-24 01:54:30rouiljcreate