[PATCH] Large Rock and Collapsed Forwarding

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Tue, 31 Dec 2013 12:40:16 -0700

Hello,

    The attached patch contains initial Large Rock and Collapsed
Forwarding support, also available as a Launchpad branch:
lp:~measurement-factory/squid/collapsed-fwd/

Large Rock: Support disk (and shared memory) caching of responses
exceeding one db slot (or one shared memory page) in size. A single db
slot/page size is still limited to 32KB (smaller values can be
configured for disk caches using the newly added cache_dir slot-size
option). Removal of old rock cache dir (followed by squid-z) is required
-- the on-disk db structure has changed.

Collapsed Forwarding: Optionally merge concurrent cachable requests for
the same URI earlier: After the request headers have been parsed (as
before), but now _before_ the response headers have been received.
Merging of requests received by different SMP workers is supported.
Controlled by the new collapsed_forwarding directive in squid.conf.
Disabled by default because all but one of the merged requests have to
be delayed (until the response headers are received) for the merging to
work, which may be worse than forwarding all concurrent requests
immediately. The overall feature idea and request eligibility conditions
are based on Collapsed Forwarding in Squid2.

Here is a summary of other important changes (the branch log contains
the details and I propose that the branch is merged to preserve them):

* Tightened StoreEntry locking. Split StoreEntry::lock() into "just
lock" and "update entry reference time" interfaces, addressing an old
XXX. Improved entry lock/unlock debugging. Needs more work.

* Adjusted StoreIOState::write() API to allow callers detect write errors.

* Simplified MemObject::write() API to remove an essentially unused
callback.

* Mark client streams that sent everything as STREAM_COMPLETE. The old
code used STREAM_UNPLANNED_COMPLETE if the completed stream was
associated with a non-persistent connection, which did not make sense to
me and, IIRC, led to store entry aborts even though the entries were not
damaged in any way.

* mem_hdr::hasContigousContentRange() now returns true for empty ranges.

* Support "appending" ReadWriteLock state that can be shared by readers
and the writer. The writer promises not to update key metadata (except
growing object size and next pointers) and readers promise to be careful
when reading growing slices.

* Fixed StoreEntry::mayStartSwapOut() logic to handle terminated swapouts.

* Improved STORE_MEM_CLIENT detection and documented known (and mostly
old) StoreEntry::storeClientType() problems.

* Removed StoreEntry::hidden_mem_obj hack.

* Polished StoreEntry debugging to report more info, less noise. Use e:
prefix.

* Added a script to extract store entry(ies) debugging from cache.log.
Needs more work.

The latest code passes built-in Squid test cases, some of its revisions
have been independently tested in at least two labs, and its earlier
incarnations have been successfully deployed in at least one production
environment. However, given their size and complexity, the proposed
changes are likely to introduce at least some instability and
regressions if accepted to trunk. We will need to monitor and address
those problems as needed.

This patch does not contain several large/important Store changes that
we still should make [soon], including:

* Optimize rock db loading when Squid starts.

* Support Vary caching in shared memory.

* Fully support cached HTTP response header updates.

* Stop using the Store class as a kitchen sink for Store-related APIs.

* Better names and clear role separation for classes responsible for
memory and disk caches (where "disk cache" is "all cache_dirs taken
together and treated as a single cache").

* Migration of store_table global into something that better integrates
with SMP needs (probably similar to the proposed Transients but without
SMP overheads).

* Refcounting of StoreEntries.

Given the current demand for Large Rock and Collapsed Forwarding
changes, combined with the amount of time it will take to implement the
missing changes listed above, and the size/complexity of the already
implemented changes, I think it is better to merge the implemented
changes now than wait for more changes to come.

An earlier version of this code was posted to squid-dev as a PREVIEW a
few months ago. Amos kindly reviewed that posting. My response to his
review is below.

On 07/03/2013 06:36 AM, Amos Jeffries wrote:

> * Please remove HERE macro from all the new code debugs().

Done. Together with this change, about 50 HEREs are now removed :-).

> in MemBlob.*
> * this API change could go in immediately and shrink the branch diff a
> little.

Not done: I am generally against adding unused APIs and the reduction in
size would be negligible. If the patch is approved, I plan to commit
this change separately though.

> in StoreEntry
> * Since you are changing the API to StoreEntry::lock()/unlock() can you
> please take the opportunity to migrate them both to different method
> names and use the base/Lock.h API for the background counting mechanics.
> That will allow proper ref-counting Pointer to cleanup more of those
> places we are using nasty "protect-our-feets" lock hacks like that bug
> 3480 which you point out in the store comments.

Only partially done: The StoreEntry locking API is now even more strict,
with all raw lock hacks removed. I agree that reusing base/Lock is a
good idea, but:

The Lock class may [unnecessary] increase the StoreEntry class size
because it uses an int-based counter while StoreEntry uses a short-based
one. We can solve that problem by turning Lock into a template, but that
probably warrants a separate discussion and can be done as a separate
project.

With or without Lock reuse, we are still very far from using proper
refcounting for StoreEntries. That huge change is needed, but should be
done separately IMO because it will change lots of otherwise irrelevant
code and because avoiding circular references may be tricky. This is one
of the remaining API TODOs listed above.

> in CollapsedForwarding.cc:
> * please sort the #include lines alphabetically as per the guidelines

Done, I hope.

> * the TODO "add proper message type?" seems to already be done.

Comment removed.

> in src/store_dir.cc
> * adding empty line at chunk ~818

Removed (PREVIEW patches are not supposed to be audited for that though;
I did not check for those things and other formatting issues when
posting PREVIEW).

> in Transients.cc:
> * #include ordering again.

Fixed.

> * there are several occurances of double-empty lines in this file (maybe
> elsewhere in the new code too).

Fixed?

I manually removed a few double-empty lines between methods (and, I am
guessing, the source-maintenance script removed others), but I actually
like double-empty lines to separate large parts of the code (e.g.,
#include statements, global variable declarations, and methods). We
should be consistent, but I am not sure a general double-line
prohibition is such a good idea.

> * unless there is a reason to use signed <0 values for
> Transients::readers [and StoreController::transientReaders] please make
> their return values an unsigned type.

They are signed because they return a signed value (Atomic::Word). It is
probably possible to make Atomic::Word unsigned or use a different
atomic type for ReadWriteLock counters, but that is existing and tricky
code that I would rather not change in this patch unless really needed.

> * please document why Transients::maintain() is an empty function. Same
> for several of the other no-op members.

Added comments.

> * why should Transients::search() result in fatal()?

For consistency. All similar search() methods do that at the moment (and
nearly all search() methods call fatal() if search URL is specified
while ignoring the search HttpRequest completely). I suspect this was
some unfinished Store API that is currently unused: The Store root
intercepts the search() calls, returning a store_table index. There are
already XXXs in place to mark this problem, and it is not related to
Large Rock or Collapsed Forwarding changes.

> surely it would
> allow more consistent storage area searching if it were another no-op
> function and/or Transients could be treated like a cache (at least for
> lookups).

Possibly, but I would rather not speculate what the finished search
interface will look like, what the callers should expect to find, and
whether Transients would satisfy caller needs best by retuning NULL or
something meaningful.

> * why is Transients::EntryLimit setting a fixed 16K limit on concurrent
> requests? surely this should be the max_filedescriptors limit to handle
> the case where 100% of FD are consumed by client requests?

I think 16K is better than the file descriptors limit, for now:

* Transients pre-allocate their storage (because of difficulties making
dynamically allocated storage SMP-safe).

* Many admins over-specify the FD limit (I have seen values exceeding 100K)

* Most Squid workers cannot handle more than 16K concurrent connections
efficiently.

* The vast majority of collapsed forwarding users will not have more
than 16K concurrent collapsable requests.

* Hitting the limit should not break Squid -- it will just collapse
fewer requests.

Eventually, we may make that constant configurable, but I do not think
we need to do it before we understand what user needs are in this area.
I would not be surprised if collapsing too many requests can be abused
as a DoS attack vector of some sort, for example. Let's get some
feedback before we start optimizing/improving this.

> * please be consistent with the Squid style of function definitions
> having return type on the line above function name. The TransientsRr
> members are all using the more normal definition style.

Fixed both members.

> in src/Transients.h:
> * the terminal #endif comment is not matching the initial file wrapper
> definition.

Fixed.

> * is Transients class really about cache HIT? surely it is about MISS or
> REFRESH which are not yet fully cached.

Both "hit" and "miss" terminology is rather misleading in a collapsed
forwarding case. One of the N collapsed transactions is going to the
origin server, so we could call it a miss, but the remaining N-1
transactions are reading from the same store entry as if it was a hit.
Neither hit nor miss can describe the action as a whole.

I rephrased Transients description in hope to avoid this confusion:

> /// Keeps track of store entries being delivered to clients that arrived before
> /// those entries were [fully] cached. This shared table is necessary to sync
> /// the entry-writing worker with entry-reading worker(s).
> class Transients: public Store, public Ipc::StoreMapCleaner

> in src/client_side_reply.cc
> * please take advantage of touching the debugs() in
> clientReplyContext::identifyStoreObject to remove incorrect
> "clientProcessRequest2:" and HERE in the statements altered.

Done.

> * regarding "TODO: why is !.needValidation required here?"
> - because must-revalidate and other equivalent transactions MUST be
> passed to the origin for validation on every client request. We are not
> permitted to collapse them.

Those two things seems rather independent to me:

  * Request A must reach the origin server.
  * Request B may use (collapse on) A's response.

The MISS code with that TODO comment is about making the [future]
response available to future collapsed requests. It is not about
collapsing the current request (that happens earlier -- in the HIT
handling part of the code).

I polished the TODO comment to reflect my concern that we may be
needlessly prohibiting [future] collapsing there.

> in src/fs/rock/RockForward.h:
> * after updating to the latest trunk you can safely call this file
> forward.h like the rest of the forward declaration files in Squid.

Not yet. Unfortunately, fs-specific files in src/fs/* directories should
be named with a fs-specific prefix because of how src/fs/Makefile.am is
written (and for consistency with other source files in src/fs/*).

> * it has some double-empty lines to remove as well

Removed.

> in src/fs/rock/RockIoState.cc:
> * s/diring/during/

Fixed.

> * please format the documentation for Rock::IoState::tryWrite a bit
> cleaner with "*" as per the rest of the code block comments. Remembering
> that the line starting "/**" if containing text will be used in
> isolation as a one-liner brief description.

Done.

> * Rock::IoState::writeToBuffer can use \return please.

Done.

> in src/fs/rock/RockRebuild.cc
> * why do you seem to be removing maxObjectSize() from Rock store? That
> max-size option should remain even if you are moving to slot-size for
> slot management. If you are retaining maxObjectSize() and I missed it,
> why are the rock store stats dump removing it? AFAICT that function
> should dump both slot-size and max-size details.

Max-size is not being removed. Max-size is an option supported for all
cache_dirs, not just rock. The code dealing with max-size should not be
inside fs/rock. The changes in this patch scope only remove abuse of
max-size for purposes specific to small rock.

> * contains a block of bulk documentation which appears to be for the
> Rebuild constructor but is separated by two empty lines so I'm going to
> assume its not intended to be that way. Please consider formatting it as
> a doxygen module page (search for "defgroup" under src/ for examples) -
> prefix this block with:
> "
> /**
> * \defgroup RockFsRebuild Rock Store Rebuild
> * \ingroup Filesystems
> *
> "
> ... and use \section (new sub-section) and \par (new paragraph) as
> necessary for the main textual formatting.

Done. These are brief "developer notes", not a full-blown API
documentation with sections and such though. FWIW, replacing empty lines
between paragraphs with \par and adding "*" in front of every text line
makes future editing very difficult for me, so these kind of changes
reduce the [already naturally low] probability of me updating these
notes. Perhaps that just means that I need a better editor, but it does
not change the outcome.

> I'm also seeing a lot of formating issues. Do you have astyle 1.23 to
> run the scripts/source-maintenance.sh over the branch?

Done, but I reverted script changes to files that were not modified
before the scripts/source-maintenance.sh run -- I suspect that script is
not very "stable" across build environments.

Thank you,

Alex.

Received on Tue Dec 31 2013 - 19:40:19 MST

This archive was generated by hypermail 2.2.0 : Wed Jan 01 2014 - 12:00:13 MST