Re: memory-mapped files in Squid

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Tue, 26 Jan 1999 19:00:40 +0300 (EETDST)

On 26 Jan 99, at 2:35, Henrik Nordstrom <hno@hem.passagen.se> wrote:

> The basic idea is a cyclical file system (similar to a huge FIFO) but
> with some active recoupling to maintain a good hit rate for a cache.

> > I'm afraid sequential writes are only possible when there is ample
> > free disk space, i.e. the cache is not full and fragmentation is low.
> > You can't count on that. Perhaps there are ways to ensure sequential
> > free space, but it wouldn't be trivial.
>
> I think you are stuck in the thinking of a normal filesystem here. It is

 What does this have to do with a normal filesystem? I do not mean ufs
 block suballocation here. In any case you have to manage free space and
 allocated space, whatever units you talk about: be they sectors, tracks,
 blocks or chunks. Of 10 objects Squid writes to disk, 4 expire in less
 time than the other 6. So in this chunk of 10 we have a hole of 4 that
 is suboptimal to fill in one go the next time the cycle wraps. And if
 we fill it anyway, then of these 4, 1-2 again expire sooner than the
 rest, and of the former 6 perhaps another 1-2. You end up with a chunk
 of 10 that has 2-3 small holes in it, and you have to handle this
 somehow. Otherwise, after some time you have totally random disk access
 again.
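 To make the hole argument concrete, here is a toy simulation (purely
 hypothetical, not Squid code; the slot model and the 40% short-lived
 ratio are made-up illustration parameters) of one pass over a cyclic
 log after the short-lived objects have expired:

```python
import random

# Hypothetical toy model: a cyclic log of fixed-size slots where
# roughly 40% of objects expire early, leaving holes behind.
random.seed(42)

SLOTS = 1000          # disk divided into equal-size object slots
log = [None] * SLOTS  # None = free slot (a "hole")
head = 0              # next write position in the cyclic log

def write_object(obj_id, short_lived_prob=0.4):
    """Write sequentially at the head, overwriting whatever is there."""
    global head
    log[head] = {"id": obj_id, "short": random.random() < short_lived_prob}
    head = (head + 1) % SLOTS

def expire_short_lived():
    """The ~40% of objects that expire sooner punch holes in the chunks."""
    for i, obj in enumerate(log):
        if obj is not None and obj["short"]:
            log[i] = None

# Fill the log once around, then expire the short-lived objects.
for n in range(SLOTS):
    write_object(n)
expire_short_lived()

# Count the free slots and how many separate holes they form.
holes = sum(1 for slot in log if slot is None)
runs = sum(1 for i in range(SLOTS)
           if log[i] is None and log[i - 1] is not None)
print(f"{holes} free slots scattered across {runs} separate holes")
```

 The free space is there, but it is scattered across hundreds of small
 holes rather than one contiguous run, which is the access-pattern
 problem described above.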

 Yes, you can ensure free space by dropping the objects that are left
 unexpired in such holey chunks and overwriting them with new objects,
 keeping writes sequential. But by doing this you leave more than 40% of
 the disks unused. Perhaps even worse, depending heavily on how fast new
 objects are coming in, expiring, etc.

 Or you can move objects around to pack them, freeing space, or use
 read-modify-write on chunks, thus concatenating multiple I/Os into 2.
 But I wouldn't call this trivial..
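 The read-modify-write idea can be sketched like this (a hypothetical
 illustration, not Squid code): read a whole chunk in one I/O, pack the
 live objects to the front, and write the whole chunk back in a second
 I/O, so many small hole-filling writes collapse into exactly two
 operations:

```python
# Hypothetical sketch of read-modify-write compaction of one chunk.
# A chunk is modeled as a list of slots; None marks a hole.

def compact_chunk(chunk):
    """Big read: the chunk arrives as a list of slots.  Drop expired
    objects, pack the live ones to the front, and return the packed
    chunk (for one big write-back) plus the number of free slots."""
    live = [obj for obj in chunk if obj is not None and not obj["expired"]]
    packed = live + [None] * (len(chunk) - len(live))
    return packed, len(chunk) - len(live)

# A chunk with two holes and one expired object still in place.
chunk = [{"id": 1, "expired": False}, None,
         {"id": 2, "expired": True}, {"id": 3, "expired": False}, None]
packed, free = compact_chunk(chunk)
print(packed)  # live objects first, free space contiguous at the end
print(free)    # 3 contiguous free slots
```

 The catch, of course, is that this burns read and write bandwidth on
 data that is merely being moved, which is part of why it is not trivial.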
 
> not that hard to rewrite the recycling policy in such a way that there
> always is a couple of minutes of contiguous free space for storing
> fresh objects.

 If you say so.

 If you know how to solve the problem of noncontiguous free space on a
 system that has random expiration of objects, and without sacrificing
 some >20% of disk space for it, then that's way cool. I'm all ears.
 I just can't think of a trivial way to do that, and that's what I admit.

> > In addition, I think that when accessing disk via cached interface, you
> > have no guarantees that disk ops are scheduled in the same order as you
> > dictate from application. It makes it harder to implement elevator
> > optimisations. But then again, cached interface to raw partition might
> > have other benefits.
>
> Above you effectively said that elevator optimisation won't give a
> performance gain..

 How's that? Let me try again.

 With raw direct disk I/O, nothing is cached by the OS; that is, all I/O
 is done directly to/from the process address space. As long as there are
 no competing requests to the same disk from other apps, disk ops are
 scheduled in the same order they are requested, and the position of the
 disk arm is predictable. The whole elevator seek optimisation relies on
 this.

 With a cached disk interface, I/O goes through the OS buffer cache; that
 is, every write is buffered and delayed, while every read is cached but
 scheduled immediately. You no longer have strict predictability of the
 disk arm's location over the platters. Even if the OS schedules writes
 in the same order they were presented, competing reads would move the
 disk arms randomly. This defeats the elevator optimisation, and you
 effectively rely on the OS to do the optimisation.
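 For reference, the elevator (SCAN) ordering being discussed looks
 roughly like this (a generic textbook sketch, not Squid's or any OS's
 actual scheduler; the cylinder numbers are a standard illustration):

```python
# Sketch of elevator (SCAN) scheduling: with raw I/O the application
# controls issue order, so it can sort pending requests by cylinder in
# the arm's current direction of travel instead of servicing them FIFO.

def elevator_order(pending, arm_pos):
    """Service requests at or above the arm on the way up, then the
    remaining ones on the way back down.  `pending` is a list of
    cylinder numbers; returns the service order."""
    up = sorted(c for c in pending if c >= arm_pos)
    down = sorted((c for c in pending if c < arm_pos), reverse=True)
    return up + down

pending = [98, 183, 37, 122, 14, 124, 65, 67]  # classic textbook queue
order = elevator_order(pending, arm_pos=53)
print(order)  # [65, 67, 98, 122, 124, 183, 37, 14]
```

 A cached read arriving mid-sweep would yank the arm out of this sorted
 sequence, which is the predictability loss described above.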

 If you meant my comment on readv, I was not saying that it makes
 elevator seeking useless; I meant that there is no need to use readv. I
 don't think the OS/app context-switch overhead is so big that you need
 to cluster these calls together into one syscall. I meant that there is
 no big difference between reading multiple blocks with readv in a single
 syscall and reading multiple blocks in multiple non-blocking syscalls.
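 The equivalence is easy to demonstrate (a small illustration using
 Python's os.preadv, which is available on Linux; the file layout and
 block size are made up): gathering several block reads into one
 scatter-read syscall fetches exactly the same bytes as issuing one
 pread per block, so the difference is only the syscall count.

```python
import os
import tempfile

BLOCK = 8

# Build a scratch file of four distinct 8-byte blocks: AAAA..DDDD.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"".join(bytes([65 + i]) * BLOCK for i in range(4)))
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    # One syscall: scatter blocks 1..3 into three buffers (Linux only).
    bufs = [bytearray(BLOCK) for _ in range(3)]
    os.preadv(fd, bufs, BLOCK)
    # Three syscalls: one pread per block, at the same offsets.
    singles = [os.pread(fd, BLOCK, BLOCK * (1 + i)) for i in range(3)]
finally:
    os.close(fd)
    os.unlink(path)

assert [bytes(b) for b in bufs] == singles  # identical data either way
print([bytes(b) for b in bufs])
```

 Whether one syscall or three is cheaper is purely a context-switch
 overhead question, which is the point being made above.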

> Basic idea is that writing is always done sequentially in a cyclical
> fashion (when the end is reached, restart at the beginning).
> Unfortunately we most likely need something extra besides this to avoid
> throwing away objects that we are interested in keeping in the cache,

 This is exactly what I meant by being nontrivial...
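 The nontrivial part can be shown in a few lines (a hypothetical sketch
 of the cyclic-FIFO policy under discussion, not anyone's actual
 design): the pure cyclic writer always overwrites whatever sits at the
 head, regardless of whether that object was worth keeping.

```python
# Sketch of a pure cyclic-FIFO store: write at the head; on wrap-around,
# evict whatever old object still lives there.

class CyclicStore:
    def __init__(self, slots):
        self.log = [None] * slots
        self.head = 0
        self.evicted = []          # objects thrown away by the wrap

    def add(self, obj):
        victim = self.log[self.head]
        if victim is not None:           # the nontrivial part: this
            self.evicted.append(victim)  # may be an object we wanted
        self.log[self.head] = obj
        self.head = (self.head + 1) % len(self.log)

store = CyclicStore(4)
for obj in ["a", "b", "c", "d", "e", "f"]:
    store.add(obj)
print(store.log)      # ['e', 'f', 'c', 'd']
print(store.evicted)  # ['a', 'b'] -- evicted regardless of their worth
```

 Any "something extra" to keep hot objects alive has to interfere with
 exactly this eviction step, which is where the design stops being a
 simple FIFO.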

> and for this I have three ideas which are emerging and I am trying to
> find a proper name for this. The full working name is something like
> chunked cyclical multilevel feedback file system which should give an
> idea of what is involved. multilevel is probably the wrong word here but
> I haven't figured out another word to use for that part yet, maybe
> garbage collecting is more appropriate for describing that part but the
> truth is somewhere in between.

 Hmm. I'm really interested in hearing more on that. Sounds promising...

 ----------------------------------------------------------------------
  Andres Kroonmaa                           mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online            Tel:  6308 909
  Tallinn, Sakala 19                        Pho:  +372 6308 909
  Estonia, EE0001    http://www.online.ee   Fax:  +372 6308 901
 ----------------------------------------------------------------------
Received on Tue Jul 29 2003 - 13:15:56 MDT

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:12:02 MST