From 05932293f659c0edfaa0352208ff2a89712b71ac Mon Sep 17 00:00:00 2001 From: Vincent Sanders Date: Thu, 10 Apr 2014 11:56:15 +0100 Subject: Add filesystem based backing store --- Docs/source-object-backing-store | 194 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 194 insertions(+) create mode 100644 Docs/source-object-backing-store (limited to 'Docs') diff --git a/Docs/source-object-backing-store b/Docs/source-object-backing-store new file mode 100644 index 000000000..e55a99db3 --- /dev/null +++ b/Docs/source-object-backing-store @@ -0,0 +1,194 @@ +Source Object (low level) cache backing store +============================================= + +Introduction +------------ + +The source object cache provides a system to extend the life of source +objects (html files, images etc.) after they are no longer immediately +being used. + +Only fetch types where we have well defined rules on caching are +considered, in practice this limits us to HTTP(S). The section in +RFC2616 [1] on caching specifies these rules. + +To futher extend the objects lifetime they can be pushed into a +backing store where the objects are available for reuse less quickly +than from RAM but faster than retriving from the network again. + +The backing store implementation provides a key:value infrastructure +with a simple store, retrive and invalidate interface. + +Generic filesystem backing store +-------------------------------- + +Although the backing store interface is fully pluggable a generic +implementation based on storing objects on the filesystem in a +heirachy of directories. + +The option to alter the backing store format exists and is controled +by a version field. It is implementation defined what happens if a +version mis-match occours. + +As the backing store only holds cache data one should not expect a +great deal of effort to be expended converting formats (i.e. the cache +may simply be discarded). + +Layout version 1 +---------------- + +An object has an identifier value generated from the url (NetSurf +backing stores uses the url as the unique key). The value used is +obtained using nsurl_hash() which is currently a 32 bit FNV so is +directly usable. + +This identifier is adequate to ensure the collision rate for the +hashed url values (a collision for every 2^16 urls added) is +sufficiently low the overhead of returning the wrong object (which +backing stores are permitted to do) is not significat. + +An entry list is maintained which contains all the metadata about a +given identifier. This list is limited in length to constrain the +resources necessary to maintain it. It is made persistant to avoid the +overhead of reconstructing it at initialisation and to keep the data +used to improve the eviction decisions. + +Each object is stored and retrived directly into the filesystem using +a filename generated from a base64url encoding of an address +value. The objects address is derived from the identifier by cropping +it to a shorter length. + +A mapping between the object address and its entry is maintained which +uses storage directly proportional to the size of the address length. + +The cropping length is stored in the control file with the default +values set at compile time. This allows existing backing stores to +continue operating with existing data independantly of new default +setting. This setting gives some ability to tune the default cache +index size to values suitable for a specific host operating system. + +E.g. Linux based systems can easily cope with several megabytes of +mmaped index but RISC OS might want to limit this to a few megabytes +of heap at most. + +The files are stored on disc using their base64url address value. +By creating a directory for each character of the encoded filename +(except the last which is of course the leafname) we create a +directory structure where no directory has more than 64 entries. + +E.g. A 19bit address of 0x1 would be base64url encoded into AAAB +resulting in the data being stored in a file path of +"/store/prefix/data/B/A/A/BAAAAA". + +An address of 0x00040001 encodes to BAAB and a file path of +"/store/prefix/meta/B/A/A/BAABAA" + +Control files +~~~~~~~~~~~~~ + +control ++++++++ +A control file is used to hold a list of values describing how the +other files in the backing store should be used. + +entries ++++++++ + +this file contains a table of entries describing the files held on the +filesystem. + +Each control file table entry is 28 bytes and consists of + + - signed 64 but value for last use time + + - 32bit full url hash allowing for index reconstruction and + addiitonal collision detection. Also the possibility of increasing + the ADDRESS_LENGTH although this would require renaming all the + existing files in the cache and is not currently implemented. + + - unsigned 32bit length for data + + - unsigned 32bit length for metadata + + - unsigned 16bit value for number of times used. + + - unsigned 16bit value for flags + + - unsigned 16bit value for data block index (unused) + + - unsigned 16bit value for metatdata block index (unused) + +Address to entry index +~~~~~~~~~~~~~~~~~~~~~~ + +An entry index is held in RAM that allows looking up the address to +map to an entry in the control file. + +The index is the only data structure whose size is directly depndant +on the length of the hash specificaly: + +(2 ^ (ADDRESS_BITS - 3)) * ENTRY_BITS) in bytes + +where ADDRESS_BITS is how long the address is in bits and ENTRY_BITS +is how many entries the control file (and hence the while +cache) may hold. + +RISCOS values ++++++++++++++ + +By limiting the ENTRY_BITS size to 14 (16,384 entries) the entries +list is limited to 448kilobytes. + +The typical values for RISC OS would set ADDRESS_BITS to 18. This +spreads the entries over 262144 hash values which uses 512 kilobytes +for the index. Limiting the hash space like this reduces the +efectiveness of the cache. + +A small ADDRESS_LENGTH causes a collision (two urls with the same +address) to happen roughly for every 2 ^ (ADDRESS_BITS / 2) = 2 ^ 9 = +512 objects stored. This roughly translates to a cache miss due to +collision every ten pages navigated to. + +Larger systems +++++++++++++++ + +In general ENTRY_BITS set to 16 as this limits the store to 65536 +objects which given the average size of an object at 8 kilobytes +yeilds half a gigabyte of disc used which is judged to be sufficient. + +For larger systems e.g. those using GTK frontend we would most likely +select ADDRESS_BITS as 22 resulting in a collision every 2048 objects +but the index using some 8 Megabytes + +Typical values +-------------- + +Example 1 +~~~~~~~~~ + +For a store with 1034 objects genrated from a random navigation of +pages linked from the about:welcome page. + +Metadata total size is 593608 bytes an average of 574 bytes. The +majority of the storage is used to hold the urls and headers. + +Data total size is 9180475 bytes a mean of 8879 bytes 1648726 in the +largest 10 entries which if excluded gives 7355 bytes average size + +Example 2 +~~~~~~~~~ + +355 pages navigated in 80 minutes from about:welcome page and a +handful of additional sites (google image search and reddit) + +2018 objects in cache at quit. 400 objects from news.bbc.co.uk alone + +Metadata total 987,439 bytes mean of 489 bytes + +data total 33,127,831 bytes mean of 16,416 bytes + +with one single 5,000,811 byte gif + +data totals without gif is 28,127,020 mean 13,945 + +[1] http://tools.ietf.org/html/rfc2616#section-13 \ No newline at end of file -- cgit v1.2.3