Re: [PATCH] Make SBuf::find_first_of O(N) and implement find_first_not_of

From: Kinkie <gkinkie_at_gmail.com>
Date: Mon, 16 Dec 2013 22:59:34 +0100

On Mon, Dec 16, 2013 at 5:40 PM, Alex Rousskov
<rousskov_at_measurement-factory.com> wrote:
> On 12/15/2013 04:54 AM, Kinkie wrote:
>
>> attached patch is from branch lp:~squid/squid/characterset/
>
>
>> + std::vector<bool> chars_; //std::vector<bool> is optimized
>
> According to STL documentation for std::vector<bool> specialization:
>
>> These changes provide a quirky interface to this specialization and
>> favor memory optimization over processing (which may or may not suit
>> your needs).
>
> This is kind of the opposite of what we want. We want access speed, not
> RAM or copying speed savings. I suspect we do not want std::vector<bool>
> here! The same docs suggest using a different element type (char,
> unsigned char, or custom) if a different optimization trade-off is
> needed. I think we should follow that advice if we want to optimize this
> (which was the primary point of doing a custom CharacterSet).

Confirmed via a quick test. On MacOS/clang, std::vector<unsigned char>
is exactly twice as fast as std::vector<bool>.

>> It extracts some of the work done in parser-ng to implement Alex'
>> Tokenizer API, bringing O(N) find-any{,-not}-of matching to SBuf.
>
>> Note: in IRC discussions, Amos mentioned find_first_of and
>> find_first_not_of not being naming convention compliant (that's right,
>> they are modeled after std::string). I'm fine with changing them as a
>> follow-up commit if this is the collegial decision; I'd avoid do to it
>> as part of this merge.
>
> I agree that the two should be renamed to use camelCase [as a follow up
> commit].

ok, thanks.

>> + * \return length() if all characters in the SBuf are from set
>
> and the implementation matches that documentation:
>
>> +SBuf::size_type
>> +SBuf::find_first_not_of(const CharacterSet &set, size_type startPos) const
>> +{
> ...
>> + debugs(24, 7, "not found");
>> + return length();
>> +}
>
> but STL docs say:
>
>> Return Value
>> The position of the first character that does not match.
>> If no such characters are found, the function returns string::npos.
>
> which makes more sense than returning length() IMO.

I had not checked the STL docs, sorry; I didn't remember that it
offered this method.
From the tokenizer's point of view, returning length() would make
sense (one less special case to handle). But it's not a big deal, I
can simply handle that case explicitly and this API may make more
sense for non-tokenizer users.

>> + CharacterSet(const char *label, const char * const c) : name(label), chars_(std::vector<bool>(256,false)) {
>> + size_t clen = strlen(c);
>> + for (size_t i = 0; i < clen; ++i)
>> + chars_[static_cast<uint8_t>(c[i])] = true;
>> + }
>
> Since we are using set.name [for debugging] without checking, please set
> name to "anonymous" or similar if the label parameter is NULL

Ok.

>> + size_t clen = strlen(c);
>
> This variable can be const.

Ok

>> + size_t clen = strlen(c);
>> + for (size_t i = 0; i < clen; ++i)
>> + chars_[static_cast<uint8_t>(c[i])] = true;
>
> Please use add() to add characters to a set.

Ok

>> + /// add a given char to the character set. c must be >= 0.
>> + CharacterSet & add(const unsigned char c) {chars_[static_cast<uint8_t>(c)] = true; return *this; }
>
> The "c must be >= 0" comment is redundant given that "c" is unsigned. If
> you are worried about folks adding negative signed chars, try adding a
> corresponding add(const signed char) method with a check.

Yup. Relic from when c was signed.
Regarding signed-to-unsigned conversions, I believe we have the
compiler catch them now, don't we?

>> + /// name of this character set
>> + const char * name;
>
> The comment does not really document the data member (the data member
> name itself says exactly the same thing already). Consider:
>
> /// optional set label for debugging ("anonymous" by default)

Ok

>> +#include <vector>
>> +
>> +class CharacterSet
>
> Missing class description. Consider:
>
> /// An optimized set of C chars,
> /// with a quick membership test and merge support.

Ok

>> + /// characters defined in this set
>> + std::vector<bool> chars_; //std::vector<bool> is optimized
>
> s/defined/present/ or s/defined/included/

Present sounds better

> * CharacterSet creation and merging (+=) are heavy operations that do
> not benefit from being inlined. I know it is a pain to deal with
> Makefile dependencies, and I do not insist on this change, but please do
> seriously consider placing them in .cc file.

If we were to remove them from the class definition but leave them in
the .h-file, would they still be inlined? (that's the easy way out).
Otherwise, I'll try to put them in a new .cc file.

> FWIW, the following problems can be solved easier or better in a .cc
> file. With a little bit of additional work, you can even remove #include
> <vector> from CharacterSet users that way!

I don't see how if we want to leave CharacterSet::operator[] inlined,
and that's arguably the most time-critical operation this code
contains.

>> + //precondition: src.chars_.size() == chars_.size()
>
> There is no such precondition. I should be able to merge ALPHA and
> NUMERIC sets, for example. Rewrite the += operator loop to simply
> this->add() every character in the src set. Use std::for_each or another
> <algorithm> for that if possible.

The choice to make the vector size fixed does have some sense.
std::vector::operator[] doesn't resize. If we want to make
variable-size vectors, then add() becomes heavy because it must
reserve() to resize chars_ if the vector is not big enough, and then
we should de-inline add.. I'm fine either way, I leave the choice to
you.

>> + CharacterSet(const char *label, const char * const c) : name(label), chars_(std::vector<bool>(256,false)) {
>
>> + std::vector<bool>::const_iterator s = src.chars_.begin();
>
>> + const std::vector<bool>::const_iterator e = src.chars_.end();
>
>> + std::vector<bool> chars_; //std::vector<bool> is optimized
>
> Please avoid code duplication by typedef-ing std::vector<bool> as Chars
> or similar and using the corresponding type name everywhere.

Done.

As soon as I have feedback on the open points, I'll implement them and
post a new review candidate.

Thanks!
  Kinkie
Received on Mon Dec 16 2013 - 21:59:43 MST

This archive was generated by hypermail 2.2.0 : Tue Dec 17 2013 - 12:00:11 MST