Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data Structures in Riak

Data Structures in Riak

Since the beginning, Riak has supported high write-availability using Dynamo-style multi-valued keys – also known as conflicts or siblings. The tradeoff for this type of availability is that the application must include logic to resolve conflicting updates. While it is convenient to say that the application can reason best about conflicts, ad hoc resolution is error-prone and can result in surprising anomalies, like the reappearing item problem in Dynamo’s shopping cart.

What is needed is a more formal and general approach to the problem of conflict resolution for complex data structures. Luckily, there are some formal strategies in recent literature, including Conflict-Free Replicated Data Types (CRDTs) and BloomL lattices. We’ll review these strategies and cover some recent work we’ve done toward adding automatically-convergent data structures to Riak.

Basho Technologies

October 11, 2012
Tweet

More Decks by Basho Technologies

Other Decks in Technology

Transcript

  1. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzH

    YVKbIhEUl +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the   frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  text/plain Content-­‐Length:  102 Siblings: 16vic4eU9ny46o4KPiDz1f 4v5xOg4bVwUYZdMkqf0d6I 6nr5tDTmhxnwuAFJDd2s6G 6zRSZFUJlHXZ15o9CG0BYl
  2. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzH

    YVKbIhEUl +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the   frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  text/plain Content-­‐Length:  102 Siblings: 16vic4eU9ny46o4KPiDz1f 4v5xOg4bVwUYZdMkqf0d6I 6nr5tDTmhxnwuAFJDd2s6G 6zRSZFUJlHXZ15o9CG0BYl list of siblings
  3. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl

    +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the  frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  multipart/mixed;  boundary=YinLMzyUR9feB17okMytgKsylvh Content-­‐Length:  766 -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/x-­‐www-­‐form-­‐urlencoded Link:  </riak/test>;  rel="up" Etag:  16vic4eU9ny46o4KPiDz1f Last-­‐Modified:  Wed,  10  Mar  2010  18:01:06  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up" Etag:  4v5xOg4bVwUYZdMkqf0d6I Last-­‐Modified:  Wed,  10  Mar  2010  18:00:04  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up"
  4. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl

    +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the  frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  multipart/mixed;  boundary=YinLMzyUR9feB17okMytgKsylvh Content-­‐Length:  766 -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/x-­‐www-­‐form-­‐urlencoded Link:  </riak/test>;  rel="up" Etag:  16vic4eU9ny46o4KPiDz1f Last-­‐Modified:  Wed,  10  Mar  2010  18:01:06  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up" Etag:  4v5xOg4bVwUYZdMkqf0d6I Last-­‐Modified:  Wed,  10  Mar  2010  18:00:04  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up" all the values
  5. Semantic Resolution • Your app knows the domain - use

    business rules to resolve • Amazon Dynamo’s shopping cart
  6. Semantic Resolution • Your app knows the domain - use

    business rules to resolve • Amazon Dynamo’s shopping cart BAD
  7. Semantic Resolution • Your app knows the domain - use

    business rules to resolve • Amazon Dynamo’s shopping cart BAD “Ad hoc approaches have proven brittle and error-prone”
  8. WARNING This is a lot of math. Side effects may

    include dry mouth, itchy rash, and a desire to go back for a PhD.
  9. Monotonic Functions • Change in strictly a single direction •

    Consecutive values may be equal • Monotonic: Linear, Exponential • Non-monotonic: Quadratic, Sinusoidal
  10. Monotonic Functions • Change in strictly a single direction •

    Consecutive values may be equal • Monotonic: Linear, Exponential • Non-monotonic: Quadratic, Sinusoidal
  11. Monotonic Logic •Existing facts are never refuted •New facts can

    be added •“Knowledge only grows” “monotonicity of entailment”
  12. Bounded Join Semi-Lattice ʪS, ⊔, ⊥ʫ ⊔ is a least-upper

    bound function ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z
  13. Bounded Join Semi-Lattice ∀x, y ∈ S: x ≤S y

    㱻 x ⊔ y = y “partial order” ʪS, ⊔, ⊥ʫ ⊔ is a least-upper bound function ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z
  14. Bounded Join Semi-Lattice ∀x, y ∈ S: x ≤S y

    㱻 x ⊔ y = y “partial order” ∀x ∈ S: x ⊔ ⊥ = x “identity” ʪS, ⊔, ⊥ʫ ⊔ is a least-upper bound function ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z
  15. “Set” Lattice {a} {b} {c} {d} {e} {a,b} {b,c} {c,d}

    {d,e} {a,b,c} {c,d,e} {b,c,d,e} {a,b,c,d} {b,c,d} {a,b,c,d,e} Time S = all finite sets ⊔ = set-union ⊥ = {}
  16. • Vector clock is a lattice... Vector Clock S =

    all vectors of (Actor, Count) pairs ⊔ = All Actors, each with their max Count ⊥ = [] (empty vector)
  17. • Vector clock is a lattice... • ...but the associated

    Riak value is non- monotonic, Vector Clock S = all vectors of (Actor, Count) pairs ⊔ = All Actors, each with their max Count ⊥ = [] (empty vector)
  18. • Vector clock is a lattice... • ...but the associated

    Riak value is non- monotonic, • ...and the vclock is not meaningful to the client. Vector Clock S = all vectors of (Actor, Count) pairs ⊔ = All Actors, each with their max Count ⊥ = [] (empty vector)
  19. CRDT Flavors • Convergent (state-based) • One replica updates, then

    forwards entire state, downstream merges • Commutative (operation-based) • Only mutations (ops) communicated • Needs a reliable broadcast channel
  20. CRDT Types Registers LWW, MV Counters Positive, P/N Sets Grow

    only, Two-Phase, Observed-Remove Graphs 2P-2P Lists Growable-array Collaborative editing Treedoc
  21. •new/0 empty CRDT •value/1 the resolved value •update/3 mutate CRDT

    •merge/2 converge two CRDTs •equal/2 compare internal value CRDT Behaviour
  22. G-Counter •Simple version vector (28 LoC) [{ActorId,Count}] •Update: increment actor’s

    count •Merge: greatest value per Actor •Value: sum of Counts
  23. G-Counter new()  -­‐>        []. value(GCnt)  -­‐>  

         sum([Cnt  ||  {_Act,  Cnt}  <-­‐  GCnt]). equal(VA,VB)  -­‐>        lists:sort(VA)  =:=  lists:sort(VB).
  24. PN-Counter •2 x G-Counter •P - N = value {

       P  =  [{a,10},{b,2}],    N  =  [{a,1},{c,5}] } (10  +  2)  -­‐  (1  +  5)      =  12  -­‐  6      =  6
  25. Riak DT In Action •Bitcask storage per vnode •Value /

    Update FSM per request •Webmachine resource(s) e.g. GET  /counters/key
  26. Update FSM •Sync call update on vnode •Read, Local Update,

    Reply •Async send merge to replicas •Await W responses •Reply to client
  27. Value FSM (Read) •Async call value on all replicas •Await

    R replies •Merge all replies with merge/2 •Return merged value to client •Read Repair