
OpenDaylight DataStore Analysis

  There is very little material online about the internal architecture and implementation details of the DataStore; most of it covers how to use the DataStore, rarely the data flow from a client read/write through memory down to disk, and there is no performance analysis either. Since that does not satisfy my need to understand the "why", I had to dig in myself. The official documentation, however, is scattered and messy, and the few questions I asked in the community got no response, so I have little to say in its favor. The following is what I pieced together from documents and source code; the version is Beryllium. If anything is wrong, please point it out.
  This article first sketches the overall MD-SAL architecture and then covers the DataStore, including its in-memory DB and on-disk DB. The in-memory DB, the so-called In-Memory DataStore, holds two trees: the Operational Tree and the Config Tree. The on-disk DB is the in-memory DB serialized to disk; it consists of a snapshot and a journal and is used to restore the in-memory DB after a restart.

1.MD-SAL

  MD-SAL is the core component of the controller. Its main characteristics are unified northbound/southbound interfaces, a YANG-model-driven design, and a producer-consumer model. It is mainly used to route RPCs and Notifications and to provide the unified DataStore. Almost all data exchanged is described by YANG models. The following figure shows the main architecture:
[Figure: MD-SAL architecture]

The BA Broker mainly relays messages; the Binding Generator, including the YANG parser, generates the corresponding Java code from YANG files; the Schema Repository stores the YANG-to-Java mapping; the BI Broker is the main message-processing hub: it connects Providers and Consumers and passes RPCs, notifications, and data change messages; the BI Data Repository is where the DataStore lives, storing the tree-shaped data.
  The BI Data formats mainly comprise two types, DOMNode and NodeModification: the former is the node type stored in the DataStore, the latter the operation class that modifies a node.
[Figure: BI data format – DOMNode and NodeModification]

  The BI Broker provides the following four functions; I will mainly describe the fourth, since it is the most relevant to the DataStore.

  1. Provider and Consumer registration.
  2. Notification Hub.
  3. RPC routing.
  4. System state access & modification. The BI Broker accesses DataStore data using a two-phase commit protocol. Three roles are involved: the Initiator, the Coordinator, and the Cohorts. The Initiator is whoever starts the operation (for example, a write issued from our service code); the Coordinator is the BI Broker; the Cohorts receive the command, here the Transaction (in MD-SAL, DataStore data should be accessed through a Transaction; it can be bypassed, but using a Transaction is preferred). Two-phase commit proceeds as follows:
      Phase one:
        The Coordinator sends a commit-request to all cohorts.
        The cohorts execute the transaction (commit).
        The cohorts reply with success or failure.
      Phase two:
      The coordinator completes the transaction only when every cohort has replied success; on failure:
        The coordinator sends a rollback command to all cohorts.
        All cohorts roll back the transaction.
        All cohorts send an acknowledgement back to the coordinator.
        The coordinator rolls back the transaction and returns an error to the initiator.

  In the DataStore section I will show how two-phase commit is used in the In-Memory DataStore.
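
  To make the message flow concrete, here is a minimal, hypothetical sketch of the coordinator/cohort interaction described above. The Coordinator and Cohort names only mirror the roles in the text, they are not ODL classes, and a real implementation would add asynchrony and timeouts.

import java.util.ArrayList;
import java.util.List;

// Hypothetical roles only; these are not OpenDaylight classes.
interface Cohort {
    boolean requestCommit();   // phase 1: prepare/execute the transaction, vote yes or no
    void commit();             // phase 2: make the result final
    void rollback();           // phase 2 (failure path): undo the prepared work
}

final class Coordinator {
    /** Runs two-phase commit; returns true if every cohort committed, false if rolled back. */
    boolean run(List<Cohort> cohorts) {
        List<Cohort> prepared = new ArrayList<>();
        boolean allOk = true;

        for (Cohort c : cohorts) {                // phase 1: collect votes
            if (c.requestCommit()) {
                prepared.add(c);
            } else {
                allOk = false;                    // one failure aborts the whole transaction
                break;
            }
        }

        for (Cohort c : prepared) {               // phase 2: finish or undo
            if (allOk) {
                c.commit();
            } else {
                c.rollback();                     // each prepared cohort undoes its work
            }
        }
        return allOk;                             // the initiator learns the final outcome
    }
}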

2.DataStore

  The DataStore consists of an in-memory DataStore and an on-disk DataStore. The figure below is a simple sketch: the Config Tree of the in-memory DataStore is synchronized to disk, while the Operational Tree is not.
[Figure: DataStore overview]

2.1 In-Memory DataStore

  The main characteristics of the In-Memory DataStore are: 1. data types defined by YANG; 2. two-phase commit; 3. asynchronous reads and writes; 4. concurrent, mutually isolated operations; 5. Transaction-based access; 6. data change notifications; 7. a Config Tree and an Operational Tree.
  The base class of the nodes in the two trees of the In-Memory DataStore, the Config Tree and the Operational Tree, is NormalizedNode. A node class keeps "pointers" to its parent and child nodes, which makes traversal and lookup convenient. The following node types, corresponding to the data types of the YANG model, inherit from NormalizedNode:
[Figure: NormalizedNode type hierarchy]

In effect the node types fall into two groups: leaf types (LeafNode and LeafSetEntryNode), which carry the actual values, and container types, whose children are further NormalizedNode instances and which form the interior of the tree.
  Modifying the tree follows three steps: 1. fork a tree (possibly only a subtree, since forking the whole tree would be inefficient; I am not entirely sure about this, so please let me know if you are); 2. modify nodes in the fork; 3. merge it back into the original tree.
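
  As a sketch of this fork/modify/merge cycle, the yangtools in-memory data tree exposes roughly the following API (class names from org.opendaylight.yangtools.yang.data.api.schema.tree; exact factory methods and signatures vary between releases, so treat this as illustrative rather than copy-paste):

import org.opendaylight.yangtools.yang.data.api.YangInstanceIdentifier;
import org.opendaylight.yangtools.yang.data.api.schema.NormalizedNode;
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTree;
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTreeCandidate;
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTreeModification;
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTreeSnapshot;
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataValidationFailedException;

final class ForkModifyMerge {
    /** Forks a snapshot of the tree, writes one node, and merges the change back in. */
    static void writeNode(DataTree tree, YangInstanceIdentifier path,
                          NormalizedNode<?, ?> data) throws DataValidationFailedException {
        DataTreeSnapshot snapshot = tree.takeSnapshot();        // 1. fork: immutable view of the tree
        DataTreeModification mod = snapshot.newModification();  //    a private copy to edit
        mod.write(path, data);                                  // 2. modify the fork
        mod.ready();                                            //    seal the modification
        tree.validate(mod);                                     //    throws if it conflicts with other commits
        DataTreeCandidate candidate = tree.prepare(mod);        //    compute the would-be new root
        tree.commit(candidate);                                 // 3. merge: atomically swap in the new root
    }
}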

2.1.1 Isolation

  Concurrent reads and writes are isolated from each other because each transaction forks its own copy of the tree and operates on that copy; only at merge time is it decided whether the commit is OK. Here is a simple example of this isolation:

tx1 = broker.newReadWriteTransaction();
tx2 = broker.newReadWriteTransaction();

tx1.read(OPERATIONAL, PATH).get();       // A
tx2.read(OPERATIONAL, PATH).get();       // A

tx2.put(OPERATIONAL, PATH, B);           // write B
tx2.read(OPERATIONAL, PATH).get();       // B

tx1.read(OPERATIONAL, PATH).get();       // A
tx1.put(OPERATIONAL, PATH, C);           // write C
tx1.read(OPERATIONAL, PATH).get();       // C

tx2.read(OPERATIONAL, PATH).get();       // B

tx2.commit().get();                      // commit
tx1.read(OPERATIONAL, PATH).get();       // C
tx1afterCommit = broker.newReadOnlyTransaction();
tx1afterCommit.read(OPERATIONAL, PATH).get(); // B

tx1.commit();                            // fails, because both transactions touched the same path

tx1's commit fails at the end because it operated on the same data as tx2. The ODL source gives examples of which operation combinations fail; when the conflicting node is a leaf, concurrent modifications of the same node cause the later commit to fail.
[Figure: commit conflict example (1)]

[Figure: commit conflict example (2)]

2.1.2 Notification

  The In-Memory DataStore offers three listener scopes: Base means the node itself changed; One means a direct child of the node changed; Subtree means the node or any node in its subtree changed.
  A node can be modified explicitly or implicitly. Explicit modifications are Insert, Replace, and Delete; an implicit modification is a change of the node's version caused by a modification of one of its children. The role of version is explained next.
  version is an attribute of each node used to track changes to the tree: if the version changes, the node itself or some node in its subtree has changed. The version is updated according to the following rules:

  • insert child: child.version = ++parent.version;
  • replace child: child.version = ++parent.version;
  • delete child: ++parent.nodeVersion;

These updates are applied recursively: once a node changes, the version of every node on the path from it up to the root is bumped. Listener events are then triggered by checking whether these version numbers have changed.
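
  For reference, registering a listener with one of these three scopes through the Beryllium binding-aware DataBroker looks roughly like this (a sketch; the path argument is a placeholder, and later releases replace this API with DataTreeChangeListener):

import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.DataChangeListener;
import org.opendaylight.controller.md.sal.common.api.data.AsyncDataBroker.DataChangeScope;
import org.opendaylight.controller.md.sal.common.api.data.AsyncDataChangeEvent;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.yangtools.concepts.ListenerRegistration;
import org.opendaylight.yangtools.yang.binding.DataObject;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

final class ChangeListenerExample {
    /** Watches a subtree of the operational store; the scope can also be BASE or ONE. */
    static ListenerRegistration<DataChangeListener> watch(DataBroker broker, InstanceIdentifier<?> path) {
        DataChangeListener listener = new DataChangeListener() {
            @Override
            public void onDataChanged(AsyncDataChangeEvent<InstanceIdentifier<?>, DataObject> change) {
                // change.getCreatedData() / getUpdatedData() / getRemovedPaths() describe what changed
            }
        };
        return broker.registerDataChangeListener(
                LogicalDatastoreType.OPERATIONAL, path, listener,
                DataChangeScope.SUBTREE);   // BASE = node itself, ONE = direct children, SUBTREE = whole subtree
    }
}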

2.1.3 Two-phase commit

  Two-phase commit was introduced above; how is the protocol applied to our In-Memory DataStore? The explanation below comes from the official documentation; to avoid introducing ambiguities through translation, I quote it verbatim:

The Data Store is a participant in the two-phase commit (as a commit handler).  
1.The Request Commit Phase  
   (1).Reference to the initial state is captured
   (2).Data store creates a new subtree by applying specified operations that affect that node
   (3).Data store captures the set of affected data change listeners with the initial state (reference to the old-subtree) and the new state (reference to the new subtree)
   (4).Data store propagates the new subtree to the parent node and applies atomic operations on the parent node until the root node of the data store is replaced.
2.The Finish Phase  
   (1).Data store replaces the reference to the root element to newly created root element.
   (2).Data store finishes the transaction.
   (3).All captured affected listeners are notified with both the initial state and the new state.

2.1.4 Config Tree & Operational Tree

  Both trees live in memory with the same tree structure. The differences: the Config Tree supports GET/PUT/POST/DELETE, while the Operational Tree only supports POST (I am not sure whether GET is supported); the Config Tree holds global configuration, is persisted to the on-disk DB, and is restored after a restart, while the Operational Tree only holds runtime data and is lost on restart.
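
  A short sketch of how the two trees are addressed from service code (Beryllium binding-aware API; the generic path and data arguments are placeholders): the only difference at the call site is the LogicalDatastoreType.

import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.yangtools.yang.binding.DataObject;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

final class TwoTreesExample {
    /** Writes the same object to both trees; only the config copy survives a restart. */
    static <T extends DataObject> void write(DataBroker broker, InstanceIdentifier<T> path, T data) {
        WriteTransaction tx = broker.newWriteOnlyTransaction();
        tx.put(LogicalDatastoreType.CONFIGURATION, path, data); // Config Tree: persisted to disk
        tx.put(LogicalDatastoreType.OPERATIONAL, path, data);   // Operational Tree: runtime only
        tx.submit();                                            // asynchronous commit
    }
}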

2.1.5 Data Transaction

  The following examples show the data flows for read, write, cancel, and commit operations. They come from the wiki and are fairly self-explanatory, so I will not add much commentary.

2.1.5.1 Data Store Read Operation

[Figure: Data Store read operation sequence]

  1. baDataBroker.readOperationalData(yang.binding.InstanceIdentifier)
  2. connector.readOperationalData(…)
  3. domInstanceIdentifier = mappingService.toDataDom(bindingInstanceIdentifier)
  4. biDataBroker.readOperationalData(domInstanceIdentifier)
  5. domData = dataStore.readOperationalData(domInstanceIdentifier)
  6. domData is returned to caller
  7. domData is returned to caller
  8. baData = mappingService.toDataObject(bindingInstanceIdentifier, domData)
  9. baData is returned to caller
  10. baData is returned to caller
  11. baData is returned to caller
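
  The same read issued through the newer (Beryllium) binding-aware transaction API looks roughly like this; it is a sketch of the equivalent client-side call, not the exact broker chain from the wiki figure:

import com.google.common.base.Optional;
import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.ReadOnlyTransaction;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.controller.md.sal.common.api.data.ReadFailedException;
import org.opendaylight.yangtools.yang.binding.DataObject;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

final class ReadExample {
    /** Reads one node from the operational tree; Optional.absent() means it does not exist. */
    static <T extends DataObject> Optional<T> readOperational(DataBroker broker,
                                                              InstanceIdentifier<T> path)
            throws ReadFailedException {
        ReadOnlyTransaction rtx = broker.newReadOnlyTransaction();
        try {
            return rtx.read(LogicalDatastoreType.OPERATIONAL, path).checkedGet(); // async read, blocking wait
        } finally {
            rtx.close();   // release the transaction's snapshot
        }
    }
}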

2.1.5.2 Create a BA-to-DOM Transaction

[Figure: create a BA-to-DOM transaction sequence]

  1. baDataBroker.beginTransaction()
  2. connector.beginForwardedTransaction()
  3. biDataBroker.beginTransaction()
  4. storeTx = dataStore.beginTransaction() // storeTx contains snapshot
  5. storeTx is returned to caller
  6. domTx = createDomTx(storeTx)
  7. return domTx to caller
  8. baDomTx = createForwardedTransaction(domTx)
  9. Return baDomTx to ForwardedDataBroker
  10. Return baDomTx to client
  11. Return baDomTx to client

2.1.5.3 Read Operation on a Transaction

  A storeTx holds a Snapshot and a Modification, representing the original tree and the modified tree respectively; until the commit, both trees exist side by side.
[Figure: read operation on a transaction sequence]

1. baDomTx.readOperationalData(yang.binding.InstanceIdentifier)
2. // lookup in local cache, assume failure – performance optimization
3. domInstanceIdentifier = mappingService.toDataDom(bindingInstanceIdentifier)
4. domTx.readOperationalData(domInstanceIdentifier)
5. domData = storeTx.readOperationalData(domInstanceIdentifier)
6. domData is returned to caller
7. domData is returned to caller
8. baData = mappingService.toDataObject(bindingInstanceIdentifier, domData)
9. baData is returned to caller
10. baData is returned to caller
11. baData is returned to caller

2.1.5.4 Cancel a Transaction

[Figure: cancel a transaction sequence]

  1. baDomTx.cancel()
  2. baDomTx cleans up all local state (cache of DTO, references to services)
  3. domTx.cancel()
  4. dataBroker.cancel(domTx)
  5. domTx cleans up all local state (cache of DTO, references to services)
  6. storeTx.cancel() // Asynchronous – need to keep track of
  7. storeTx cleans up all local state (modifications, snapshot)
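
  On the binding-aware side, cancelling boils down to one call on the write transaction (a sketch; cancel() only succeeds if the transaction has not been submitted yet):

import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.yangtools.yang.binding.DataObject;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

final class CancelExample {
    /** Starts a write and then discards it; all local state (modifications, snapshot) is dropped. */
    static <T extends DataObject> boolean writeThenCancel(DataBroker broker,
                                                          InstanceIdentifier<T> path, T data) {
        WriteTransaction tx = broker.newWriteOnlyTransaction();
        tx.put(LogicalDatastoreType.OPERATIONAL, path, data);
        return tx.cancel();   // true if the transaction was discarded before being submitted
    }
}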

2.1.5.5 Write Operation on a Transaction

[Figure: write operation on a transaction sequence]

1. baDomTx.putOperationalData(yang.binding.InstanceIdentifier,Node)
2. // Cache cleanup of subtree and parent nodes
3. domII,domData = mappingService.toDataDom(bindingInstanceIdentifier,baData)
4. domTx.putOperationalData(domInstanceIdentifier,domData)
5. storeTx.replaceData(domInstanceIdentifier,domData)
6. storeTx updates modification index

2.1.5.6 Transaction Commit

  Note that step 5 is the first commit of the two-phase commit, step 7 is the cohort's acknowledgement, and step 9 is the second commit. After the first phase the modified tree is merged into the Snapshot; only after the second commit is it merged into the Data Store, since until then a rollback is still possible.
[Figure: transaction commit sequence]

  1. baDomTx.commit() // Submits to commit queue
  2. domTx.commit()
  3. dataBroker.commit(domTx)
  4. Start of two-phase commit – requestCommit on Commit Handlers
  5. storeTx.requestCommit() // Asynchronous – need to keep track of
  6. storeTx creates optimistic snapshot (the merged DB)
  7. dataTx returns goAhead
  8. End of two-phase commit – finish callback on Commit Handlers
  9. storeTx.finish()
  10. dataStore.finish(storeTx)
  11. dataStore replaces snapshot
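
  From the caller's point of view this whole sequence is hidden behind submit(); the returned future is where a failed second phase (e.g. an optimistic-lock conflict like the tx1 example in 2.1.1) surfaces. A sketch, assuming the Beryllium binding-aware API:

import com.google.common.util.concurrent.CheckedFuture;
import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.controller.md.sal.common.api.data.TransactionCommitFailedException;
import org.opendaylight.yangtools.yang.binding.DataObject;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

final class CommitExample {
    /** Submits a write and waits for the outcome of the commit protocol. */
    static <T extends DataObject> boolean writeAndCommit(DataBroker broker,
                                                         InstanceIdentifier<T> path, T data) {
        WriteTransaction tx = broker.newWriteOnlyTransaction();
        tx.put(LogicalDatastoreType.CONFIGURATION, path, data);
        CheckedFuture<Void, TransactionCommitFailedException> future = tx.submit();
        try {
            future.checkedGet();          // blocks until the commit succeeds...
            return true;
        } catch (TransactionCommitFailedException e) {
            return false;                 // ...or fails, e.g. because of a conflicting transaction
        }
    }
}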

2.2 On-Disk DataStore

  Strictly speaking this is not really a DB, because the data on disk exists only for recovery. In the on-disk DB, the in-memory DataStore writes a snapshot and a journal to disk via Akka (persistence) and the RAFT protocol. The snapshot is the serialized Config Tree written to disk when the controller starts; the journal is written each time a Config Tree node in memory is modified, with the change serialized to disk. In other words, the snapshot stores a complete copy of the database and the journal stores the incremental changes, so after a controller restart the complete Config Tree can be rebuilt from these two files. By default they live in the snapshots and journal directories respectively. Writing from memory to disk goes through the Shard/ShardTransaction classes, which receive the user's modifications and write them to the on-disk journal; both are Actors. The Akka-level descriptions of Actor, Shard, and ShardTransaction are quoted further below:
[Figure: two-phase commit]

[Figure: three-phase commit]

  Writing from memory to disk does not use two-phase commit but three-phase commit, as the two figures above show. The latter is a refinement of the former: the biggest drawback of two-phase commit is that the cohorts block, so if the Coordinator dies at the wrong moment all Cohorts wait forever. In memory this is not really a problem, because if the Coordinator dies the BI Broker has died, which means the whole controller is down and the Cohorts are gone with it. Writing to disk, however, has to take the distributed case into account: the Coordinator going down is quite possible (it is just one node failing), and the Cohorts may be on the local machine or on other machines, so letting the other nodes block and wait forever is simply not acceptable. Hence three-phase commit, which adds a timeout mechanism and an extra phase that makes the protocol more robust; its drawback is, as usual, performance.

  • Actor: Akka persistence enables stateful actors to persist their internal state so that it can be recovered when an actor is started, restarted after a JVM crash or by a supervisor, or migrated in a cluster.
  • Shard: Since the Shard is a Processor, in accordance with akka-persistence, it is a special actor which when passed a Persistent message will log it to a journal.
  • ShardTransaction: A ShardTransaction would be an actor which wraps an InMemoryDataStoreTransaction. Any operation that needs to be done on a transaction, namely "read", "write", "delete", and "ready", would be fronted by the ShardTransaction. The ShardTransaction will also maintain the state of any writes/deletes that happen on a transaction. This state will be called the "transactionLog". The transactionLog would then be used during commits to persist a transaction to a journal. The journal will be written onto the disk using the persistence module of Akka. The journal will then be used when a controller starts up to reconstruct the state of a shard.
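
  To make the journal/snapshot mechanics concrete, here is a minimal Akka persistence sketch in the spirit of what the bullets describe; it is a toy actor, not ODL's actual Shard code, and the String events and the snapshot interval are placeholders:

import akka.persistence.SnapshotOffer;
import akka.persistence.UntypedPersistentActor;
import java.util.ArrayList;
import java.util.List;

// A toy persistent actor: every write is appended to the journal, the accumulated state is
// occasionally snapshotted, and on restart Akka replays the snapshot plus the newer journal entries.
public class ToyShard extends UntypedPersistentActor {
    private final List<String> state = new ArrayList<>();   // stands in for the in-memory Config Tree

    @Override
    public String persistenceId() { return "toy-shard"; }

    @Override
    public void onReceiveRecover(Object msg) {
        if (msg instanceof SnapshotOffer) {                  // 1. restore the last snapshot
            state.addAll((List<String>) ((SnapshotOffer) msg).snapshot());
        } else if (msg instanceof String) {                  // 2. replay journaled writes made after it
            state.add((String) msg);
        }
    }

    @Override
    public void onReceiveCommand(Object msg) {
        if (msg instanceof String) {
            persist((String) msg, evt -> {                   // append the write to the journal...
                state.add(evt);                              // ...then apply it to the in-memory state
                if (state.size() % 100 == 0) {
                    saveSnapshot(new ArrayList<>(state));    // periodic full snapshot
                }
            });
        }
    }
}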

  One question remains: when exactly does a user's write operation return? At first I assumed it returned once the data was written to memory, i.e. right after the three-phase commit is kicked off, without waiting for its three steps to succeed or fail. But that does not seem to be the case: the user issues the write, the two-phase commit writes to memory, then the three-phase commit writes to disk, and only then does the call return. I am not completely sure about this; if anyone knows for certain, please let me know, thanks.
  Shard is really the data-shard concept from distributed systems: with clustering enabled, one shard can live on several machines, which makes the three-phase commit above even more expensive because remote writes are involved, but in return you get reliability and protection against node failure. In principle the data of one service lives inside a single shard and must not span multiple shards, although different shards may overlap in the data they hold.

Notes

  There may be things in this article that I misunderstood or failed to understand; if so, please leave a comment and correct me (although the comment plugin has recently become unreachable from behind the GFW).
  Please credit the source when reposting: http://vinllen.com/opendaylight-datastorefen-xi/

References

https://wiki.opendaylight.org/view/OpenDaylight_Controller:Binding-Independent_Components
https://wiki.opendaylight.org/view/OpenDaylight_Controller:MD-SAL:Architecture:Clustering#ThreePhaseCommitCohortProxy
https://wiki.opendaylight.org/view/OpenDaylight_Controller:MD-SAL:Design:Normalized_DOM_Model
https://wiki.opendaylight.org/view/OpenDaylight_Controller:MD-SAL:Architecture:DOM_DataStore
https://wiki.opendaylight.org/view/OpenDaylight_Controller:MD-SAL:Architecture:DOM_DataStore:Transactions
http://events.linuxfoundation.org/sites/events/files/slides/MD-SAL%20Clustering%20Internals.pdf
https://wiki.opendaylight.org/view/OpenDaylight_Controller:MD-SAL:Architecture:Clustering
http://doc.akka.io/docs/akka/current/scala/persistence.html
https://en.wikipedia.org/wiki/Two-phase_commit_protocol
https://en.wikipedia.org/wiki/Three-phase_commit_protocol

