In this post, I continue the research overview started in the last post. Last week I attended the European Business Intelligence Summer School (eBISS 2011) in Paris. Its objective was to give a complete overview of the research and evolution of BI, as seen by best-of-breed researchers and industry players. Here we continue to describe the most important topics that were presented.
Semantic BI [2]: This domain starts from the fact that enterprises will, somehow, need to integrate data coming from the web to complement their internal DWs, for instance to follow trends on social networks or to get more information about a possible merger between companies. In addition, enterprises and governments will also need to publish data in a better way: today Google, Yahoo! and Bing understand semantic annotations (RDFa [1]) for better indexing. The semantic web should be this heaven in which unstructured data can be given a structured meaning and from which we can even infer new information. However, I am personally one of those people who have been disappointed by the semantic web. First, because there are too many complex standards; I strongly think this was one of the causes of the non-adoption of the IMS standard. Secondly, because using unstructured data from the web requires that the data be semantically structured, i.e., linked to an ontology and exposed as an RDF resource. The problem is that interesting information is usually not published with this kind of ontology; the best example is the failure of FOAF [6] in social networks. The SN providers keep their data to themselves, as it is their most valuable asset. This makes me think that, at least for integrating external data, the semantic web will face a difficult adoption curve, but for processing internal data or integrating partner data it clearly has a good chance.
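To make the "exposed as an RDF resource" idea concrete, here is a minimal sketch using Python and rdflib (Jena [12] offers the equivalent in Java). The people and URIs are of course made up; only the FOAF [6] vocabulary is real.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

# Hypothetical internal data exposed as RDF, using the FOAF ontology [6].
g = Graph()
g.bind("foaf", FOAF)

alice = URIRef("http://example.org/people/alice")
bob = URIRef("http://example.org/people/bob")

g.add((alice, RDF.type, FOAF.Person))
g.add((alice, FOAF.name, Literal("Alice")))
g.add((alice, FOAF.knows, bob))
g.add((bob, RDF.type, FOAF.Person))
g.add((bob, FOAF.name, Literal("Bob")))

# Serialize in Turtle: this is the form a search engine or a partner
# could consume, the same way RDFa [1] annotates HTML pages.
print(g.serialize(format="turtle"))
```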
However, I have to admit that new professional players have arrived on the market [2, 3, 4, 5] and that a new application domain has become an important user of the semantic web: publishing and media. It is true that the basic piece of information in a news agency is a text, which is by nature unstructured.
SAP has taken this direction but with a more traditional approach. Instead of a complete semantic triple store, they use internal semantic layers. The idea is to model connections to data sources, queries, variables, projections and graphical report GUIs, and to define semantic interfaces between them, just as you would define interfaces in Semantic Web services. They then have a binding model which binds each layer to another. This approach enables SAP to promote the reusability of layers. Imagine you want to create a new report for global sales in Europe. The system can recommend existing pieces of reports (existing bindings) for sales in West EU, Central EU and East EU, which you can aggregate to build your new report. SAP uses the Jena framework [12].
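I do not know the internals of SAP's binding model, so the following is only a toy illustration of the recommendation idea: report bindings are described as RDF resources annotated with the region they cover, and a SPARQL query retrieves the fragments reusable for a Europe-wide report. The ex: vocabulary and all names are hypothetical.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Hypothetical vocabulary for semantic report layers (not SAP's actual model).
EX = Namespace("http://example.org/bi#")

g = Graph()
g.bind("ex", EX)

# Regions organized in a hierarchy: the EU sub-regions are parts of Europe.
for region in ("WestEU", "CentralEU", "EastEU"):
    g.add((EX[region], EX.partOf, EX.Europe))

# Existing report bindings, each annotated with the region it covers.
for name, region in (("salesWestEU", "WestEU"),
                     ("salesCentralEU", "CentralEU"),
                     ("salesEastEU", "EastEU")):
    binding = EX[name]
    g.add((binding, RDF.type, EX.ReportBinding))
    g.add((binding, EX.measure, Literal("sales")))
    g.add((binding, EX.coversRegion, EX[region]))

# Recommend reusable fragments for a new "sales in Europe" report.
query = """
PREFIX ex: <http://example.org/bi#>
SELECT ?binding WHERE {
    ?binding a ex:ReportBinding ;
             ex:measure "sales" ;
             ex:coversRegion ?region .
    ?region ex:partOf ex:Europe .
}
"""
for row in g.query(query):
    print(row.binding)
```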
Finally, I have to highlight that data mining techniques must be redesigned for the semantic web. Indeed, the nature of the OLAP cube is tightly coupled with the underlying structure of relational tables. If we want to mine data exposed as RDF and OWL, we need to think about graph pattern extraction and new techniques rather than the traditional OLAP and multi-dimensional table approach. The underlying graph nature of RDF must lead to better-adapted extraction techniques.
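As a small example of what graph pattern extraction can mean in practice, the following SPARQL query matches a structural pattern (people connected through an intermediary but not directly) that has no natural cube equivalent. The data is made up.

```python
from rdflib import Graph, URIRef
from rdflib.namespace import FOAF

g = Graph()
people = {n: URIRef(f"http://example.org/people/{n}") for n in ("a", "b", "c")}
g.add((people["a"], FOAF.knows, people["b"]))
g.add((people["b"], FOAF.knows, people["c"]))

# Graph pattern: pairs connected through an intermediary but not directly.
# This kind of structural query has no natural OLAP-cube equivalent.
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?x ?z WHERE {
    ?x foaf:knows ?y .
    ?y foaf:knows ?z .
    FILTER (?x != ?z)
    FILTER NOT EXISTS { ?x foaf:knows ?z }
}
"""
for x, z in g.query(query):
    print(x, z)
```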
Graph BI: There are mainly two approaches in graph BI. The first one is to extract DB content into a graph in order to discover new information [8]. The idea is that with a graph structure it is much easier to navigate the relations and to compute metrics such as the ones used in social network analysis: closeness, betweenness, community detection, and many more mathematical SN graph analytics. In the same vein, whereas traditional BI put the multi-dimensional cube at the centre, the graph extracted from your operational DBs becomes the new focus, as it is the new aggregated form against which you run your queries. A few companies use this approach to find things they could not find as such in the DB. For instance, KXEN [17] extracts relationship graphs from telco CDR (call data record) databases to evaluate the importance of a user and avoid churn propagation, but also for fraud detection, by building a graph from the relation "customer bought from merchant". This graph is used to build a predictive model that is then used with a CEP engine to detect potential fraud.
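For a feel of these metrics, here is a minimal sketch with Python and networkx over a made-up "who calls whom" graph, the shape of what one would extract from CDRs. Note that networkx is exactly the kind of single-machine library whose scalability limits I discuss below.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy "who calls whom" graph, the kind one could extract from telco CDRs.
G = nx.Graph()
G.add_edges_from([("ann", "bob"), ("ann", "eve"), ("bob", "eve"),
                  ("eve", "dan"), ("dan", "joe"), ("joe", "kim"),
                  ("dan", "kim")])

# Closeness: how near a user is to everyone else (influence reach).
print(nx.closeness_centrality(G))

# Betweenness: users bridging communities; losing such a user (churn)
# can cut off whole groups, hence "churn propagation".
print(nx.betweenness_centrality(G))

# Community detection via modularity maximization.
print(list(greedy_modularity_communities(G)))
```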
The second approach is to model the data directly as a graph. The main advantage of this technique is that the design of the meta-model is easier, as it represents the world more naturally: you simply model the relations as they exist.
However, I think graph BI currently has several weak points, such as the query language, the integrity of the data when you add a new information node, and finally scalability. Today, almost every graph DB or algorithm library is designed to work on a single machine. A few research efforts are looking for innovative answers, but they still need time to come up with industry-ready solutions.
BI Requirement engineering [9, 10, 11]: Defining the requirements of a BI project is perhaps the most difficult task. A lot of research focuses on formalizing the requirement phase according to the strategic objectives and the operational configuration. The final goal is to be able to express the objective and let the system build the OLAP cube, the queries, the dashboards and all the other elements needed in the project. In addition, we should be able to change or update the requirements and compute the impact on the entire architecture.
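None of the approaches in [9, 10, 11] are reproduced here, but a toy sketch can illustrate the direction: a declarative requirement from which cube elements are derived, so that updating the requirement and re-deriving shows the impact. Every name below is hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical declarative requirement: the analyst states the objective,
# the system derives the multidimensional design from it.
@dataclass
class Requirement:
    objective: str                      # strategic goal
    measure: str                        # fact to monitor
    dimensions: list = field(default_factory=list)

def derive_cube(req: Requirement) -> dict:
    """Derive a (very naive) cube specification from a requirement."""
    return {
        "fact_table": f"fact_{req.measure}",
        "measures": [req.measure],
        "dimensions": [f"dim_{d}" for d in req.dimensions],
        "default_query": f"SELECT {req.measure} GROUP BY "
                         + ", ".join(req.dimensions),
    }

req = Requirement(objective="increase EU sales by 10%",
                  measure="sales_amount",
                  dimensions=["time", "region", "product"])
print(derive_cube(req))
# Changing req.dimensions and re-running derive_cube shows which
# downstream elements are impacted by a requirement update.
```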
Real-time BI: This topic relates to the ability to take decisions in real time with fresh, up-to-date data. There are basically two main problems: (1) how data are captured and (2) how we can minimize the processing time. Regarding data acquisition, the main focus is on optimizing data freshness. In this case, the improvement resides in real-time ETL and the way we can run parallel, short and frequent extractions.
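As a minimal sketch of "parallel, short and frequent extractions": several sources are polled concurrently in small incremental batches instead of one nightly bulk load. The extract function is a stand-in for a real connector.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SOURCES = ["crm", "erp", "weblogs"]  # hypothetical operational sources

def extract_batch(source: str) -> list:
    """Stand-in for a real connector: fetch only the rows changed
    since the last extraction (short, incremental batch)."""
    return [f"{source}-row-{time.time():.0f}"]

def load(rows: list) -> None:
    print("loading", rows)  # stand-in for writing to the DW staging area

def etl_cycle() -> None:
    # Parallel extraction keeps each batch short and the data fresh.
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        for batch in pool.map(extract_batch, SOURCES):
            load(batch)

# Frequent cycles (e.g., every few seconds) instead of a nightly run.
for _ in range(3):
    etl_cycle()
    time.sleep(1)
```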
I did not see any approach recommending pushing data instead of extracting it. Similarly to Facebook's approach for the Like button, we can receive data from streams [15] and correlate/pre-process it before storing it in a first-stage store. This approach is more of a Push, Transform, Store and Load. It is clearly an option to consider as soon as we talk about real-time data and processing.
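A minimal sketch of this Push, Transform, Store and Load idea: producers push events into a queue, a consumer correlates/pre-processes them and writes the result to a first-stage store. In production this role would be played by a streaming platform such as S4 [15], not an in-process queue.

```python
import queue
import threading

events: "queue.Queue[dict]" = queue.Queue()
first_stage_store: list = []  # stand-in for the first-stage storage

def producer() -> None:
    # Sources push events as they happen (like the Like button).
    for i in range(5):
        events.put({"user": f"u{i % 2}", "action": "like"})
    events.put(None)  # sentinel: end of stream

def consumer() -> None:
    counts: dict = {}
    while (event := events.get()) is not None:
        # Transform/correlate before storing: pre-aggregate per user.
        counts[event["user"]] = counts.get(event["user"], 0) + 1
    first_stage_store.append(counts)  # Store, ready to Load into the DW

t = threading.Thread(target=producer)
t.start()
consumer()
t.join()
print(first_stage_store)  # e.g. [{'u0': 3, 'u1': 2}]
```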
The second issue focuses on processing improvements, for which we can find two main approaches. The first proposes distributed processing frameworks for executing the mining, such as MapReduce or distributed data-flow processing with the notion of data affinity (executing the processing where the data is). Here we fall back on the topic I presented previously under web-scale BI. I have to admit that this architecture is particularly well suited to cloud and elastic environments. This is the typical approach taken by Google, Facebook, Twitter and the internet world in general. The second approach focuses on a single, central big machine and is much closer to HPC. This is the approach taken by SAP with the HANA project [13], in which the in-memory software is highly optimized for the underlying hardware (HP, IBM, Fujitsu), or by Oracle with Exadata [14].
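To make the first approach concrete, here is MapReduce in miniature: map emits key/value pairs, a shuffle groups them by key, and reduce aggregates. In a real framework each map task would run on the node holding its partition, which is exactly the data affinity point.

```python
from collections import defaultdict
from itertools import chain

# Records partitioned across (conceptual) nodes; with data affinity,
# each map task runs on the node that holds its partition.
partitions = [
    [("fr", 120), ("de", 80)],
    [("fr", 30), ("uk", 50)],
]

def map_phase(record):
    country, amount = record
    yield country, amount  # emit key/value pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)  # aggregate sales per country

pairs = chain.from_iterable(map_phase(r) for p in partitions for r in p)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'fr': 150, 'de': 80, 'uk': 50}
```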
BI as a Process: The main idea of this research field is to model the BI activity as a process, but also to consider the complete data flow as a process. Compared to a traditional ETL, OLAP and queries approach, this new view gives a complete overview of the data transformation process and allows it to be fully integrated within a BI BPMN process. The research led in Brussels [16] focuses on the BPMN modelling of the first stage of this process, the ETL. In that case the data transformation is no longer just a small block hiding the operations, but a real data-flow process that can work with a control process. At the end of the day, this kind of research could provide a BI process mashup environment.
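In the spirit of the BPMN modelling of [16], here is a sketch of ETL as an explicit data flow rather than a black box: each operation is a small composable step, and a control step (here, row rejection) can sit between them. The steps themselves are made up.

```python
# Each ETL operation is an explicit, composable step in a data flow,
# instead of being hidden inside one opaque "transform" block.
def extract(rows):
    for row in rows:
        yield row

def clean(rows):
    for row in rows:
        if row.get("amount") is not None:  # control step: reject bad rows
            yield row

def convert_currency(rows, rate=1.1):
    for row in rows:
        yield {**row, "amount_eur": row["amount"] * rate}

def load(rows):
    return list(rows)  # stand-in for writing to the DW

source = [{"amount": 100}, {"amount": None}, {"amount": 40}]

# The pipeline itself is the process model: steps can be rewired,
# monitored, or wrapped by a BPMN control flow.
warehouse = load(convert_currency(clean(extract(source))))
print(warehouse)
```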
References
[1] The RDFa wikipedia page, http://en.wikipedia.org/wiki/RDFa
[2] OWLIM, http://www.ontotext.com
[3] AllegroGraph, http://www.franz.com
[4] Bigdata, http://www.systap.com
[5] Dydra, http://dydra.com
[6] FOAF project, http://www.foaf-project.org/
[7] The Semantic Web: Not a Piece of Cake, http://bnode.org/blog/2009/07/08/the-semantic-web-not-a-piece-of-cake
[8] R. Soussi, Graph Database for Collaborative Communities. In: E. Pardede (Ed.), Community-Built Databases: Research and Development, Springer, 2011. ISBN 978-3-642-19046-9
[9] J. Pardillo, M. Golfarelli, S. Rizzi, J. Trujillo, Visual Modelling of Data Warehousing Flows with UML Profiles. DaWaK 2009: 36-47
[10] J. Pardillo, J.-N. Mazón, J. Trujillo, Model-Driven Metadata for OLAP Cubes from the Conceptual Modelling of Data Warehouses. DaWaK 2008: 13-22
[11] J.-N. Mazón, J. Trujillo, J. Lechtenbörger, Reconciling requirement-driven data warehouses with data sources via multidimensional normal forms. Data Knowl. Eng. 63(3): 725-751 (2007)
[12] The Jena Web Semantic Framework, http://jena.sourceforge.net/
[13] SAP In-Memory appliance, http://www.sap.com/platform/in-memory-computing/index.epx
[14] Oracle Exadata, http://www.oracle.com/us/solutions/ent-performance-bi/index.htm
[15] S4 the distributed streaming platform, http://s4.io/
[16] Z. El Akkaoui, E. Zimányi, Defining ETL worfklows using BPMN and BPEL. DOLAP 2009: 41-48
[17] The KXen web site, http://www.kxen.com/