Today, every company is a data company, and hence every person is a data professional. You could be a credit manager, accountant, salesperson, HR manager, or engineer and still work on data and derive insights. This role, in which any business professional uses data and analytics models to derive insights related to their business domain, is known as a citizen data scientist (CDS). We prefer the term citizen data analyst (CDA), as knowledge workers can apply both art and science to their interactions with data.
According to Gartner, “a citizen data scientist is someone who creates or generates models using advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics” [Idoine, 2018].
But how does a CDS/CDA “power user” leverage data and analytics and derive insights for decision-making? What tools and capabilities will empower the CDA? Fundamentally, a successful CDA will leverage three key elements to deliver insights for improved business performance: quality business data, a robust self-service analytics (SSA) platform, and a strong data and analytics governance process. These three capabilities offer the potential to address increasingly advanced data analytics needs, putting more power in the hands of business users to get critical questions answered on demand. While the CDA capabilities of every organization are different, there are common elements, or solution patterns, that apply to all organizations.
Analytics in Context
First, what are the solution patterns for quality data in analytics? Good insights require the right data, not necessarily more data. The right data in an analytics context has three main characteristics [Southekal, 2020].
Fundamentally, analytics is the use of data to answer the right questions about the future state. The insights derived depend on the response (effect) and explanatory (cause) variables, which are known as features or dimensions. Dimensions provide context for the measures associated with the business process, such as price, quantity, and cycle time.
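The relationship between dimensions and measures can be sketched in a few lines of Python. This is only an illustration: the records, the field names (`region`, `product`, `price`, `quantity`), and the `revenue_by` helper are all hypothetical and do not come from any particular platform.

```python
from collections import defaultdict

# Hypothetical order records: "region" and "product" are dimensions,
# while "price" and "quantity" are measures.
orders = [
    {"region": "West", "product": "Widget", "price": 10.0, "quantity": 3},
    {"region": "West", "product": "Gadget", "price": 25.0, "quantity": 1},
    {"region": "East", "product": "Widget", "price": 10.0, "quantity": 5},
]


def revenue_by(dimension, rows):
    """Sum the revenue measure (price * quantity) for each value of a dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row["price"] * row["quantity"]
    return dict(totals)


print(revenue_by("region", orders))
print(revenue_by("product", orders))
```

The same measure yields different insights depending on which dimension supplies the context, which is why choosing the right dimensions matters more than collecting more rows.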
Up to 80% of the data captured in business enterprises is unstructured data. Examples include documents, video, audio, and images. Unstructured data is of little value to analytics algorithms because it does not have a predefined data model, which is required for analysis and data processing.
Business processes inherently have some degree of variation, and this variation is reflected in the data captured. Variation in data makes it difficult for the analytics algorithms to make timely and accurate predictions.
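One common way to quantify such variation is the coefficient of variation (standard deviation divided by the mean). The sketch below uses it as an assumption for illustration; the source does not prescribe a specific metric, and the cycle-time samples are made up.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean


# Hypothetical cycle times (in days) for two business processes.
stable_cycle_days = [5, 5, 6, 5, 4]
erratic_cycle_days = [2, 9, 5, 14, 1]

cv_stable = coefficient_of_variation(stable_cycle_days)
cv_erratic = coefficient_of_variation(erratic_cycle_days)

print(f"stable process CV:  {cv_stable:.2f}")
print(f"erratic process CV: {cv_erratic:.2f}")
```

A process with a higher coefficient of variation produces noisier data, so predictions built on it are likely to be less reliable.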
The Self-Service Analytics Platform
The second component in enabling the CDA is the self-service analytics (SSA) platform. The SSA platform enables business professionals to perform queries and generate insights with minimal IT support. A robust SSA platform in the analytics and CDA context should support the following key features.
An analytics platform is only as valuable as the data that is available to it. Hence the SSA platform should connect easily to existing data sources, whether the canonical database (like the data warehouse) or the systems of record (like ERP or CRM). Regardless of whether the data sources are on-premises, in the cloud, or in a hybrid cloud, the SSA platform should support easy management of data indexes (for efficient searching), data loads, and data refreshes.
Data Quality and Freshness
Getting useful and accurate insights depends on the quality and freshness of the data. Both of these are threatened if data lives in impenetrable silos. Without good data quality, the veracity of insights is threatened. Without data being reasonably fresh, we may be making assumptions about the state of the world based on an older version of that world.
Performance, Scale, and Concurrency
An SSA platform that is unusable because of slow response times and frozen dashboards is not viable. A true CDA wants to use data to follow, and prove or disprove, their insights and intuitions about the world they are analyzing. The CDA should be able to explore data quickly and retrieve the piece of data they want.
SSA does not mean less or no security; governance is a key prerequisite for successful SSA and CDA. SSA platforms should support authentication of the CDA through identity management (IDM) solutions and role-based access control (RBAC) to ensure that access to sensitive data, such as cardholder data covered by PCI DSS (Payment Card Industry Data Security Standard) and PII (Personally Identifiable Information), is controlled and governed.
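The idea behind role-based access control can be sketched as a mapping from roles to the columns each role may see. The role names, column names, and helper functions below are hypothetical and are not taken from any specific IDM or SSA product.

```python
# Hypothetical role-to-column mapping; names are illustrative only.
ROLE_PERMISSIONS = {
    "sales_analyst": {"order_id", "region", "revenue"},
    "finance_admin": {"order_id", "region", "revenue", "card_number", "customer_email"},
}


def visible_columns(role, requested):
    """Return only the requested columns that the role is allowed to read."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    return [column for column in requested if column in allowed]


def redact_row(role, row):
    """Mask every value in a row that the role is not allowed to read."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    return {key: (value if key in allowed else "***") for key, value in row.items()}


row = {"order_id": 1, "region": "West", "revenue": 55.0, "card_number": "4111-0000"}
print(visible_columns("sales_analyst", ["region", "card_number"]))
print(redact_row("sales_analyst", row))
```

An unknown role falls back to an empty permission set, so access is denied by default rather than granted by default.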
Analytics solutions depend on acquiring data from diverse systems. Because the definitions of data elements vary across these systems, there is often a pressing need for a semantic, or meaningful, representation of the data. Semantic models depict the relationships that exist among specific values of data [Luisi, 2014]. Hence the SSA platform should help the CDA leverage a centralized semantic model to establish a single source of truth (SoT) for generating accurate and timely insights.
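At its simplest, a centralized semantic model maps each source system's field names onto one shared vocabulary. The following is a minimal sketch under that assumption; the source names (`erp`, `crm`), the field names, and the `to_canonical` helper are invented for illustration.

```python
# Hypothetical mapping from source-specific field names to canonical names.
SEMANTIC_MODEL = {
    "erp": {"cust_no": "customer_id", "net_amt": "revenue"},
    "crm": {"CustomerId": "customer_id", "DealValue": "revenue"},
}


def to_canonical(source, record):
    """Rename the mapped fields of a source record to their canonical names."""
    mapping = SEMANTIC_MODEL[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}


erp_record = {"cust_no": "C-17", "net_amt": 120.0, "internal_flag": "x"}
crm_record = {"CustomerId": "C-17", "DealValue": 80.0}

print(to_canonical("erp", erp_record))
print(to_canonical("crm", crm_record))
```

Once both records use the same names, they can be combined or compared without each analyst reinterpreting what "revenue" means in each system.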
The SSA platform should have an extensive library of time-tested analytics algorithms including access to open-source libraries such as TensorFlow, Keras, scikit-learn, and more. This will make it easy for the CDA to reuse existing analytics algorithms instead of building their own solutions from scratch.
Lastly, the CDA will not be empowered without the right governance processes. While there is no denying that CDAs are powerful, it is just as important to recognize that CDA enablement needs to be managed with a strong governance framework. The governance framework should cover identifying data ownership, evaluating roles, training on data literacy, optimizing queries, pre-computing results, flagging unused reports and dashboards, monitoring system performance, and other regulatory and data management activities.
So, what is the solution that brings together all three components (quality business data, a robust SSA platform, and a strong data and analytics governance process) to enable a successful CDA? Kyligence, powered by Apache Kylin, provides a holistic analytics platform by securely integrating data from various data sources to create a clean (right data), integrated, and semantic database for the CDA to derive powerful insights in near-real time. Kyligence also accelerates the productivity of the CDA by automating data discovery and data integration, and by offering low-code/no-code analytics libraries for seamless and secure insight generation.
Kyligence and the Citizen Data Analyst
Kyligence is a great enabler of citizen data analysts. Here is a brief summary of its advantages in creating an SSA that enables greater adoption and success with large-scale analytics.
Kyligence streamlines data acquisition and enables multi-cloud deployment by supporting leading data platforms such as Hadoop, RDBMSs, data warehouses, and data lakes.
Kyligence can serve high-quality data from data platforms as well as data from real-time streaming platforms like Kafka to enable a hybrid analytics model that includes both batch and real-time data sources. With the Unified Semantic Layer, a CDA can get standard definitions of dimensions and metrics to get a single source of truth of data.
Performance, Scale, and Concurrency
With the combination of Apache Kylin (distributed cubing) and ClickHouse (MPP), Kyligence delivers high performance for the vast majority of analytical queries, detailed queries, and ad hoc exploration. CDAs can do quick data exploration even on extremely large datasets.
Kyligence provides cell-level security to control access to data in the backend while keeping it transparent to users. Kyligence also provides identity and access management services that integrate with user management systems like LDAP and Active Directory, as well as role-based access control to secure collaboration.
Kyligence features a Unified Semantic Layer to create a uniform semantic model for different BI teams. When the schema of a data source changes, the data model in Kyligence evolves adaptively to preserve the structure and consistency of the data and of the upper-layer applications.
Kyligence provides standard ANSI SQL and XMLA/MDX interfaces, which can be easily integrated with existing analytics tools like Tableau, Excel, and Power BI, and with data science languages like Python or Scala to build an end-to-end machine learning pipeline with TensorFlow, scikit-learn, and more.
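The benefit of a standard SQL interface is that the same query pattern works from any client language. The sketch below shows that pattern in Python, using the standard library's in-memory `sqlite3` database purely as a stand-in for a SQL endpoint. It does not show Kyligence's own driver or connection settings, and the `sales` table and its columns are hypothetical.

```python
import sqlite3

# sqlite3 is only a stand-in for a SQL endpoint; a real deployment would
# connect through the platform's own driver, which is not shown here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("West", 55.0), ("East", 50.0), ("West", 20.0)],
)

# A plain aggregate query: total revenue (measure) per region (dimension).
query = (
    "SELECT region, SUM(revenue) AS total_revenue "
    "FROM sales GROUP BY region ORDER BY region"
)
rows = conn.execute(query).fetchall()
conn.close()

for region, total_revenue in rows:
    print(region, total_revenue)
```

The resulting rows could then be handed to a library such as scikit-learn as model features, which is the kind of pipeline the paragraph above describes.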
With the Kyligence semantic layer, each data model is a governed data mart, which automates and simplifies governance operations such as auditing and rating.
Auditing - Since data models are the basic units that users work with, the admin can easily track the usage of each data model.
Rating - Kyligence admins can see the storage size of each model and compare usage against storage for each piece of data, identifying hot data models, which are often the most valuable.
Data Lifecycle - It is easy to manage the lifecycle operations for data models, including creation, ingestion, refresh, and merge.
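The usage-versus-storage comparison behind rating can be sketched as a simple ratio. This is an illustrative calculation only: the model names, the statistics, and the queries-per-gigabyte metric are assumptions, not the platform's actual rating method.

```python
# Hypothetical usage and storage statistics for three data models.
models = [
    {"name": "sales_model", "queries_last_30d": 1200, "storage_gb": 40},
    {"name": "hr_model", "queries_last_30d": 15, "storage_gb": 60},
    {"name": "legacy_model", "queries_last_30d": 0, "storage_gb": 25},
]


def rate_models(stats):
    """Rank models by queries served per gigabyte stored, hottest first."""
    rated = [
        dict(m, queries_per_gb=m["queries_last_30d"] / m["storage_gb"])
        for m in stats
    ]
    return sorted(rated, key=lambda m: m["queries_per_gb"], reverse=True)


def unused_models(stats):
    """List models that served no queries in the observation window."""
    return [m["name"] for m in stats if m["queries_last_30d"] == 0]


for model in rate_models(models):
    print(model["name"], round(model["queries_per_gb"], 2))
print("candidates for retirement:", unused_models(models))
```

Models at the top of the ranking are the hot ones worth protecting, while models with no queries are candidates for review before they consume more storage.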
The built-in AI-augmented engine will recommend more valuable indexes to build, and detect useless indexes to remove, to reduce the cost of storage and computing resources.
In today's digital and data-centric economy, analytics is a key enabler that transforms data into a business asset by providing the insights for sound decision-making. Sadly, most analytics projects have relied on centralized data science teams to deliver business insights, and the result is that over 80% of analytics programs have failed to deliver business benefits [Miranda, 2018]. This approach has not only delayed the consumption of insights but has also increased the cost of transforming insights into appropriate business actions. The future of deriving value from data and analytics is to empower the CDA, as doing so will reduce cycle time, save costs, and improve customer service for organizations. However, the CDA “power user” has to be positioned for success, and such positioning requires quality data, a strong governance process, and an easy-to-use SSA platform like Kyligence.
- Idoine, Carlie, “Citizen Data Scientists and Why They Matter”, https://blogs.gartner.com/carlie-idoine/2018/05/13/citizen-data-scientists-and-why-they-matter/, 2018
- Luisi, James, “Pragmatic Enterprise Architecture”, Morgan Kaufmann, 2014
- Miranda, Gloria Macías-Lizaso, “Building an effective analytics organization”, https://www.mckinsey.com/industries/financial-services/our-insights/building-an-effective-analytics-organization, 2018
- Southekal, Prashanth, “Analytics Best Practices”, Technics Publications, 2020
Dr. Prashanth Southekal is the Managing Principal of DBP-Institute (www.dbp-institute.com), a Data and Analytics consulting and education firm. He has consulted for over 75 organizations including P&G, GE, Shell, Apple, and SAP. Dr. Southekal is the author of two books, Data for Business Performance and Analytics Best Practices, and writes regularly on Data, Analytics, and Machine Learning for Forbes.com and CFO University. Apart from his consulting pursuits, he has trained over 2,500 professionals worldwide in Data and Analytics. He is also an Adjunct Professor of Data Analytics at the University of Calgary (Calgary, Canada) and IE Business School (Madrid, Spain). He has a Ph.D. from ESC Lille (FR) and an MBA from Kellogg School of Management (US).
Dong Li is the Founding Member and Senior Director of Product and Innovation at Kyligence, an Apache Kylin Core Developer (Committer), and member of the Project Management Committee (PMC) where he focuses on big data technology development. Previously, he was a Senior Engineer in eBay’s Global Analytics Infrastructure Department, a Software Development Engineer for Microsoft Cloud Computing and Enterprise Products, and a core member of the Microsoft Business Products Dynamics Asia Pacific team where he participated in the development of a new generation of cloud-based ERP solutions.