
Sergey's BI Space


Sergey Sheinblum

June 03

SAP goes with Eclipse open source

Amazon works with Eclipse, and now SAP is trying to put dedicated resources into further development of Eclipse open source. It will be interesting to hear how MS responds to BI/DW open source initiatives...

May 20

Vendors for Large Scale Analytics - part 2

Mr. Wayne Eckerson, director of TDWI (The Data Warehousing Institute) Research, published an article back in October 2008,
Beyond Reporting: Requirements for Large Scale Analytics.
In addition to my blog post from October 2008, please find below some other vendors that provide solutions to help scale processing of large volumes of data and build DW repositories to query the data:
 " Types of Analytical Platforms :
The most innovative sector of the business intelligence industry has been among database vendors, both new and old, that have shipped almost two dozen new products in the past year designed to accelerate query performance on large volumes of data.

Here is a high-level categorization of these products.
MPP Analytic Databases—Specialized, stand-alone databases designed to run on MPP hardware and accelerate query performance. Examples: Aster nCluster, DATAllegro (now owned by Microsoft), Greenplum 3.2, IBM DB2, Kognitio WX2, Teradata 12.0
Data Warehouse Appliances—A purpose-built machine with preconfigured MPP hardware and software designed for analytical processing. Examples: Dataupia Satori Server; Kickfire Analytic Appliance; Hewlett Packard NeoView; IBM InfoSphere Balanced Warehouse; Netezza Performance Server; Oracle Optimized Warehouse (with various hardware vendors); Teradata 550, 2550, and 5550 machines; Greenplum; and Sun’s Data Warehousing Appliance
Columnar Databases—Store data in columns instead of rows, allowing greater compression and faster query performance. Examples: InfoBright Data Warehouse, ParAccel, Sybase IQ, Vertica
Complex Event Processing Systems—A system that captures and analyzes real-time streaming data. Examples: Cognos Now! SeeWhy, Streambase, Syndera, Truviso "
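
To make the columnar idea from the list above a bit more concrete, here is a minimal sketch in Python. The sample table and the run-length encoding helper are my own illustrative assumptions; this is not how any of the listed products is actually implemented.

```python
from itertools import groupby

# Hypothetical sample table: (user_id, country, clicks).
rows = [
    (1, "US", 3),
    (2, "US", 5),
    (3, "US", 1),
    (4, "DE", 7),
    (5, "DE", 2),
]

# Row store: each record is kept together.
row_store = rows

# Column store: the values of each column are kept together.
column_store = {
    "user_id": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "clicks": [r[2] for r in rows],
}


def run_length_encode(values):
    """Compress runs of repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# A column of similar values compresses well.
encoded_country = run_length_encode(column_store["country"])

# An aggregate query reads only the one column it needs.
total_clicks = sum(column_store["clicks"])
```

In the row layout, the same aggregate would have to read every field of every record just to reach the click counts.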



I'd be very glad to hear that parallel execution of queries has been SOLVED by such-and-such an algorithm, so that the industry could move on to building vendor products that scale to large data volumes (petabytes of data) EASILY...
But so far there are multiple vendors, multiple choices, multiple decisions... Data architects need to be very creative and flexible in implementing any of the existing custom solutions to scale data processing and optimize queries that produce DW answers for monitoring, ROI, campaign/sales management, financial reporting for periodic analytics, user behavior, fraud analysis, forecasting, etc.
Or maybe dig out object-oriented databases and build QBE (Query by Example) DW repositories/applications? Or in-memory DBs/repositories? Architects will decide what 'cocktail' of technologies to use, based on the client's needs.


May 12

Why Isn’t Predictive Analytics a Big Thing By Eugene Asahara

Many companies would like to have forecasting as part of their DW/data mining efforts, but very few can define what data mining is
and which tools can help with building analytics and predictive analytics as part of TARGETing and ROI efforts.
Eugene Asahara, whom I had a chance to work with indirectly back in 1999, has posted a very good article about MS data mining products and a possible DM architecture.



November 30

Internet companies and Data Crisis

After researching start-up companies in the Bay Area that are developing social networks, ad serving, web usage analytics, targeting and SEO, and Web 2.0 (SaaS) applications, I've decided to write down some comments that, in my opinion, define a situation I call the 'Data Crisis'.
Start-up companies do not pay much attention to modeling the abstract concept of the business. Why?
The first goal is to grow the user base, i.e. to build quick interfaces that attract as many users and as many pages as possible, especially if the company makes its $ on a CPM advertising model: more web pages, more potential 'ad displays'.
During this growth of the user base, the requirements for data accuracy, a single point of truth, uniqueness of records, and overall data processing and data modeling are completely neglected.
That leads to redundancy, inaccurate data, duplicates, ineffective data processing, wrong data solutions (quick, not well thought out, and not prepared to be enhanced by ADDING features without 'redeveloping'), inability to organize normalized data processing, inability to make accurate analysis to increase ROI, etc.
Cost and time to market drive 'dirty' development and a complete lack of any normal form of data representation.
At best, the data models support UI functions, which leads to highly de-normalized data flows and redundant data stores.
The next step in the game is that the company cannot enhance existing features, only ADD features.
That, in turn, leads to additional de-normalization.
The next step is that the company starts to ask questions: What if? Which feature is more productive for the business? Who are the users? What are some users doing better than others? How can we group and create categories for products and for users? How can we better target marketing campaigns to increase ROI? Etc.
Basically, deep analysis needs to be done using the data that has been collected.
In the best scenario, the history of records has been preserved, but in most cases it has not.
Therefore, starting the 'task' of collecting history in order to do analysis based on time series (periods) drives the need for Master Data (uniquely identified records for the major conceptual entities of the business). In other words, the data model
needs to be built in a highly normalized way.
The question arises:
whether to create additional data repositories to extract Master Data and develop solutions for LOOKUP, data staging, data cleansing, data mapping, data matching, and data formatting,
OR
just have temporary solutions that run certain queries against denormalized data repositories (mostly in MySQL, Oracle, SQL Server) in order to extract data as one-time snapshots?
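
As a rough illustration of the first option, here is a minimal Python sketch of cleansing, matching, and lookup for Master Data. The record layout, the matching rule (a normalized email address), and all names are hypothetical assumptions for the example; real matching rules are usually much richer.

```python
def normalize_email(raw):
    """Cleansing step (assumed rule): trim whitespace and lowercase."""
    return raw.strip().lower()


def build_master_data(source_records):
    """Assign one master ID per distinct normalized email.

    Returns (master, lookup):
      master maps master_id -> cleansed master record,
      lookup maps each source record id -> master_id.
    """
    master = {}
    lookup = {}
    key_to_master_id = {}
    for record in source_records:
        key = normalize_email(record["email"])
        if key not in key_to_master_id:
            master_id = len(master) + 1
            key_to_master_id[key] = master_id
            master[master_id] = {"email": key, "name": record["name"].strip()}
        lookup[record["id"]] = key_to_master_id[key]
    return master, lookup


# Hypothetical rows, as two UI-driven tables might have captured them.
source_records = [
    {"id": "signup-1", "name": "Ann Lee", "email": "Ann@Example.com "},
    {"id": "billing-7", "name": "Ann Lee", "email": "ann@example.com"},
    {"id": "signup-2", "name": "Bob Ray", "email": "bob@example.com"},
]

master, lookup = build_master_data(source_records)
```

The LOOKUP table is what lets later analysis treat "signup-1" and "billing-7" as the same user instead of two.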

Most likely, the next step is that the company continues operations by hiring more and more developers, then contractors, and finally outsourcing the projects to India, China, or Russia. If no VC funds can be secured to continue supporting the applications, then most likely it will slowly bleed $ and soon go 'belly up', as we say.
Why? Because the cost of supporting such a 'Data Crisis' (in other words, a data mess) becomes so high, and the work so unproductive, that
customers will not be satisfied with the services or results and will move to better sites: the competitors.
So what is the Data Crisis, by my definition?
A lack of a normalized data model; a lack of understanding and visibility of the conceptual data representation (entity model) of the abstract model of doing business; a lack of common/reusable modules; high inaccuracy of data (>35%); inability to separate (logically and physically) transaction data and data processes from aggregated data (analytics, OLAP, BI, DW); redundancy of data processing and data repositories; etc.
Some startups got lucky and were sold to large corporations at such times, but most companies discontinue operations...

This situation is happening first to Internet social networks, and it will happen to businesses that operate very large databases, in different verticals or industries at different times... but it is here now.
Take a look at social networking, for example: huge potential to collect data and make it work for improving sales and for analyzing users and their behavior on the Internet (besides porn and the stock market), etc.
But most social networks survive on AdSense alone: display advertising by Google.
That is definitely a sufficient revenue model if you have millions of views, but it leaves no room for new business models based on analytics and research on the collected data.
Why? Because they can't process data and store it in a normal form of the conceptual business model.
Companies don't pay attention to data processing, data modeling, and data architecture until the data is so large that it becomes inoperable.
I have researched 169 companies over the last half year in the Bay Area and San Francisco.
I talked to, chatted with, or met 27 start-ups, asking questions, researching, and going through two-way interviews, trying to find out which companies paid attention to the data platform and architected back-end operations in such a way that the business can grow, data can be scaled, and data repositories with aggregates are 'ready' to provide data analysis.
Two of the companies I researched, Quantcast and Zvents, went in the right direction (my personal opinion).
They have started to build CUSTOM solutions for data scaling following Google's direction: the MapReduce mechanism for parallel ETL and a DFS (Hadoop or Kosmos) to store data.
These companies clearly understand the need to develop a DATA PLATFORM 'in-house' to scale, store, and query data with volumes in the range of 1 billion+ transactions a day.
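
For readers who have not seen the MapReduce mechanism, here is a minimal single-machine sketch in Python. It is only an illustration of the map, shuffle, and reduce steps under my own assumptions (a made-up log format, a thread pool standing in for a cluster); it is not the code of Hadoop, Kosmos, or of either company.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(log_line):
    """Map: turn one raw log line into a (page, 1) pair.

    Assumed line format: user_id,page,timestamp
    """
    _user_id, page, _timestamp = log_line.split(",")
    return (page, 1)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(item):
    """Reduce: collapse the values of one key into a single total."""
    key, values = item
    return (key, sum(values))


def run_job(log_lines, workers=2):
    """Run map and reduce in parallel; the pool stands in for cluster nodes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, log_lines))
        grouped = shuffle(mapped)
        reduced = list(pool.map(reduce_phase, grouped.items()))
    return dict(reduced)


# Hypothetical raw transactions.
raw_log = [
    "u1,/home,2008-11-30T10:00",
    "u2,/home,2008-11-30T10:01",
    "u1,/search,2008-11-30T10:02",
]
page_views = run_job(raw_log)
```

Because each map call and each reduce call is independent, the same job can be spread over many machines, with the DFS holding the input and output files.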
Why only 2 of 169?!!!
I guess because of the desire to make it 'quick' and sell the business.
I talked to one very popular company that runs apps on Facebook, owned by former PayPal guys.
The hundreds of millions of users can't be 'converted' into hundreds of millions of $.
Why?
For me it is clear: a lack of a custom-built data platform, and no possibility of building aggregation repositories to do time series analysis and improve marketing, the decision-making process on features to be added or discontinued, etc.
The VP of marketing mentioned: "Why build a data warehouse [data platform]? We can ad-hoc the data. eBay (PayPal) did try to build a DW but ended up with 2 people using it."
Actually, it does indicate a 'data crisis' in a certain project at eBay, as a large corporation.
It shows the lack of a data platform in which data can be integrated, records can be uniquely identified, data can be scaled in raw format (as it is captured), etc. Therefore, in the end, there are only 2 DW customers/users...
It proves again and again that neglect of data models, master data, and normalized data processing leads to an inability to analyze data and, therefore, an inability to build a DW to monitor and report on what is going on with the business and how to improve ROI.
 
Unfortunately, I hear 'ad-hoc' ideas quite often, because of the lack of data platforms and the lack of technologies developed in-house to query aggregated repositories.
'Let's build queries and then we'll see.'
That, again, is a 'Data Crisis' too.
Instead of investing in a data platform that can scale and grow with data volumes, the company just allocates some $ to ad-hoc database software engineers to 'twist the data queries'.
It will grow like a snowball, in resources and in 'spaghetti' of code and data... In 3 years, I doubt that anybody from management or the owners will be able to get an accurate picture of how to improve the business and get more revenue...
So will the company go with ad-hoc queries on growing terabytes of data? Good luck :)
Most likely, first the time period for analysis will be reduced, then vertical partitioning will help, and in the end cross-functional analysis queries just won't be possible to run. The result is very limited visibility into what is going on in the business, not to mention the lack of any
analytical models for forecasting, profiling, etc.
That gives an example of how a 'Data Crisis' situation can also be created 'by design', as a result of a lack of technical vision or an incorrect interpretation of a 'bad' experience.
I'd like to give a simple example of the components that are supposed to be part of the architecture. Most start-up companies have neglected
the abstract principles of architecture and have therefore been paying a high price for it: basically, in my words, they are in a mess of data and data operations that is defined here as a 'data crisis'...

The following is a simplified architecture approach to building web-based applications.
Let's say it is for a marketing/advertising business on the Internet or for hosting SaaS applications.
Let's define the simple bricks (components) of the software foundation that need to be architected into a software application:
1. Front end (browser/client components)
2. Middle-tier business components (server side, to support the UI front-end components)
3. Application server components (business rules execution, data connections, data integration, data collection, queries, data returned by queries, data manipulation to serve the front end or the back end (ODS))
4. Back-end components (data repositories to process and organize (model) data for business needs for TRANSACTIONAL processing)
5. DW/BI and analysis/data mining components (aggregated data to provide business analysis and reporting)
I've simplified the definitions a little bit.
The point is that 4 of the 5 SYSTEM COMPONENTS represent the data repositories and data processing parts of any Internet advertising business.
Therefore, architecting the front-end (UI) component and not architecting/modeling data for the four other business software layers is a huge mistake.
But unfortunately, in most startups the cost and other factors such as
time-to-market requirements
frequent change of UI
frequent change of features
lack of an architect role/position as a 'gatekeeper'
lack of resources to develop and test
etc.
lead to simplifying the architecture to just front-end and back-end components.

Therefore, data is denormalized based on UI functions, and all data processing is partitioned based on
UI functionality. Data processing, data analysis, cross-functional reporting, data mining, etc. become very
challenging, if not impossible, tasks. The lack of an abstract normalized data model (sometimes called master data for the business) really is the 'data crisis'.
This is the #1 problem:
a lack of software architecture based on an abstract business model and an abstract normalized data model.

On the data side, this problem brings an inability:
to optimize data processing
to speed up transaction processing
to normalize and optimize data storage repositories (DBs or filers) for OLTP
to model data (star schema) for analysis and data mining (forecasting) as well as fraud protection analysis
to integrate external data sources
to modify and enhance
to do cross-functional reporting
to monitor data and business performance
etc.
It brings a snowballing cost of support, not to mention the inability to grow the business by adding 'new features'...
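
As an illustration of the star schema mentioned in the list above, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and sample rows are hypothetical assumptions chosen for an ad business; a real model would have more dimensions and a proper date dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per uniquely identified entity.
    CREATE TABLE dim_user (user_key INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE dim_ad   (ad_key INTEGER PRIMARY KEY, campaign TEXT);

    -- Fact table: one row per raw user action, keyed to the dimensions.
    CREATE TABLE fact_impression (
        user_key INTEGER REFERENCES dim_user(user_key),
        ad_key   INTEGER REFERENCES dim_ad(ad_key),
        day      TEXT,
        clicks   INTEGER
    );

    -- Hypothetical sample data.
    INSERT INTO dim_user VALUES (1, 'US'), (2, 'DE');
    INSERT INTO dim_ad   VALUES (10, 'spring'), (11, 'summer');
    INSERT INTO fact_impression VALUES
        (1, 10, '2008-11-01', 1),
        (1, 11, '2008-11-01', 0),
        (2, 10, '2008-11-02', 1),
        (2, 10, '2008-11-02', 1);
""")

# Cross-functional report: clicks by user country and ad campaign.
clicks_by_country_campaign = conn.execute("""
    SELECT u.country, a.campaign, SUM(f.clicks)
    FROM fact_impression f
    JOIN dim_user u ON u.user_key = f.user_key
    JOIN dim_ad   a ON a.ad_key   = f.ad_key
    GROUP BY u.country, a.campaign
    ORDER BY u.country, a.campaign
""").fetchall()
```

The report works only because users and ads are uniquely identified in one place; with data partitioned by UI function, the two joins have nothing reliable to join on.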

Yes, the lack of data architecture and data modeling by design is the 'data crisis'.
 
The #2 problem that can define a DATA crisis is the lack of technologies to scale large volumes of data.
When a business is growing, larger data volumes need to be processed and stored. That requires 'special treatment' from the architect: coming up with a platform that can scale data and at the same time meet the performance requirements for querying data repositories.
The issue is that each of the software components needs to be 'ready', i.e. needs to be architected for scale and fast query execution.
In most start-up companies that is not the case, for many reasons, some of them defined above.
The #3 problem is the inability to foresee, or to accept and deal with, Problems #1 and #2.

I have rarely had the luxury of stepping into a company when development was starting from scratch.
Most of the time I have served as a 'fireman' for companies that are in a 'data crisis' situation.

Cover up or fix the 'data crisis'?
Hire contractors to build hardcoded solutions?
Hire contractors to blame for failure?
Fire/hire full-time employees?
Restructure/fire/hire managers?
Start redevelopment as a NEW project by adding a new data platform development group?
Hire more managers and developers for permanent positions to keep supporting the snowball of problems?
Start new development and have a strategic plan to move the 'old' business flows into the NEW one step by step?

Restructure the groups and set up ownership for certain features/applications/systems?
Outsource to India or China to reduce cost?
Outsource support and start new development by hiring or retraining resources?
Start building a data platform/technology inside the company to secure the next 4-7 years of business growth?

All of the above has been happening in the industry...
What are the tradeoffs or temporary solutions in such a 'Data Crisis'?

Some companies come up with an architecture that pre-aggregates data to reduce volumes, thereby
losing the raw data that would be needed for analysis and a clear understanding of business success.
It also causes an inability to compete with other companies by adding new features based on analysis of raw data.
Examples are:
'targeting' of ads based on user behavior
forecasting
monitoring
etc.
If no raw data has been saved in the data repositories, no analysis can be done and no 'targeting' feature can be applied to get more $, improve customer satisfaction, etc.
Pre-aggregated data creates lots of challenges, and adds the cost of keeping track of data relationships per transaction (user action).
For example, when data is pre-aggregated at the account (advertiser) level to calculate the amount of money left to continue running
ads with the publisher, then analysis of what/who/where (geo/demo) viewed the ad will not be possible, as the raw data per user action (transaction) has been aggregated to the account level,
which is a higher hierarchy level than the ad (company --> account --> campaign --> order --> order item --> ad).
etc.
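
The account-level example can be sketched in a few lines of Python. The event fields and values below are hypothetical assumptions; the point is only that a total kept at the account level cannot be broken back down by ad or geography, while raw events can be aggregated at any level later.

```python
from collections import Counter

# Hypothetical raw data: one row per user action (transaction).
raw_events = [
    {"account": "acme", "ad": "ad-1", "geo": "US", "cost": 2},
    {"account": "acme", "ad": "ad-1", "geo": "DE", "cost": 2},
    {"account": "acme", "ad": "ad-2", "geo": "US", "cost": 3},
]


def aggregate(events, level):
    """Sum cost per combination of the fields listed in `level`."""
    totals = Counter()
    for event in events:
        key = tuple(event[field] for field in level)
        totals[key] += event["cost"]
    return dict(totals)


# What a pre-aggregating design keeps: one number per account.
spend_by_account = aggregate(raw_events, ["account"])

# What is still possible only if the raw events were saved.
spend_by_ad_and_geo = aggregate(raw_events, ["ad", "geo"])
```

Going from `raw_events` to `spend_by_account` is easy; going from `spend_by_account` back to `spend_by_ad_and_geo` is impossible, which is exactly the lost 'targeting' analysis.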
The lack of business data model hierarchies and of a normalized data model will cost lots of money in the long run...
For some companies, tens of millions; for some, hundreds of millions. I worked for a company that spent 1 billion over 4 years and still failed in the end... losing money... guess which one?
Very expensive data processing, staging, mapping, matching, and cleansing software solutions need to be built as a temporary tradeoff.
But in the long run, the problem of keeping raw data for analysis and of optimizing how data is processed and aggregated won't disappear.

Some companies use temporary staging repositories and place raw data there as daily snapshots for the queries needed for analysis, basically 'hardcoding' the elements that need to be stored in a separate repository...
Those repositories can then be queried ad hoc to do on-the-fly data analysis for a fixed time period (a day or a month).
Still, full analysis of data to compare different KPIs across different time series is impossible.
It requires lots of work and investment to keep those queries running as data grows, and again, tradeoffs need to be made:
limiting the time series available to query, partitioning data to allow a longer history to be queried, etc.
Lots of work for ETL and database engineers to keep it going...
etc.

I could continue the list of issues and temporary solutions that I have participated in when dealing with 'Data Crisis' situations... it is a long list, and each time a unique implementation...
But the point of the article is:
why create the grounds for a data crisis in the first place?
Any rational mind says: find the right architect, spend $ on architecting, modeling, and developing an in-house data platform, and you will be in a win-win situation any time after 6 months of development effort (the life cycle for a first version :))...
What to do to AVOID a DATA CRISIS versus FIXING it?
Invest from the beginning in a DW and data platform. Simple? Yes. Doable? Yes.
Design and build distributed data processing platforms, or use vendors to scale large volumes of data.
Design and build distributed DW solutions, or use vendors.
FROM THE VERY BEGINNING OF the business.
Cloud computing, or a grid of computers, is a new technology to scale data and speed up queries, but the technology (the list of vendors and what they are up to :) is in my previous blogs) has not been developed to the point where a single vendor's software package satisfies any customer/application.
No 'perfect' mathematical algorithm has been developed so far to optimize parallel execution of a query, but
there are plenty of working solutions that can be a starting point for your data platforms.
But keep in mind that problem #2 needs to be dealt with by ARCHITECTS experienced with solutions for processing very large volumes of data.
No 'golden egg' vendor software can be found nowadays to solve Problem #2 or to build a single-technology solution for OLTP and DW data repositories.
What to do?

Simple suggestions:
Hire an experienced (hands-on) data architect/technology visionary to review each application flow and start building a plan to reorganize the data operations/data platforms, so that in 4 years you are still in business with nice ROI figures.
Develop in-house data processes (data platform) for transactional (raw) data.
Develop a unified data model based on an abstract normalized data model for your business.
Improve data processes based on this unified (master data) model.
Identify the critical data processes (data elements) that allow the company to grow.
Start building custom data analysis systems (BI, monitoring, DW, reporting) based on
   - requirements on latency
   - frequency of change for data elements (richness)
   - dynamicity of hierarchies (relationships between entities and data elements)
   - etc.

What technologies to use to scale data?

As we all know, it is up to the architect and the team to make the decision and mitigate the risks.
Each business is very different in its IT situation and business rules.
But start from DATA.
A conceptual data model, a logical data model, and data flows will help tremendously in building the optimal solution.
There are too many vendors currently trying to fill the 'scale with computer grids' space, i.e. distributed processing, distributed DW, distributed networks...
Do your research and a feature comparison before taking any steps to use any vendor.
Several open source projects can help you start building a distributed data platform based on the MapReduce mechanism.
Some DFS systems are already in use, but open source mostly offers a very basic starting point to build on top of: Hadoop (Java based), Kosmos (C++ based), etc.
I have researched a list of vendors and open source players in the distributed/cloud computing space and put it in my previous blogs a couple of months ago...

What is the working architecture that I would build for more than 1 bln transactions a day with low-latency reporting requirements (10 minutes or less)?
I'd try to build on top of Kosmos and Hypertable for data processing, aggregate the data, and put it into a distributed file system to be batched into vendors' OLAP products (Oracle, MSAS, SAP) for querying the aggregated data.
Again, some vendors like Greenplum, Teradata, Informatica, and Oracle might serve as a starting base for your DW and ETL needs.
Ideally, I'd like to see a metadata-driven framework that supports parallel execution of jobs, with a failover mechanism, a mechanism to support late-arriving data, a mechanism to build a queue and change the priorities in the queue on the fly, a mechanism to work with more than one cluster (all the monitoring features for network traffic, and load distribution across nodes not only in one cluster but in several), etc.
I do have a full list... that I'd like to continue to work on...
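
Two of the mechanisms from that wish list, a queue whose priorities can change on the fly and failover across nodes, can be sketched in Python as below. This is a toy under my own assumptions (class, function, job, and node names are all hypothetical); it is not a design for the full framework and says nothing about late-arriving data or multi-cluster monitoring.

```python
import heapq
import itertools


class JobQueue:
    """Priority queue whose priorities can be changed while jobs wait."""

    def __init__(self):
        self._heap = []
        self._entries = {}                 # job name -> live heap entry
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, name, priority):
        """Add a job, or re-prioritize it if it is already queued."""
        if name in self._entries:
            self._entries[name][2] = None  # mark the old entry as stale
        entry = [priority, next(self._counter), name]
        self._entries[name] = entry
        heapq.heappush(self._heap, entry)

    def pop(self):
        """Return the queued job with the lowest priority number."""
        while self._heap:
            _priority, _order, name = heapq.heappop(self._heap)
            if name is not None:           # skip stale entries
                del self._entries[name]
                return name
        raise IndexError("queue is empty")


def run_with_failover(job, nodes):
    """Try the job on each node in turn until one succeeds."""
    for node in nodes:
        try:
            return job(node)
        except RuntimeError:
            continue                       # node failed, try the next one
    raise RuntimeError("job failed on all nodes")


# Usage with hypothetical job and node names.
queue = JobQueue()
queue.submit("hourly_aggregate", priority=5)
queue.submit("late_arriving_fix", priority=9)
queue.submit("late_arriving_fix", priority=1)  # bumped on the fly
first = queue.pop()


def flaky_job(node):
    if node == "node-a":
        raise RuntimeError("node down")
    return f"done on {node}"


result = run_with_failover(flaky_job, ["node-a", "node-b"])
```

In a metadata-driven version, the job names, priorities, and node lists would come from configuration tables rather than from code.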
The good news is that the financial crisis gives an opportunity to slow down on throwing money at startups like those I have mentioned above,
and maybe to concentrate time and effort on developing technologies for data processing on distributed networks and, top of the line, on finding/developing the algorithm for parallel query execution as the base for all distributed data processing.
 
So far I'd say that we are in a transition period for the technologies that need to deal with huge volumes of data.
A Data Crisis is bad news for any company.
But at the same time the 'Data Crisis' is good news for the industry, as it will drive (and already has driven) progress in the technology of parallel query processing and distributed data operations.
It is cool to be a data architect and come up with solutions to challenging DATA CRISIS situations.
 







November 07

Financial Crises - What to do?


The media and news are very controversial and misleading in explaining the current critical condition, the collapse of the USA credit market.
I found that this article by Igo Baskin includes some interesting, simple examples and an explanation of how the credit system had supported pyramids with no 'real money', secured by FNM, or basically the US government. Unsecured loans, etc... The collapsed credit system needs to be replaced, not to mention the global crisis, part of which is that investment in US banks and government paper was considered secure by other countries' national banks. However, buying paper from institutions that went broke by issuing unsecured loans is not secure. Etc...
I am not sure that I agree completely with the 'What to do?' suggestions, but overall I like the simplicity with which the article presents the information.
The article is a little bit scary, with no hope, cruel but rational and realistic: that is what the Russian mentality is about.
One small thing, though: the article itself is in Russian :)

We are entering a period of global economic crisis. For the overwhelming majority of people, this will be a time when the familiar, fairly comfortable way of life collapses and enormous material losses are suffered. Real estate will fall catastrophically in price, all the money invested in securities will be lost, and pension savings will melt like snow in the spring sun.

A Jewish saying: "The horse is dead - get off."
It would seem that everything is clear, but...
there is no need to persuade yourself that there is still hope
there is no need to beat the horse harder
it will not help that "we have always ridden this way"
there is no need to revive dead horses or organize events to revive them
there is no need to gather analysts to analyze the dead horse
there is no need to "revive" what has died
there is no need to hire specialists who will help other horses die
etc.
THE HORSE IS DEAD!!!

TIME TO GET OFF...

A new global financial system needs to be built. The old one has died.
Get rid of the 'dead horse' and raise a strong young one: that is the conclusion of Mr. Baskin's article.
Start raising the young financial system...
I am confident that it will happen SOONER than the middle class in the USA starts to melt down...