Sergey's BI Space
June 03
SAP goes with Eclipse open source

Amazon works with Eclipse, and now SAP is trying to put dedicated resources into further development of Eclipse open source. It will be interesting to hear how MS responds to BI/DW open source initiatives.

May 20
Vendors for Large Scale Analytics - part 2

Mr. Wayne Eckerson, director of TDWI (The Data Warehousing Institute) Research, published an article back in October 2008, "Beyond Reporting: Requirements for Large Scale Analytics". In addition to my blog post from October 2008, please find below some other vendors that provide solutions to help scale the processing of large volumes of data and build DW repositories to query that data:

"Types of Analytical Platforms: The most innovative sector of the business intelligence industry has been among database vendors, both new and old, that have shipped almost two dozen new products in the past year designed to accelerate query performance on large volumes of data. Here is a high-level categorization of these products.

- MPP Analytic Databases: Specialized, stand-alone databases designed to run on MPP hardware and accelerate query performance. Examples: Aster nCluster, DATAllegro (now owned by Microsoft), Greenplum 3.2, IBM DB2, Kognitio WX2, Teradata 12.0
- Data Warehouse Appliances: A purpose-built machine with preconfigured MPP hardware and software designed for analytical processing. Examples: Dataupia Satori Server; Kickfire Analytic Appliance; Hewlett Packard NeoView; IBM InfoSphere Balanced Warehouse; Netezza Performance Server; Oracle Optimized Warehouse (with various hardware vendors); Teradata 550, 2550, and 5550 machines; Greenplum; and Sun’s Data Warehousing Appliance
- Columnar Databases: Store data in columns instead of rows, allowing greater compression and faster query performance. Examples: InfoBright Data Warehouse, ParAccel, Sybase IQ, Vertica
- Complex Event Processing Systems: A system that captures and analyzes real-time streaming data. Examples: Cognos Now!,
SeeWhy, Streambase, Syndera, Truviso"

I'd be very glad to hear that parallel execution of queries has been SOLVED by such-and-such algorithm, so that the industry can move on to developing vendor products that scale to large data volumes (petabytes of data) EASILY. But so far there are multiple vendors, multiple choices, multiple decisions. Data architects need to be very creative and flexible in implementing any of the existing custom solutions to scale data processing and optimize queries, so that the DW can produce answers for monitoring, ROI, campaign/sales management, financial reporting for periodic analytics, user behavior, fraud analysis, forecasting, etc.

Or maybe it is time to dig out object-oriented databases and build QBE (Query by Example) DW repositories/applications? Or in-memory databases/repositories? Architects will decide what 'cocktail' of technologies to use, based on the client's needs.

May 12
Why Isn’t Predictive Analytics a Big Thing, by Eugene Asahara

Many companies would like to have forecasting as part of their DW/data mining efforts, but very few can define what data mining is and what tools can help with building analytics and predictive analytics as part of their TARGETING and ROI efforts. Eugene Asahara, with whom I had a chance to work indirectly back in 1999, has posted a very good article about MS data mining products and a possible DM architecture.

November 30
Internet companies and Data Crisis

After researching startup companies in the Bay Area that are developing social networks, ad serving, web usage analytics, targeting and SEO, and Web 2.0 (SaaS) applications, I've decided to write down some comments that, in my opinion, define a situation I call the 'Data Crisis'.

Startup companies do not pay much attention to modeling the abstract concept of the business. Why? The first goal is to grow the user base, i.e.
to build quick interfaces that attract as many users and as many pages as possible, especially if the company makes money on a CPM advertising model: more web pages mean more potential 'ad displays'. During this growth of the user base, requirements for data accuracy, a single point of truth, uniqueness of records, and overall data processing and data modeling are completely neglected. That leads to redundancy, inaccurate data, duplicates, ineffective data processing, wrong data solutions (quick ones, not thought through or prepared to be enhanced by ADDING features rather than 'redeveloping'), inability to organize normalized data processing, inability to make accurate analysis to increase ROI, etc.

Cost and time to market drive 'dirty' development and a complete lack of normal form in the data representation. At best, data models support UI functions, which leads to highly denormalized data flows and redundant data stores. The next step in the game is that the company cannot enhance features, only ADD features, which leads to additional denormalization.

The step after that is that the company starts to ask questions: What if? Which feature is more productive for the business? Who are the users? Which users are doing better than others? How can we group and categorize products and users? How can we better target marketing campaigns to increase ROI? Etc.

Basically, deep analysis needs to be made using the data that has been collected. In the best scenario, the history of records has been preserved, but in most cases it has not. Therefore, starting the 'task' of collecting history in order to make analysis based on time series (periods) drives the need for Master Data (uniquely identified records for the major conceptual entities of the business). In other words, the data model needs to be built in a highly normalized way.
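To make 'uniquely identified records' a bit more concrete, here is a minimal Python sketch of collapsing duplicate user rows into master records. The field names (email, source), the sample rows and the matching rule (a normalized email address) are my assumptions for illustration only; real master data matching is usually far more involved.

```python
def normalize_email(email):
    """Assumed matching rule: trim whitespace and lowercase the address."""
    return email.strip().lower()


def build_master_users(rows):
    """Collapse duplicate rows into one master record per normalized email."""
    master = {}
    for row in rows:
        key = normalize_email(row["email"])
        # Create the master record the first time this key is seen.
        record = master.setdefault(
            key, {"user_id": len(master) + 1, "email": key, "sources": []}
        )
        # Remember every source system that contributed to this record.
        record["sources"].append(row["source"])
    return master


# Hypothetical denormalized rows captured by two different UI features.
ROWS = [
    {"email": " Ann@Example.com", "source": "signup"},
    {"email": "ann@example.com ", "source": "ads"},
    {"email": "bob@example.com", "source": "signup"},
]

MASTER = build_master_users(ROWS)
for record in MASTER.values():
    print(record)
```

The point of the sketch is only that three captured rows become two uniquely identified users, each of which can then carry history for time series analysis.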
The question then arises: should the company create additional data repositories to extract Master Data and develop solutions for lookup, data staging, data cleansing, data mapping, data matching and data formatting? OR just rely on temporary solutions that run certain queries against denormalized data repositories (mostly in MySQL, Oracle, SQL Server) in order to extract data as one-time snapshots?

Most likely the next step is that the company continues operations by hiring more and more developers, then contractors, and finally outsourcing the projects to India, China or Russia. And if no VC funds can be secured to continue supporting the applications, then most likely it will slowly bleed money and soon go 'belly up', as we say. Why? Because the cost of supporting such a 'Data Crisis' (in other words, a data mess) becomes so high, and the work so unproductive, that customers will not be satisfied with the services or results and will move to better sites: the competitors.

So what is the Data Crisis by my definition? Lack of a normalized data model; lack of understanding and visibility of the Conceptual Data Representation (entity model) of the abstract model of doing business; lack of common/reusable modules; high inaccuracy of data (>35%); inability to separate (logically and physically) transaction data and data processes from aggregated data (analytics, OLAP, BI, DW); redundancy of data processing and data repositories; etc.

Some startups got lucky and were sold to large corporations at such times, but most companies discontinue operations. This situation is hitting Internet social networks first; it will hit businesses operating very large databases in different verticals or industries at different times. But it is here now.

Take a look at social networking, for example. There is huge potential to collect data and make it work for improving sales, analyzing users and their behavior on the Internet (besides porn and the stock market), etc.
But most social networks survive on the AdSense business alone: display advertising by Google. That is definitely a sufficient revenue model if you have millions of views, but it leaves no room for new business models based on analytics and research on the collected data. Why? Because they can't process and store data in a normal form of the conceptual business model. Companies don't pay attention to data processing, data modeling and data architecture until the data is so large that it becomes inoperable.

I have researched 169 companies over the last half year in the Bay Area and San Francisco. I talked, chatted or met with 27 startups, asking questions, researching and going through two-way interviews, trying to find out which companies paid attention to the data platform and architected back-end operations in such a way that the business can grow, data can be scaled, and data repositories with aggregates can be 'ready' to provide data analysis.

Two of the companies I researched, Quantcast and Zvents, went in the right direction (my personal opinion). They have started to build CUSTOM solutions for data scaling following Google's direction: the MapReduce mechanism for parallel ETL and a DFS (Hadoop or Kosmos) to store data. These companies have a clear understanding of the need to develop a DATA PLATFORM 'in-house' to scale, store and query data with volumes in the range of 1 billion+ transactions a day.

Why only 2 of 169?! I guess because of the desire to make it 'quick' and sell the business.

I talked to one very popular company that runs apps on Facebook, owned by former PayPal guys. Its hundreds of millions of users can't be 'converted' to hundreds of millions of dollars. Why? For me it is clear: lack of a custom-built data platform, and no possibility to build aggregation repositories for time series analysis that would improve marketing and the decision-making process on features to be added or discontinued, etc. The VP of marketing said: "What would we build a data warehouse <data platform> for? We can ad-hoc the data.
eBay (PayPal) tried to build a DW but ended up with 2 people using it." Actually, that indicates a 'data crisis' in a certain project at eBay, a large corporation. It shows the lack of a data platform that can be integrated, where records can be uniquely identified, data can be scaled in raw format (as it is captured), etc.; therefore, in the end there are only 2 DW users. It proves again and again that neglect of data models, master data and normalized data processing leads to the inability to analyze data and, therefore, the inability to build a DW to monitor and report on what is going on with the business and how to improve ROI.

Unfortunately, I've been hearing 'ad-hoc' ideas quite often, because of the lack of data platforms and the lack of in-house technologies to query aggregated repositories. "Let's build queries and then we'll see." Again, that is a 'Data Crisis' too. Instead of investing in a data platform that can be scaled and can grow with volumes, the company just allocates some money for ad-hoc database software engineers to 'twist the data queries'. It will grow like a snowball of resources and 'spaghetti' of code and data. In 3 years, I doubt that anybody from management or the owners will be able to get an accurate picture of how to improve the business and get more revenue.

So the company goes with ad-hoc queries on growing terabytes of data? Good luck. Most likely, first the time period covered by analysis will be reduced, then vertical partitioning will help, and in the end cross-functional analysis queries just won't be possible to run. The result is very limited visibility into what is going on in the business, not to mention the lack of any analytical models for forecasting, profiling, etc.

This is an example of how a 'Data Crisis' situation can also be created 'by design', as a result of a lack of technical vision or an incorrect interpretation of a 'bad' experience.
I'd like to give a simple example of the components that are supposed to be part of the architecture. Most startup companies have neglected these abstract principles of architecture and have therefore been paying a high price for it; in my words, they are in a mess of data and data operations, defined here as a 'data crisis'.

The following is a simplified architectural approach to building web-based applications, say for an Internet marketing/advertising business or for hosting SaaS applications. Let's define the simple bricks (components) of the software foundation that need to be architected into the application:

1. Front end (browser/client components)
2. Middle-tier business components (server side, supporting the UI/front-end components)
3. Application server components (business rules execution, data connections, data integration, data collection, queries, data returned by queries, data manipulation to serve the front end or the back end (ODS))
4. Back-end components (data repositories to process and organize (model) data for the business needs of TRANSACTIONAL processing)
5. DW/BI and analysis/data mining components (aggregated data to provide business analysis and reporting)

I've simplified the definitions a little. The point is that 4 of the 5 SYSTEM COMPONENTS represent the data repositories and data processing parts of any Internet advertising business. Therefore, architecting the front-end (UI) component while not architecting/modeling data for the four other software layers is a huge mistake.

Unfortunately, in most startups the cost and other factors such as

- time-to-market requirements
- frequent changes of the UI
- frequent changes of features
- lack of an architect role/position as a 'gatekeeper'
- lack of resources to develop and test
- etc.

lead to simplifying the architecture to front-end and back-end components only. Therefore, data is denormalized based on UI functions, and all data processing is partitioned based on UI functionality.
Data processing, data analysis, cross-functional reporting, data mining, etc. become very challenging, if not impossible, tasks. The lack of an abstract normalized data model (sometimes called master data for the business) really is the 'data crisis'.

This is problem #1: lack of a software architecture based on an abstract business model and an abstract normalized data model. On the data side this problem brings the inability:

- to optimize data processing
- to speed up transaction processing
- to build normalized and optimized data storage repositories (DBs or filers) for OLTP
- to build normalized data (star schema) for analysis and data mining (forecasting) as well as fraud protection analysis
- to integrate external data sources
- to modify and enhance
- to do cross-functional reporting
- to monitor data and the performance of the business
- etc.

It brings a snowball of support costs, not to mention the inability to grow the business by adding 'new features'. Yes, lack of data architecture and data modeling by design is the 'data crisis'.

Problem #2 that can define a DATA crisis is the lack of technologies to scale large volumes of data. When a business is growing, larger data volumes need to be processed and stored. That requires 'special treatment' from the architect: coming up with a platform that can scale data and at the same time meet the performance requirements for querying data repositories. The issue is that each of the software components needs to be 'ready', i.e. architected for scale and fast query execution. In most startup companies this is not the case, for many reasons, some of them defined above.

Problem #3 is the inability to foresee, or to accept and deal with, problems #1 and #2.

I have not had much luxury of stepping into a company when development started from scratch. Most of the time I have served as a 'fireman' for companies that are in a 'data crisis' situation. Cover up or fix the 'data crisis'? Hire contractors to build hardcoded solutions?
Hire contractors to blame for the failure? Fire/hire full-time employees? Restructure/fire/hire managers? Start redevelopment as a NEW project by adding a new data platform development group? Hire more managers and developers for permanent positions to continue supporting the snowball of problems? Start new development with a strategic plan to move the 'old' business flows into the new one step by step? Restructure the groups and set up ownership for certain features/applications/systems? Outsource to India or China to reduce cost? Outsource support and start new development by hiring or retraining resources? Start building a data platform/technology inside the company to secure the next 4-7 years of business growth? All of the above has been happening in the industry.

What are the tradeoffs or temporary solutions in such a 'Data Crisis'?

Some companies come up with an architecture that pre-aggregates data to reduce volumes, thereby losing raw data that would be needed for analysis and a clear understanding of business success. It also causes an inability to compete with other companies by adding new features based on analysis of raw data. Examples are:

- 'targeting' of ads based on user behavior
- forecasting
- monitoring
- etc.

If no raw data has been saved in the data repositories, no analysis can be done and no 'targeting' feature can be applied to earn more money, improve customer satisfaction, etc.

Pre-aggregated data creates lots of challenges and adds the cost of keeping track of data relationships per transaction (user action). For example, when data is pre-aggregated at the account (advertiser) level to calculate the amount of money left to continue marketing ads through publishers, analysis of what/who/where (geo/demo) viewed the ad is no longer possible, because raw data per user action (transaction) is aggregated to the account level, which is a higher level of the hierarchy than the ad (company --> account --> campaign --> order --> order item --> ad).
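The loss of detail described above can be shown with a tiny Python sketch. The event fields (account, campaign, ad, geo, cost), the sample values and the rollup helper are all hypothetical and only illustrate the idea: any level of the hierarchy can be rolled up from raw events, but nothing below the account level can be recovered from an account-level aggregate.

```python
from collections import defaultdict

# Hypothetical raw ad events: one record per user action (impression).
RAW_EVENTS = [
    {"account": "acme", "campaign": "spring", "ad": "banner-1", "geo": "US", "cost": 0.50},
    {"account": "acme", "campaign": "spring", "ad": "banner-2", "geo": "DE", "cost": 0.25},
    {"account": "acme", "campaign": "summer", "ad": "banner-3", "geo": "US", "cost": 0.25},
    {"account": "globex", "campaign": "launch", "ad": "video-1", "geo": "US", "cost": 1.00},
]


def rollup(events, *keys):
    """Sum cost grouped by the given keys (one level of the hierarchy)."""
    totals = defaultdict(float)
    for event in events:
        totals[tuple(event[key] for key in keys)] += event["cost"]
    return dict(totals)


# Account-level aggregate: enough to track remaining budget, nothing more.
BY_ACCOUNT = rollup(RAW_EVENTS, "account")

# Ad/geo breakdown: only possible because the raw events were kept.
BY_AD_GEO = rollup(RAW_EVENTS, "account", "ad", "geo")

print(BY_ACCOUNT)
print(BY_AD_GEO)
```

If only BY_ACCOUNT had been stored, the question "which ad was viewed where?" could not be answered later, which is exactly the tradeoff of pre-aggregation.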
The lack of business data model hierarchies and of a normalized data model will cost lots of money in the long run: for some companies tens of millions, for some hundreds of millions. I worked for a company that spent 1 billion over 4 years and still failed in the end, losing money. Guess which one?

Very expensive data processing, staging, mapping, matching and cleansing software solutions need to be built as a temporary tradeoff. But in the long run, the problem of keeping raw data for analysis and of optimizing the processing and aggregation of data won't disappear.

Some companies use temporary staging repositories and place raw data there as daily snapshots for the queries that need to be performed for analysis, basically 'hardcoding' the elements that need to be stored in a separate repository. Those repositories can then be queried ad hoc for on-the-fly data analysis over a fixed time period (a day or a month). Still, full analysis of data to compare different KPIs across different time series is impossible. It requires lots of work and investment to keep those queries running as data grows, and again tradeoffs need to be made: limiting the time series that can be queried, partitioning data to allow longer history to be queried, etc. It is lots of work for ETL and database engineers to keep it going.

I could continue the list of issues and temporary solutions that I have taken part in when dealing with 'Data Crisis' situations. It is a long list, and each time a unique implementation. But the point of the article is: why create the grounds for a data crisis in the first place, when any rationale says: find the right architect, spend money on architecting, modeling and developing an in-house data platform, and you will be in a win-win situation any time after 6 months of development effort (life cycle for first version

What to do to avoid a DATA CRISIS versus FIXING it? Invest from the beginning in a DW and data platform. Simple? Yes. Doable? Yes.
Design and build distributed data processing platforms, or use vendors, to scale large volumes of data. Design and build distributed DW solutions, or use vendors. FROM THE VERY BEGINNING OF the business.

Cloud computing, or a grid of computers, is a new technology to scale and to speed up queries, but the technology (list of vendors and what they up to

There is no 'perfect' mathematical algorithm developed so far to optimize parallel execution of a query, but there are plenty of working solutions that can be a starting point for your data platform. Keep in mind, though, that problem #2 needs to be dealt with by ARCHITECTS experienced with solutions that process very large volumes of data. No 'golden egg' vendor software can be found nowadays to solve problem #2 or to build a single-technology solution for both OLTP and DW data repositories.

What to do? Simple suggestions:

- Hire an experienced (hands-on) data architect/technology visionary to review each application flow and start building the plan to reorganize the data operations/data platforms, so that in 4 years the company is still in business with nice ROI figures.
- Develop in-house data processes (a data platform) for transactional (raw) data.
- Develop a unified data model based on an abstract normalized data model for your business.
- Improve data processes based on this unified (master data) model.
- Identify the critical data processes (data elements) that allow the company to grow.
- Start building custom data analysis systems (BI, monitoring, DW, reporting) based on
  - requirements on latency
  - frequency of change for data elements (richness)
  - dynamicity of hierarchies (relationships between entities and data elements)
  - etc.

What technologies should be used to scale data? As we all know, it is up to the architect and the team to make the decision and mitigate the risks. Each business is very different in its IT situation and business rules. But start from the DATA. A conceptual data model, a logical data model and data flows will help tremendously in building the optimum solution.
There are too many vendors currently trying to fill the 'scale with computer grids' space, i.e. distributed processing, distributed DW, distributed networks. Do your research and a feature comparison before taking any steps with any vendor. Several open source projects can help to start building a distributed data platform based on the MapReduce mechanism. Some DFS systems are already in use, but open source mostly offers a very basic starting point to build on top of: Hadoop (Java based), Kosmos (C++ based), etc. I have researched a list of vendors and open source players in the distributed/cloud computing space and put it in my previous blog posts a couple of months ago.

What is the working architecture that I would build for more than 1 billion transactions a day with low-latency reporting requirements (10 minutes or less)? I'd try to build on top of Kosmos and Hypertable for data processing, aggregate the data, and put it into a distributed file system to be batched into vendors' OLAP products (Oracle, MSAS, SAP) for querying the aggregated data. Again, some vendors like Greenplum, Teradata, Informatica and Oracle might serve as a starting base for your DW and ETL needs.

Ideally, I'd like to see a (metadata-driven) framework to support parallel execution of jobs, with a failover mechanism, a mechanism to support late-arriving data, a mechanism to build a queue and change the priorities of the queue on the fly, a mechanism to work with more than one cluster (all the monitoring features for network traffic, and load distribution across nodes not only in one cluster but in several), etc. I do have a full list that I'd like to continue to work on.
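For readers who have not seen the MapReduce mechanism, here is a minimal single-machine Python sketch of the idea: each partition is mapped to partial counts in parallel, and the partial counts are then reduced into totals. This is not Hadoop or Kosmos code. The log format, the function names and the use of a thread pool as a stand-in for cluster nodes are my assumptions, and the sketch has none of the failover, late-arriving data or queue features listed above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log partitions, e.g. one per file in a distributed file system.
PARTITIONS = [
    ["ad-1 click", "ad-2 view", "ad-1 view"],
    ["ad-2 view", "ad-1 click"],
    ["ad-3 view"],
]


def map_partition(lines):
    """Map step: count (ad, action) pairs within one partition."""
    return Counter(tuple(line.split()) for line in lines)


def reduce_counts(partials):
    """Reduce step: merge the partial counts into overall totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(partitions, workers=3):
    """Run the map step over all partitions in parallel, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce_counts(partials)


TOTALS = run_job(PARTITIONS)
print(TOTALS)
```

The useful property is that map_partition never needs to see more than one partition, so in a real platform each call could run on the node that already holds that partition's data.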
The good news is that the financial crisis gives an opportunity to slow down on throwing money at the startups I have mentioned above, and maybe to concentrate time and effort on developing technologies for data processing on distributed networks and, which would be top of the line, on finding/developing the algorithm for parallel query execution as the base for all distributed data processing.

So far I'd say that we are in a transition period for technologies that need to deal with huge volumes of data. A Data Crisis is bad news for any company. But at the same time the 'Data Crisis' is good news for the industry, as it will drive, and has already been driving, progress in the technology of parallel query processing and distributed data operations. It is cool to be a data architect and come up with solutions to challenge DATA CRISIS situations.

November 07
Financial Crises - What to do?

Media and news are very controversial and misleading in explaining the current critical condition of the collapsing USA credit market. I found that this article by Igo Baskin includes some interesting simple examples and an explanation of how the credit system supported pyramids with no 'real money', secured by FNM, or basically the US government. Unsecured loans, etc. The collapsed credit system needs to be replaced, not to mention the global crisis, part of which is that investment in US banks and government paper was considered secure by other countries' national banks. However, it is not secure to buy paper from institutions that went broke by issuing unsecured loans. I am not sure that I agree completely with the 'What to do?' suggestions, but overall I like the simplicity with which the article presents the information. The article is a little bit scary, with no hope, cruel but rational and realistic: that is what the Russian mentality is about. One small thing, though: you have to read it in Russian.

We are entering a period of global economic crisis.
For the overwhelming majority of people this will be a time when their accustomed, fairly comfortable way of life collapses, and a time of enormous material losses. Real estate will fall catastrophically in price, all money invested in securities will be lost, and pension savings will melt like snow in the spring sun.

A Jewish saying: "The horse is dead - get off." It would seem that everything is clear, but...

- do not talk yourself into believing there is still hope
- do not beat the horse harder
- it will not help that "we have always ridden this way"
- do not revive dead horses or organize events to revive them
- do not gather analysts to analyze the dead horse
- do not "revive" what has died
- do not hire specialists who will help other horses die
- etc.

THE HORSE IS DEAD!!! TIME TO GET OFF...

A new global financial system needs to be built. The old one has died. Get rid of the 'dead horse' and raise a strong young one: that is the conclusion of Mr. Baskin's article. Start growing the young financial system. I am confident that it will happen SOONER than the middle class in the USA starts to melt down.

October 26
Похождения бравого солдата Швейка во время Мировой войны (The Good Soldier Švejk and His Fortunes in the World War)

Finally I found the book that I've been looking for for a while: Гашек Ярослав | Hašek Jaroslav, Похождения бравого солдата Швейка во время Мировой войны.

October 22
Microsoft Analysis Services 2008 Unleashed

A new book on Analysis Services 2008 is coming out. Great news and a pleasant feeling, as I have worked with Irina, Sasha and Edward at Microsoft. I have the first book signed by them and will catch them in Seattle to sign the second one. I look forward to reading and reviewing it.

October 18
Forecasting and analytics: SPSS or SAS? Windows or Unix?

Please find below a very subjective opinion; I'd like to learn more about both packages when implementing models in practice.
I've decided to put together a quick overview of what I've experienced so far.
Which statistical package to use, and on what platform? SAS or SPSS?

For small datasets or for a novice in statistical modeling, I guess SPSS might be the better choice. The Windows version is easier to use than the UNIX one. SPSS makes data entry by the user easier. The MS Windows version is much faster than the Unix version running in X Windows.

For researchers with large datasets and more complex statistical analyses, SAS may be the better package. Running under either MS Windows or Unix, SAS is currently more powerful than SPSS, as well as more complicated. On both systems, SAS now has better graphing capabilities.

For general data management, SAS has certain advantages over SPSS. With SAS it is easier to merge and concatenate datasets, and easier to pipe the output of one dataset into another, than with SPSS. It is also easier with SAS to take the output of one statistical procedure and feed it into the input of another statistical procedure. SPSS value labels are easier to create than SAS variable formats. SPSS is more modular and less flexible in its data management than SAS. But for data entry, I think, SPSS for Windows allows for easier input.

The number, power and flexibility of SAS statistical procedures are generally better than those of SPSS. For categorical data analysis, SAS offers more tests than SPSS. SAS also contains a wider variety of regression and ANOVA (analysis of variance) procedures than SPSS. SAS Graph far exceeds the current capabilities of SPSS Chart.

Based on my trial use, preparing for proper usage of the SAS system, with its greater variety of options, involves much more homework than for SPSS. Overall, it is commonly held that SPSS is more user-friendly, but for the advanced user or statistician, SAS may be more powerful than SPSS.

October 16
Financial crisis - analytics

"Pigs will be slaughtered" - that is how money managers talk about 'entities' which are failing.
Nowadays Iceland is the first pig, isn't it? Who is the next one? Take a look at a chart and make your guess. Unfortunately, high-tech startups will also see lots of 'reductions in force'. TechCrunch is keeping the latest info at deadpool.

October 10
Microsoft BI Conference 2008

I did not go to the MS BI conference in Seattle this year. Some materials are here.

It is an interesting 'battlefield' for technologies that deal with large volumes of data (100+ million rows, hundreds of terabytes per day): open source projects and Linux vendors versus Microsoft's MSAS, Kilimanjaro (SQL Server 2010 = SQL Server 2008 + Zoomix + DATAllegro) and Gemini. The challenge for BI/DW/decision support systems to work with large volumes (collect, scale, aggregate, store, query, display) is here. BI/DW success will really depend on the competency of the company's resources (architects/devs/PMs) rather than on a single technology to rely on.

Based on rumors and a limited understanding of what Gemini (MSAS 2010) is, it sounds like a mechanism for caching cubes in memory to speed up MDX execution against large cubes (100 million rows in the fact table). That would be a huge improvement in performance; I am not sure about reliability: failover and persistence-of-updates mechanisms. How great it would be if MSAS also had a CONCEPT of scalability (parallel execution of MDX queries against several nodes holding horizontally partitioned records of the cubes) as a new feature to improve query performance. I would love to see the ability to connect to data files and process data into cubes directly from files residing on one machine, or several machines. Wouldn't it be great?! From several files residing on several nodes (machines) it would be awesome!

It looks like Amir Netz is 'back in the BI business', and I hope to see his 'new baby' as a technology killer for BI engines!

October 01
CEOs - survival of the UNFITTEST

What a great metaphor by Mr. Icahn! Survival of the UNFITTEST!
Icahn Report: "He <CEO> would never have anyone underneath him as his assistant that’s brighter than he is because that might constitute a threat. So therefore, with many exceptions, we have CEOs becoming dumber and dumber and dumber. We can all see where this is going. It would almost be funny if it wasn’t such a threat to our ability to compete and to our economy in general."

Strong but very true. Look at the financial crisis and how many millions of dollars the CEOs of failed financial institutions, some of them only 5 days (!) on the job, are getting on departure. Unfortunately, the software industry has plenty of such 'paid for failure' examples as well.

September 30
Yahoo's AMP - Advertisers-Publishers Exchanges platform - failure or success?

I'd like to put together some "news available on the Internet" in order to answer all your questions in one blog post.

1. What is Panama, Right Media, APEX, AMP, APT?

<S> It is a project that has been going on since 2002. The goals were:
a) to build a better system for advertisers' and publishers' e-marketing campaigns on the Internet
b) to build DW/BI/analytics/targeting platforms and solutions to work with very large volumes of data (petabytes)
c) to improve ROI for advertisers and publishers
d) to find new monetizing techniques for Yahoo's business

Simplifying all of the above: to build ad serving and data processing platforms to compete with Google AdSense and AdWords.

2. Success or failure?

September 2008 --> APT VIDEO
June 2008 --> Apex + Right Media
April 2008 --> AMP
April 2008 --> APEX
May 2006 --> Panama directions

These news items are not my personal opinion on the APEX team or Yahoo's executives during the attempt to sell off the company. However, I can say that I have been through a very positive experience building scalable solutions with very talented software engineers, as well as negative experiences during the 'transition period' of Yahoo's attempted sell-off and downsizing. It is great to see some results of the hard work in APT VIDEO.
I really wish the best to the Panama (APT) team, or what is left of it: may it succeed in contributing to Yahoo's goals and objectives in scaling data and in building DW applications that help optimize advertisers' and publishers' marketing efforts.

September 04
Web Analytics sites - Compete, Alexa, Quantcast
Compete looks like it has more precise traffic data than Alexa. But I really like Quantcast, which is catching up with Compete and Alexa on stats and, I think, has started doing user behavior analytics: suggesting the CATEGORY of the site and the USER BEHAVIOR - what other sites the users visited as well as what keywords the users of this site searched for.
Compete also displays the keyword that drives the most traffic to the site; however, that does not help in targeting the right users with the right ad campaigns, as a keyword can belong to so many market segments!!!!
Quantcast shows stats about users by demographics and age, and has started to touch on PROFILING THE USER: it tries to assign the sites that users will likely visit after this domain/site to CATEGORIES (market segmentation), which makes it possible to assign the users of this site to certain market segments in order to target them with the right advertising content.
"Audience also likes": for example, I have checked marketmetrix.com. Its business model is servicing the hospitality industry, and Quantcast gave an exact match for the business vertical, suggesting 3 categories to assign this site to: hotels, airlines, car rentals... which exactly identifies the marketmetrix business vertical. Cool!
"Audience also visits": even more information, on the domain names that users of this site have visited - again a very close match to the marketmetrix business vertical.
"Audience also search": this part looks a little bit "off", as the keywords shown on Quantcast are not really related to the hospitality vertical.
Quantcast has definitely shown the BEGINNING OF stats/analytics that help to understand user behavior on a site and attempt to assign the site to certain market segments based on user behavior/traffic, not just display stats on the number of users (traffic) or returning users...

Where are former Yahoo execs now?
TechCrunch posted the list of former execs who left Yahoo here.

August 20
Analytics and AB testing
How to improve click rate? How to bring in more users? How to optimize ad serving? How to segment users per market and product for better ROI? What attributes to include in stats modeling for user profiling? How to build better user experiences to attract more site visitors?
AB testing does help to run the numbers and make the right decisions based on experiments (part of web analytics/BI/DW). There is a great white paper from the MS Experimentation Platform (AB testing), "Seven Pitfalls to Avoid when Running Controlled Experiments on the Web", as well as a great blog on AB testing by Andy Edmonds, a former MS Live scientist.
"The ability to experiment easily is a critical factor for Web-based applications. The online world is never static. There is a constant flow of new users, new products and new technologies. Being able to figure out quickly what works and what doesn’t can mean the difference between survival and extinction." – Hal Varian (Varian, 2007)
And this is a huge challenge for mid-size or small companies: to build experimental data platforms to drive business decisions. A data platform means scalable solutions to process and aggregate data; therefore, it is a DW platform. What is not clearly understood by young entrepreneurs in small startups is that data analysis is possible only on top of a working data platform: data structures, data flows, data processes, data aggregations, tools and apps to query/display the results of analysis, as well as tools for data mining and stats modeling...
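As a small illustration of "running the numbers" for an AB test, here is a minimal sketch, assuming the metric is click rate and that a pooled two-proportion z-test is an acceptable check. The function name and the sample counts are hypothetical and are not taken from the white paper; a real experiment also needs clean, de-duplicated counts from the data platform discussed in this post.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the click-rate difference B - A."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled click rate under the assumption that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_err == 0:
        # No variation at all (pooled rate is 0 or 1): nothing to test.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: variant A vs. variant B of the same page.
z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value only says the difference is unlikely to be noise; whether the counts themselves can be trusted is a question for the data foundation underneath.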
Again, I'd like to emphasize that analytics and BI can't produce FULLY accurate results from ad-hoc queries against the data that supports business applications: for example, social network UIs, widgets, ad serving and other operational data repositories that serve business/UI needs. And this is the major misdirection of vision: building analysis applications with the same architectural approach as the UI apps that drive the web display part of the business. To be able to compete and survive, a DW platform needs to be built; therefore, technologies/solutions need to be put in place for data processing, data aggregation and data querying, as well as modeling.
Most of the companies, small startups in the social networking space that I have had the privilege to talk/chat/interview with, are facing the issue of not being able to query their volumes of data ad hoc, not to mention the challenge of collecting the growing volumes; therefore, they need data platform solutions to scale data in order to continue growing the business. And this means $ and resources. But very few of the startups I have talked to in the last 6 months, I would say 2 out of 18, have the understanding and vision to build the foundation for data processing and aggregation, i.e. a Data Warehouse.
Another challenge is that Agile approaches to building a web site and frequently adding features to it have led to highly denormalized back-end data structures, and making such 'structures' or data models work for analysis is a very difficult, almost impossible task without cleansing/formatting/matching/mapping, etc., to normalize the data and avoid duplicates/redundancy of information. Only after such data effort can the aggregations be built, based on the questions that will identify the DW solutions: data mining/BI. And when this foundation is built, data analytics/business intelligence/stats can provide CORRECT results that will help to build and grow the business....
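The cleanse-then-aggregate order described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record fields (event_id, user, ts, action) and the sample rows are hypothetical, and a real DW pipeline would do the same steps at scale with proper tooling.

```python
from collections import Counter

def normalize(record):
    """Map one loosely formatted raw record to clean (event_id, user, day, action)."""
    return (
        record["event_id"].strip(),
        record["user"].strip().lower(),
        record["ts"][:10],               # keep the YYYY-MM-DD part of the timestamp
        record["action"].strip().lower(),
    )

def daily_counts(raw_records):
    """Drop duplicate events, then count actions per day."""
    seen = set()
    counts = Counter()
    for rec in raw_records:
        event_id, _user, day, action = normalize(rec)
        if event_id in seen:             # same event logged twice by the web app
            continue
        seen.add(event_id)
        counts[(day, action)] += 1
    return counts

# Hypothetical raw rows as a feature-driven web back end might store them.
raw = [
    {"event_id": "e1", "user": "Anna", "ts": "2008-08-20T10:01:00", "action": "click"},
    {"event_id": " e1", "user": "anna ", "ts": "2008-08-20T10:01:00", "action": "Click"},
    {"event_id": "e2", "user": "bob", "ts": "2008-08-20T11:30:00", "action": "click"},
    {"event_id": "e3", "user": "bob", "ts": "2008-08-20T11:31:00", "action": "view"},
]
counts = daily_counts(raw)
print(counts)
```

Without the normalization step the duplicated row would be counted as a third click, which is exactly the kind of error that makes ad-hoc numbers from operational data unreliable.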
This post is about best practices with AB testing, which is part of the data analytics/mining tasks built on top of a DW platform. That means a DEDICATED budget and resources, as well as ownership. Ownership is a separate topic, as many startups with a few million dollars throw them at consulting companies instead of building in-house expertise. The success of AB testing experiments and the display of their results will always rely on a solid DW platform/data foundation that needs to be in place.

August 18
How is USA economy doing?
How is the USA job market in IT doing? I implement 'sample analytics' ...just kidding. The logic is very simple: go to some popular 'IT recruiting' user groups, like "Resumes-in-IT", and follow the pattern. And? How is the USA job market in IT doing? The answer is in RECENT ACTIVITIES: Post Activity
August 16
Vendors with DW solutions to scale data
Podcasts are here: Denomo, Truviso, IBM, Swashup, Dataupia, Intel MashMaker, StreamBase.
PS: I wouldn't name StreamBase as a DW vendor but rather as an implementor of event-driven programming. In theory, an Event is the data structure that serves as the INPUT format into the processing pipe. The richness of such a data structure has a direct impact on the performance of the pipe. Therefore, in reality, an Event may correspond to several data structures in order to optimize the performance of data pipes or to support different DW consumer needs: low latency, high availability, traffic peaks, critical data flows, etc. Therefore, the theory that a unified data structure can be created from an Event and processed in the data pipe as is (no need for vertical partitioning) might stay as an architectural goal, but it is very difficult to achieve because, I'd like to say it again, the richness and volumes of data, as well as frequent changes in requirements on latency, performance, aggregations, etc., may lead to designing several data structures based on the Event.

August 09
Analytics and Video Ads
Somehow, the simple idea of surveys as a SOURCE of data FOR ANALYTICS is kind of forgotten in the current boom of social networks with video ads.
Before trying to build complex data collection/cleansing/processing/aggregation applications while trying to figure out monetizing paths for video ads... why not crunch figures from direct polls of users/publishers/advertisers? That, versus trying to 'fish' for PV/users/IP addresses, exclude fraudulent robot clicks, cleanse inaccurate data from log files, run up against "privacy laws" while matching users' demo/gender stats to results, etc...
I applaud the TubeMogul blog: simple and clear ANALYTICS based on a direct user survey!!!! It is not yet statistical analysis or modeling/attribution for profiling, or optimization techniques for publishing video ads, etc... but it is a clear path to CLARIFYING the attributes that might help define the next steps for identifying product classes, narrowing down groups of users/publishers, building market segmentation, etc... TO OPTIMIZE ROI for publishing video content. A clear business model.
This is what business intelligence/statistics/analytics is for, as a first step of course: GET NUMBERS from questions and answers (SURVEYS) directly from CUSTOMERS and then, based on tracked numbers of PV/users etc., start building the suggestions/optimizations, etc....
The numbers from tracking PV/users (demo, geo, properties) are too broad for any statistical modeling, or even for suggesting sampling techniques with probabilities extended to the overall data volumes. There is plenty of work for data modeling, data mining and data analysis teams to build tools to track and crunch numbers in order to help companies target the right audience with the right ads, etc... which is the end result of any ANALYTICS tool/application: to increase revenue and improve sales. And again, it is a process, part of the DW effort, which is not flexible, as I have mentioned in some previous blogs, and requires skills, time and $ signs to design/build/support/maintain.
Web analytics has a long way to go and, in my opinion, it is neither quick ad-hoc querying nor simple linear algebra for calculating means or averages based on stats from PV/users/clicks (CPC)/displays (CPM). Modeling and statistical techniques for marketing and profiling are required, and this is the next step, which takes some time to comprehend, for fast-growing startups building ad publishing campaigns/advertising businesses on social networks. |
|
|