Data science and IT are no longer distinct disciplines. Think of them as a partnership.
The data science world, in its purest state, consists of parallel-processing servers that mostly run Hadoop in batch mode, the large data warehouses those servers operate against, and data scientists who are statistically trained but often unfamiliar with IT or with the requirements of maintaining IT operations.
Some organizations embed their data science teams within IT, where specialized IT management and support are close at hand; others run their data science departments independently of IT. These independent groups often have little insight into the IT work required to sustain and support the health of a big data ecosystem.
This is also why many organizations are exploring how to get data science and IT to work together.
For CIOs and data center leaders who need to be heavily involved in the IT-data science partnership, what key areas should be covered to ensure IT support for data science operations?
Two or three years ago, it was practically a given that Hadoop, the dominant big data and data science platform in the enterprise, ran in batch mode. This made it easy for organizations to run big data applications on commodity computing hardware. Now, with the shift toward real-time data processing, commodity hardware is giving way to in-memory processing, SSD storage, and the Apache Spark cluster computing framework. This demands processing power that commodity servers cannot necessarily deliver, and it requires IT know-how to configure hardware components for optimal performance. Accustomed to fixed-record, transactional computing environments, not all IT departments have staff with the skills to operate or tune parallel in-memory processing. This is one technical area where IT may need cross-training or new hires.
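To make the tuning work concrete, here is a minimal sketch of what a Spark configuration file might look like when IT sizes executors for in-memory processing. The property names are real Spark settings, but the values are illustrative assumptions; the right numbers depend entirely on the cluster's hardware and workload.

```properties
# spark-defaults.conf (illustrative values only; tune per cluster)
spark.executor.memory     8g        # heap per executor; must fit the node's RAM budget
spark.executor.cores      4         # parallel tasks per executor
spark.memory.fraction     0.6       # share of heap reserved for execution and caching
spark.serializer          org.apache.spark.serializer.KryoSerializer  # faster serialization
spark.local.dir           /mnt/ssd/spark-tmp  # spill to SSD, not slow disk
```

Choices like these are exactly where the cross-training mentioned above matters: sizing memory fractions and spill directories for an in-memory engine is a different skill from capacity-planning a transactional database server.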
In the Hadoop world, MapReduce is the dominant programming model for processing and generating large datasets with distributed algorithms running in parallel on a cluster. Apache Spark processes data in memory, enabling real-time big data processing. Organizations are moving toward more real-time processing, but they also recognize the value Hadoop brings in batch environments. From a software perspective, IT must be able to support both platforms.
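For readers less familiar with the MapReduce model, here is a minimal single-machine sketch of its three phases in plain Python. This is purely illustrative: real Hadoop distributes these phases across a cluster, while here each phase runs locally so the data flow is easy to follow.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, 1) pair for each word in each input record."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

records = ["big data needs IT", "data science needs IT support"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
```

The same word-count logic, expressed against Spark's in-memory API instead, would avoid writing intermediate results to disk between phases, which is the source of Spark's speed advantage for iterative and real-time workloads.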
Most IT departments work with a hybrid compute infrastructure that combines internal systems and applications in the data center with private and public cloud systems. This has required IT to think beyond the data center and implement management policies, processes, and operations for systems, applications, and data that may reside on premises, in the cloud, or both. Operationally, this means IT must continue to manage its internal technology assets while also working with cloud vendors that manage externally hosted assets, or managing cloud-hosted assets itself when the cloud provides only storage and the business retains responsibility for management.
Supporting data science and big data on this more sophisticated infrastructure takes IT's technology-management responsibilities a step further, because the management goals for big data differ from those for traditional fixed-record data.
Among the big data support issues IT must decide are:

- How much big data, which is huge and continuously growing, should be stored, and what data should be discarded?
- What are cloud providers' storage and processing price points, and at what point do cloud storage and processing become more expensive than comparable on-premises options?
- What is the disaster recovery plan for big data and its applications, which are becoming mission-critical for organizations?
- Who is responsible for the SLA, especially in the cloud, when a big data production incident occurs?
- How can data be moved securely between the cloud and the data center?
Data scientists have expertise in statistical analysis and algorithm development, but they do not necessarily know how much data, or what data, is available for them to work with. This is an area where IT excels, because its organizational charter is to track all data in enterprise storage, as well as data coming in and going out.
Suppose a marketing manager wants to develop customer analytics that factor in certain events stored internally in a customer file and in customers' purchase and service history with the company, and the manager also wants to learn what customers are interested in by tracking their activity on the website and social media. IT knows best how to identify all the paths to an overall picture of customer information. IT's database team, working in parallel with other IT departments, can develop joins of datasets that aggregate all of this data, so that the algorithms the data scientists develop can operate on it and produce the most accurate results.
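A minimal sketch of the kind of join the database team might build is shown below, using SQLite and entirely hypothetical table and column names (customers, purchases, web_activity). The point is the pattern: one aggregated, per-customer view that analytics code can run against.

```python
import sqlite3

# Hypothetical schema and sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (customer_id INTEGER, amount REAL);
    CREATE TABLE web_activity (customer_id INTEGER, page_views INTEGER);

    INSERT INTO customers VALUES (1, 'Acme Corp'), (2, 'Globex');
    INSERT INTO purchases VALUES (1, 120.0), (1, 80.0), (2, 45.0);
    INSERT INTO web_activity VALUES (1, 12), (2, 3);
""")

# One row per customer, combining purchase history with website activity.
rows = conn.execute("""
    SELECT c.name,
           COALESCE(SUM(p.amount), 0)     AS total_spend,
           COALESCE(MAX(w.page_views), 0) AS page_views
    FROM customers c
    LEFT JOIN purchases p    ON p.customer_id = c.customer_id
    LEFT JOIN web_activity w ON w.customer_id = c.customer_id
    GROUP BY c.customer_id
""").fetchall()
```

The LEFT JOINs ensure customers with no purchases or no web activity still appear in the result, which matters when the downstream model needs the full customer population rather than only active buyers.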
Without this expertise in knowing where the data is and how to access and synthesize it, data scientists and analytics engineers will struggle to arrive at the insights that can benefit the business.
IT support of data science activities is a central pillar of business analytics success.
IT allows data scientists to do what they do best: design algorithms that extract the best insights from the data. At the same time, IT operates in its own wheelhouse: knowing where to find data and how to synthesize it.