# Big data and machine learning

This work is relevant because of the emergence of a new type of data, big data, which opens up new opportunities in almost every sphere of public life. The difficulty lies in processing such volumes of data, and a large number of tools and programs have been created for this purpose, machine learning in particular.

The goal is to examine big data and machine learning and to show, by example, how machine learning works.

Tasks:

1. Search for relevant information on the topic in various written sources and Internet resources;

2. Analyze the information found, generalize and systematize it, and add the author's own observations;

3. Structure and present the data obtained;

4. Formulate a conclusion that answers the question posed in the goal.

The methodological basis of this work includes analysis and generalization of specialized literature and publications in periodicals, observation, description, illustration by examples, and formulation of the author's own opinion and conclusion.

The concept of “big data” has no single unambiguous definition; many interpretations and versions exist. They are united by one thing: big data means a set of special technologies. These technologies are used to process a much larger amount of data (from petabytes: 10^15 bytes) than was possible before the advent of “big data”. Data is defined as a set of objects and a set of corresponding answers (responses). In addition, big data technologies must be able to work with this large amount of data quickly and to handle both structured and poorly structured data.

Big data accumulates continuously in almost every sphere of human life. This includes social media, medicine, banking, and advertising, as well as systems of devices that produce numerous results of daily calculations, such as astronomical observations and meteorological information.

Information from a variety of tracking systems arrives in real time at the servers of companies that use big data.

Big data technologies are inseparable from research and commerce. Moreover, they are beginning to spread into public administration: everywhere there is a need for more efficient systems for storing and handling information.

There are a large number of techniques and methods for analyzing and processing such information. Among the main ones are the following:

Data mining (deep analysis) methods.

These methods are based on the use of special mathematical tools in conjunction with developments in the field of information technology.

Crowdsourcing.

This technique makes it possible to obtain data simultaneously from an almost unlimited number of sources.

A/B testing.

From the total amount of data, a control set of elements is selected and compared in turn with other similar sets in which one of the elements has been changed. This helps to determine which parameter changes have the greatest impact on the set.

Predictive analytics.

This method aims to predict how the object under study will behave so that the most advantageous decision can be made in a given situation.

Machine learning (artificial intelligence).

The method is based on empirical analysis of information and subsequent construction of algorithms for self-learning systems.

Network analysis.

After the statistics are obtained, the nodes of the resulting network are analyzed, that is, the interactions between individual users and their communities.
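
As a rough illustration of the A/B testing idea from the list above, the sketch below compares a control set with a variant in which one element was changed. The visitor and conversion counts are hypothetical, and the two-proportion z-test is one common way to make the comparison, not necessarily the one used in any particular big data system.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that both sets behave the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical counts: control set (A) and a set with one changed element (B).
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests that the changed element, rather than chance, explains the difference between the two sets.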

As mentioned above, machine learning is one of the methods of big data processing. Let us consider it in more detail.

Machine learning is a mathematical discipline that solves the problem of finding patterns in empirical data; on the basis of these patterns, an algorithm can make predictions. Machine learning is counted among the methods of artificial intelligence because it does not solve a problem directly but learns by applying solutions to many similar problems.

Machine learning lies at the intersection of mathematical statistics, optimization techniques, and classical mathematical disciplines, but it also has distinctive features of its own, associated with problems of computational efficiency and overfitting. Many of its methods are also closely related to information extraction and data mining. [1]

Machine learning eliminates the need for a programmer to explain in detail how to solve a problem. Instead, the computer is taught to find a solution on its own.

The algorithm receives a set of labeled data and then uses it to process requests. For example, you can load several photos, each with a description: “this photo shows a dog” or “this photo does not have a dog”. If you then load a large number of new images into the computer, it will start sorting the images itself.

Machines learn to see images and classify them, as in the photo example. They can recognize text, numbers, people, and terrain in these images. Computers not only identify the distinctive features for sorting, but also take into account the context of their use.

Of course, you should not expect 100% correct results; errors do occur. Both correct and erroneous recognition results are stored in the database, which allows the program to learn from its mistakes and cope with the task better. Theoretically, this process of improvement can continue indefinitely. This is the essence of the learning process.
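
Learning from mistakes can be sketched with a perceptron, one of the simplest classifiers: its weights change only when a prediction is wrong. The two numeric features per photo are hypothetical stand-ins for real image features, and the data set is a toy one chosen for illustration.

```python
def predict(weights, bias, features):
    """Return 1 ("dog") if the weighted sum is positive, otherwise 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_perceptron(samples, labels, epochs=20, learning_rate=0.1):
    """Adjust weights only on misclassified samples (learning from errors)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)  # 0 if correct
            if error != 0:
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias


# Hypothetical feature vectors extracted from photos.
samples = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.95),
           (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = "dog", 0 = "no dog"

weights, bias = train_perceptron(samples, labels)
print(predict(weights, bias, (0.85, 0.9)))   # a new, unseen photo
print(predict(weights, bias, (0.15, 0.15)))  # another unseen photo
```

Real image classifiers use far richer features and models, but the update rule shows the same principle: each error nudges the model toward the correct answer.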

There are several common machine learning methods: supervised learning (the most common and practical), unsupervised learning, and reinforcement learning.

The idea of the first method is that a training data set, a “training sample”, is loaded into the system, with the information divided into pairs of input data and output data. The task of the computer is to understand the logic that connects the pairs, build an algorithm, and use it to match new inputs with outputs. The system improves constantly: the analyzed data become its “experience”, and it takes its errors into account and seeks to minimize them in the future. That is how learning happens.

Example: there are data on 1,000,000 apartments in Moscow; for each one, the area, number of rooms, floor, location, availability of parking, and so on are known. In addition, the price of each apartment is known. The task is to build a model that will determine the price of an apartment on the basis of these features. This is a classic example of supervised learning, where we have data (different parameters for each apartment) and feedback (the apartment price). This is called a regression problem.
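
A minimal sketch of such a regression problem, assuming a single feature (area) instead of the full set of apartment parameters, and a handful of hypothetical prices. It fits a straight line by ordinary least squares and then estimates the price of an apartment that was not in the training sample.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training sample: area in square metres and price in
# millions of roubles (the "feedback" for each apartment).
areas = [30, 45, 60, 75, 90]
prices = [6.1, 8.9, 12.2, 14.8, 18.0]

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 50 + intercept  # apartment not seen in training
print(f"price per extra square metre: {slope:.3f}")
print(f"estimated price for 50 square metres: {predicted_price:.2f}")
```

With several features per apartment the same idea extends to multiple linear regression or to more flexible models.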

Unsupervised learning is learning without hints. The training sample consists only of input data. The task of the program is to identify all kinds of dependencies and relationships between the given parameters. Since the data is not paired, the system has no “cheat sheet” with the correct answers and draws its own conclusions about the relationships. The expected result is the division of the information into clusters or the detection of deviations from typical values. With each new pass, the system learns to classify data more accurately.

Example: suppose we know the height and weight of a certain number of newborns. The data must be divided into 3 groups so that rompers of a suitable size can be produced for each group. This is a clustering task. It should be noted that the division into clusters is not always obvious, and often there is no single “correct” division.
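
The newborn example can be sketched with k-means, one widely used clustering algorithm (the text does not prescribe a specific one). The height and weight values are hypothetical, and the starting centroids are picked deterministically from the sorted data so that the run is repeatable; this sketch assumes at least two clusters.

```python
import math


def k_means(points, k, iterations=50):
    """Group points into k clusters by repeatedly moving the centroids."""
    # Deterministic start: k points spread evenly through the sorted data.
    ordered = sorted(points)
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so the result is stable
            break
        centroids = updated
    return centroids, clusters


# Hypothetical measurements: (height in cm, weight in kg).
newborns = [(50, 3.3), (46, 2.5), (54, 4.1), (47, 2.7), (55, 4.3),
            (51, 3.5), (46.5, 2.6), (54.5, 4.2), (50.5, 3.4)]

centroids, clusters = k_means(newborns, k=3)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels are supplied: the three size groups emerge only from how close the measurements are to one another.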

Another approach is reinforcement learning. It involves the interaction of the system with an environment. Each interaction produces a response, positive or negative. This helps the program gradually discover the actions that most reliably produce a positive response.
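
One of the simplest settings for reinforcement learning is the multi-armed bandit, sketched below with an epsilon-greedy strategy. The environment is hypothetical: three possible actions, each answering with a positive response (+1) or a negative one (-1) with a fixed probability that the program does not know in advance.

```python
import random


def run_bandit(success_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn which action earns the best average response."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    estimates = [0.0] * len(success_probs)  # average response per action
    pulls = [0] * len(success_probs)        # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(len(success_probs))
        else:
            # Exploit: choose the action that looks best so far.
            action = max(range(len(estimates)), key=estimates.__getitem__)
        # The environment answers positively (+1) or negatively (-1).
        reward = 1 if rng.random() < success_probs[action] else -1
        pulls[action] += 1
        # Incremental update of the running average for this action.
        estimates[action] += (reward - estimates[action]) / pulls[action]
    return estimates, pulls


# Hypothetical probabilities of a positive response for three actions.
estimates, pulls = run_bandit([0.2, 0.5, 0.8])
print("estimated average responses:", [round(e, 2) for e in estimates])
print("times each action was chosen:", pulls)
```

The program is never told which action is best; it infers this from the responses and gradually favors the action with the highest average.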

After thousands of hours of calculations and operations, a system trained by any of these methods is ready for the unknown. Its well-developed algorithms are capable of forecasting, classifying, and clustering fundamentally new data. While processing that data, the model continues to learn and improve. The learning process lasts for as long as the database is replenished. [1]

Machine learning methods make it possible to understand the client better, make it easier to search for goods, increase conversion, and assess the risks associated with particular investments. According to a survey conducted by MIT Technology Review Custom and Google Cloud, 60% of respondents, representing a wide variety of companies, said that machine learning is their main way of processing data sets. Among their main motives, the survey participants named the desire to extract new knowledge from their data, to gain a competitive advantage, to accelerate the analysis of information, and to create next-generation products.

Two years ago, Facebook, whose algorithms are built using machine learning, open-sourced the software it uses. Last year, Google's machine learning library TensorFlow was also open-sourced. It helps IT professionals understand how machine learning models work and integrate them into their products. Twitter, Uber.

These platforms have greatly reduced the need to understand the algorithms on which machine learning is based. With open tools and commercial cloud solutions, machine learning has become more accessible. Experts predict that very soon anyone will be able to use the mechanisms of machine learning through a clear interface instead of piles of code and algebraic calculations.

Big data is thus becoming an integral part of our everyday existence. Every day, a huge amount of information about us and our preferences helps to create complex concepts such as smart homes, smart cities, and the Internet of Things. Big data analysis helps to improve and transform almost all aspects of our lives.