
Microsoft has introduced two new data science utilities that let data scientists spend less time writing repetitive exploration and modeling code by generalizing it for reuse. The two utilities are Interactive Data Exploration, Analysis and Reporting (IDEAR) and Automated Modeling and Reporting (AMAR).
In its official announcement post, Microsoft says IDEAR and AMAR help data scientists answer the following questions:
- What does the data look like? What's the schema?
- What is the quality of the data? What's the severity of missing data?
- How are individual variables distributed? Do I need to do variable transformation?
- How relevant is the data to the machine learning task? How difficult is the machine learning task itself?
- Which variables are most relevant to the machine learning target?
- Is there any specific clustering pattern in the data?
- How will ML models on the data perform? Which variables are significant in the models?
Writing code to answer those questions is time-consuming. Generalizing that code into a utility increases productivity, since the same code can be reused across projects.
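To make that concrete, here is a minimal sketch of the kind of exploratory boilerplate these utilities generalize. It uses pandas rather than the tools' own R code; the file name `data.csv` and the `target` column are hypothetical placeholders.

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file

# What does the data look like? What's the schema?
print(df.head())
print(df.dtypes)

# What's the severity of missing data? (fraction missing per column)
print(df.isna().mean().sort_values(ascending=False))

# How are individual variables distributed?
print(df.describe(include="all"))

# Which numeric variables track the (hypothetical, numeric) target most closely?
print(df.corr(numeric_only=True)["target"].abs().sort_values(ascending=False))
```

Each project tends to rewrite some variant of this by hand, which is exactly the duplication IDEAR and AMAR are meant to remove.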
IDEAR
The Interactive Data Exploration, Analysis and Reporting tool (IDEAR) presents data through interactive visualizations and analyses.
Microsoft has integrated the Shiny library from RStudio to provide the interactive interface. Users can also export the R scripts behind their visualizations and analysis results as a log file via the “Generate Report” option.
Features of IDEAR include:
- Automatic Variable Type Detection
- Variable Ranking and Target Leaker Identification (see the sketch after this list)
- Visualizing High-Dimensional Data
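A "target leaker" is a variable that predicts the target almost perfectly on its own, usually because it encodes information that would not be available at prediction time. Below is a minimal sketch of one way such leakers can be flagged. It is an illustration of the idea, not IDEAR's actual algorithm; `df`, the binary `target` column, and the 0.98 threshold are all assumptions for the example.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def find_leakers(df: pd.DataFrame, target: str, threshold: float = 0.98):
    """Flag numeric columns whose single-variable AUC against a binary
    target is suspiciously high (a common symptom of target leakage)."""
    y = df[target]
    leakers = []
    for col in df.select_dtypes("number").columns:
        if col == target:
            continue
        scores = df[col].fillna(df[col].median())
        auc = roc_auc_score(y, scores)
        # AUC near 0 is as suspicious as AUC near 1 (inverted ranking).
        if max(auc, 1 - auc) >= threshold:
            leakers.append((col, max(auc, 1 - auc)))
    return sorted(leakers, key=lambda t: -t[1])
```

Flagged columns would then be reviewed by the data scientist and, if confirmed as leakers, dropped before modeling.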
AMAR
The Automated Modeling and Reporting tool, or AMAR, is a customizable tool for building machine learning models. It makes hyper-parameter sweeping and accuracy comparison across models easier (a sketch of the workflow follows the list below). When run, AMAR produces an HTML model report with the following information:
- A view of the top few rows of the dataset used for training.
- The training formula used to create the models.
- The accuracy of each model (AUC, RMSE, etc.) and, if multiple models are trained, a comparison across them.
- Variable importance ranking.
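The following is a minimal sketch of the workflow AMAR automates: sweeping hyper-parameters, comparing model accuracy, and ranking variable importance. AMAR itself is an R tool that emits an HTML report; this Python/scikit-learn version only illustrates the steps, and `training.csv`, the binary `target` column, and the parameter grids are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("training.csv")  # hypothetical dataset with numeric features
X, y = df.drop(columns="target"), df["target"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Hyper-parameter sweep over two candidate model families.
candidates = {
    "logistic": GridSearchCV(LogisticRegression(max_iter=1000),
                             {"C": [0.1, 1.0, 10.0]}),
    "forest":   GridSearchCV(RandomForestClassifier(random_state=0),
                             {"n_estimators": [100, 300]}),
}

# Accuracy comparison across models on held-out data.
for name, search in candidates.items():
    search.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
    print(f"{name}: best params {search.best_params_}, test AUC {auc:.3f}")

# Variable importance ranking from the tree-based model.
forest = candidates["forest"].best_estimator_
ranking = pd.Series(forest.feature_importances_, index=X.columns)
print(ranking.sort_values(ascending=False))
```

AMAR packages these steps behind configuration rather than hand-written code, so the same report can be regenerated for a new dataset by changing parameters instead of rewriting the pipeline.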