A clean multi-platform re-implementation of dataforge concepts


Questions and Answers

This section covers the main ideas of DataForge in the form of questions and answers.

General

Q: I have a lot of data to analyze. The analysis process is complicated, requires many stages, and the data flow is not always obvious. On top of that, the data size is huge, so I don't want to perform operations I don't need (calculating something I won't use or calculating something twice). And yes, I need it to run in parallel and probably on a remote computer. By the way, I am sick and tired of scripts that modify other scripts that control scripts. Could you help me?

A: Yes, that is precisely the problem DataForge was made to solve. It performs automated data manipulations with automatic optimization and parallelization. The important thing is that data processing recipes are written declaratively, so it is quite easy to run computations on a remote station. DataForge also guarantees reproducibility of analysis results.
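
The "don't calculate what I don't need, and never twice" part can be illustrated with a minimal sketch of lazy, memoized steps. This is not DataForge code: `LazyStep` and `LazyStepDemo` are hypothetical names invented for the illustration, and the actual framework may organize this very differently.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical illustration (not a DataForge class): a step that runs on first request only. */
final class LazyStep<T> implements Supplier<T> {
    private final Supplier<T> computation;
    private T value;
    private boolean done;

    LazyStep(Supplier<T> computation) {
        this.computation = computation;
    }

    @Override
    public synchronized T get() {
        if (!done) {                 // compute at most once, then reuse the result
            value = computation.get();
            done = true;
        }
        return value;
    }
}

final class LazyStepDemo {
    // Counts how many times the expensive loading step actually ran.
    static final AtomicInteger loads = new AtomicInteger();

    public static void main(String[] args) {
        LazyStep<int[]> raw = new LazyStep<>(() -> {
            loads.incrementAndGet();
            return new int[]{1, 2, 3, 4};
        });
        // Two results share the same upstream step.
        LazyStep<Integer> sum = new LazyStep<>(() -> Arrays.stream(raw.get()).sum());
        LazyStep<Integer> max = new LazyStep<>(() -> Arrays.stream(raw.get()).max().getAsInt());
        // Declared but never requested, so its body never runs.
        LazyStep<Integer> unused = new LazyStep<>(() -> {
            throw new IllegalStateException("never requested");
        });

        System.out.println(sum.get() + " " + max.get() + " loads=" + loads.get()); // prints 10 4 loads=1
    }
}
```

Declaring a step costs nothing; work happens only when a result is requested, and a shared upstream step is evaluated once however many results depend on it.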


Q: How does it work?

A: At the core of DataForge lies the idea of a metadata processor. It builds on the observation that in order to analyze something you need the data itself plus some additional information about what that data represents and what the user wants as a result. This additional information is called metadata and can be organized in a regular structure (a tree of values, not unlike XML or JSON). The important thing is that this distinction leaves no place for user instructions (or scripts). Indeed, the idea behind DataForge logic is that one does not need imperative commands. The framework configures itself according to the input metadata and decides which operations should be performed, in the most efficient way.
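
To make "a tree of values" concrete, here is a minimal sketch that models metadata as nested maps and resolves a dot-separated name against it. The class `MetaSketch`, the `get` helper and the keys (`fit`, `model`, `range`) are assumptions made up for this example; they are not the DataForge metadata API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a metadata tree; the real DataForge types may differ. */
final class MetaSketch {
    /** Resolves a dot-separated name such as "fit.range.to"; returns null if any segment is missing. */
    static Object get(Map<String, Object> root, String name) {
        Object current = root;
        for (String token : name.split("\\.")) {
            if (!(current instanceof Map)) {
                return null; // reached a leaf (or nothing) before the name was exhausted
            }
            current = ((Map<?, ?>) current).get(token);
        }
        return current;
    }

    public static void main(String[] args) {
        // Describes what is wanted, not how to compute it.
        Map<String, Object> range = new LinkedHashMap<>();
        range.put("from", 0.5);
        range.put("to", 12.0);
        Map<String, Object> fit = new LinkedHashMap<>();
        fit.put("model", "gaussian");
        fit.put("range", range);
        Map<String, Object> meta = new LinkedHashMap<>();
        meta.put("fit", fit);

        System.out.println(get(meta, "fit.model"));    // prints gaussian
        System.out.println(get(meta, "fit.range.to")); // prints 12.0
        System.out.println(get(meta, "fit.missing"));  // prints null
    }
}
```

The tree holds only values and names, with no instructions, which is the distinction the answer above draws between metadata and scripts.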


Q: But where does it get the algorithms to use?

A: Of course algorithms must be written somewhere. No magic here. The logic is written in specialized modules. Some modules are provided out of the box at the system core; others need to be developed for a specific problem.
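
One way to picture this split is a registry that modules fill with named actions, while the metadata only says which action to run. The sketch below is an assumption-laden illustration: `ActionRegistry`, the `action` key and the action names `sum` and `mean` are invented here and do not come from DataForge.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical registry of algorithms; names and structure are made up for illustration. */
final class ActionRegistry {
    private final Map<String, Function<double[], Double>> actions = new HashMap<>();

    /** Called by a module to contribute an algorithm under a name. */
    void register(String name, Function<double[], Double> action) {
        actions.put(name, action);
    }

    /** Runs the action named by the "action" key of the metadata. */
    double run(Map<String, String> meta, double[] data) {
        String name = meta.get("action");
        Function<double[], Double> action = actions.get(name);
        if (action == null) {
            throw new IllegalArgumentException("No module provides action: " + name);
        }
        return action.apply(data);
    }
}

final class ActionRegistryDemo {
    public static void main(String[] args) {
        ActionRegistry registry = new ActionRegistry();
        // Pretend this one ships with the core...
        registry.register("sum", data -> Arrays.stream(data).sum());
        // ...and this one comes from a problem-specific module.
        registry.register("mean", data -> Arrays.stream(data).average().orElse(Double.NaN));

        double[] data = {1.0, 2.0, 3.0, 4.0};
        Map<String, String> meta = new HashMap<>();
        meta.put("action", "mean");
        System.out.println(registry.run(meta, data)); // prints 2.5
        meta.put("action", "sum");
        System.out.println(registry.run(meta, data)); // prints 10.0
    }
}
```

The person who edits the metadata never touches the algorithm code; switching from `mean` to `sum` is a change of one value.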


Q: So I still need to write the code? What is the difference then?

A: Yes, someone still needs to write the code. But not necessarily you. Simple operations can be performed using the provided core logic. Also, your group can have one programmer writing the logic and everyone else using it without any real programming expertise. The framework is also organized in such a way that whoever writes additional logic does not need to think about complicated things like parallel computing, resource handling, logging, caching, etc. Most of these things are handled by DataForge.


Platform

Q: Which platform does DataForge use? Which operating system does it work on?

A: DataForge is mostly written in Java and uses the JVM as its platform. It works on any system that supports the JVM (meaning almost any modern system, excluding some mobile platforms).


Q: But Java... it is slow!

A: It is not. It lacks some hardware-specific optimizations and requires some additional time to start (due to the nature of JIT compilation), but otherwise it is at least as fast as other languages traditionally used in science. More importantly, memory safety, tooling support and a vast ecosystem make it the number one candidate for a data analysis framework.


Q: Can I use my C++/Fortran/Python code in DataForge?

A: Yes, as long as the code can be called from Java. Most common languages have a bridge for Java access. Compiled C/Fortran libraries pose no problems at all. Python code can be called via one of the existing Python-Java interfaces. It is also planned to implement remote method invocation for common languages, so your Python or, say, Julia code could run in its native environment. The metadata processor paradigm makes this much easier.


Features

Q: What other features does DataForge provide?

A: Alongside metadata processing (and a lot of tools for metadata manipulation and layering), DataForge has two additional important concepts:

  • Modularisation. Unlike many other frameworks, DataForge is intrinsically modular. The only mandatory part is a rather tiny core module. Everything else can be customized.

  • Context encapsulation. Every DataForge task is executed in some context. The context isolates the environment for the task, serves as a base for dependency injection, and specifies how the task interacts with the external world.
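
Context encapsulation can be sketched as a chain of contexts in which a task looks up what it needs locally first and then in the parent. The class `ContextSketch` and its `provide`/`find` methods are hypothetical names chosen for this illustration and are not the DataForge `Context` API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical context: holds services for a task and falls back to its parent. */
final class ContextSketch {
    private final ContextSketch parent; // null for the root context
    private final Map<Class<?>, Object> services = new HashMap<>();

    ContextSketch(ContextSketch parent) {
        this.parent = parent;
    }

    /** Registers a service in this context only; the parent is not affected. */
    <T> void provide(Class<T> type, T service) {
        services.put(type, service);
    }

    /** Looks the service up locally, then in the chain of parents. */
    <T> Optional<T> find(Class<T> type) {
        Object local = services.get(type);
        if (local != null) {
            return Optional.of(type.cast(local));
        }
        return parent == null ? Optional.empty() : parent.find(type);
    }

    public static void main(String[] args) {
        ContextSketch root = new ContextSketch(null);
        root.provide(Integer.class, 4); // stands in for a shared setting, e.g. worker count

        ContextSketch task = new ContextSketch(root);
        System.out.println(task.find(Integer.class).get()); // inherited from root: prints 4

        task.provide(Integer.class, 1); // override visible to this task only
        System.out.println(task.find(Integer.class).get()); // prints 1
        System.out.println(root.find(Integer.class).get()); // prints 4
    }
}
```

The task never reaches for global state: everything it sees comes through its context, which is what makes the context usable both for isolation and for dependency injection.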



Misc

Q: So everything looks great. Can I replace my ROOT / other data analysis framework with DataForge?

A: Note that DataForge is made for analysis, not for visualisation. The visualisation and user interaction capabilities of DataForge are rather limited compared to frameworks like ROOT, JAS3 or DataMelt. The idea is to provide a reliable API and core functionality. In fact, JAS3 and DataMelt could be used as a frontend for DataForge mechanics. It is planned to add an interface to ROOT via JFreeHep AIDA.


Q: How does DataForge compare to cluster computation frameworks like Hadoop or Spark?

A: Again, it is not the purpose of DataForge to replace cluster software. DataForge has some internal parallelism mechanics and implementations, but they are most certainly worse than specially developed programs. Still, DataForge is not tied to a single implementation. Your favourite parallel processing tool can still be used as a back-end for DataForge, with the full benefit of its configuration tools and integrations and no performance overhead.
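
The idea of a swappable back-end can be sketched with a small interface that analysis code depends on, plus interchangeable implementations. `ComputeBackend`, `SequentialBackend` and `ParallelStreamBackend` are hypothetical names for this example only; a real integration with a cluster tool would sit behind a similar seam but is not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical seam between analysis logic and the engine that executes it. */
interface ComputeBackend {
    <T, R> List<R> mapAll(List<T> items, Function<T, R> step);
}

/** Runs every step on the calling thread. */
final class SequentialBackend implements ComputeBackend {
    @Override
    public <T, R> List<R> mapAll(List<T> items, Function<T, R> step) {
        return items.stream().map(step).collect(Collectors.toList());
    }
}

/** Uses the JDK's parallel streams; result order still follows the input list. */
final class ParallelStreamBackend implements ComputeBackend {
    @Override
    public <T, R> List<R> mapAll(List<T> items, Function<T, R> step) {
        return items.parallelStream().map(step).collect(Collectors.toList());
    }
}

final class BackendDemo {
    /** Analysis code: knows what to compute, not where or how it runs. */
    static List<Integer> squares(ComputeBackend backend, List<Integer> input) {
        return backend.mapAll(input, x -> x * x);
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        System.out.println(squares(new SequentialBackend(), input));     // prints [1, 4, 9, 16]
        System.out.println(squares(new ParallelStreamBackend(), input)); // prints [1, 4, 9, 16]
    }
}
```

Because `squares` only sees the interface, replacing the execution engine is a one-line change at the call site.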


Q: Can I use DataForge on a mobile platform?

A: DataForge is modular. The core and most of the API are pretty compact, so they can be used in Android applications. Some modules are designed for PC and cannot be used on other platforms. The iPhone does not support Java and therefore can only use client-side DataForge applications.