Show HN: Desbordante 2.0

Hi! We are excited to announce the second release of Desbordante — an open-source, high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms. This major release brings a lot of improvements. Its primary focus is Desbordante’s core: we add several new primitives for pattern discovery.

Changes:

* New feature: discovery of exact order dependencies. This primitive allows you to discover patterns related to orderings of columns, e.g. pay increasing with grade. It is available with two different axiomatizations — set-based and list-based. The latter is faster, but may miss some dependencies, while the former is more accurate, but computationally more demanding. Note that they present dependencies in different formats.

* New feature: discovery of probabilistic functional dependencies (PFDs) for both existing metrics: PerTuple and PerValue. This primitive helps in discovering a special case of approximate functional dependencies (AFDs) that better detects multiple violations in a small set of clusters. The provided examples illustrate the differences between the existing AFD formulation and PFDs, and show some potential use cases of PFDs.

* New feature: discovery of inclusion dependencies. This primitive can help users to recover primary key — foreign key relationships, or to find joinable columns in a table or a collection of tables. It is available as an exact algorithm (Spider) and as an approximate one (Faida), with Faida potentially producing errors but being much faster.

* New feature: we extend the set of supported data types by adding graphs. We started by supporting graph functional dependencies (GFDs), and Desbordante can now validate them. GFDs allow users to define patterns in graphs, specifying conditions on both graph structure and node content. Graph dependencies can be a bit tricky, so we provide illustrated examples.

* We’ve made discovery of conditional functional dependencies available in Python. This primitive can be considered as:

  1) another way to define approximate functional dependencies, which, unlike other approaches, offers rich semantics (context), helping in understanding complex cases when the exact FD does not hold,

  2) an AFD discovery algorithm which provides control over how frequent and how consistent this pattern is,

  3) a building block for many existing data repair algorithms.

* We’ve also made validation of approximate unique column combinations available in Python. This primitive is suitable for defining keys in tables and for detecting partial duplicates over a subset of columns. As is usually the case with any validation primitive, we additionally provide discovery of exceptions and computation of improved thresholds.
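
To make two of the primitives above more concrete, here is a minimal pure-Python sketch of what it means for an order dependency and an inclusion dependency to hold. It does not use Desbordante's API and is not how discovery algorithms such as Spider or Faida work internally. The function names and the toy table are hypothetical, and the checks are naive validations of a single candidate.

```python
# Illustrative only: naive validation of one candidate dependency.
# Function names and data are hypothetical, not part of Desbordante.

def order_dependency_holds(rows, lhs, rhs):
    """True if ordering the rows by `lhs` also orders them by `rhs`."""
    ordered = sorted(rows, key=lambda r: (r[lhs], r[rhs]))
    for prev, cur in zip(ordered, ordered[1:]):
        if prev[lhs] == cur[lhs]:
            # Equal lhs values must come with equal rhs values.
            if prev[rhs] != cur[rhs]:
                return False
        elif prev[rhs] > cur[rhs]:
            # A larger lhs value must not come with a smaller rhs value.
            return False
    return True


def inclusion_dependency_holds(dependent, referenced):
    """True if every value of `dependent` also occurs in `referenced`."""
    return set(dependent) <= set(referenced)


employees = [
    {"grade": 1, "pay": 1000, "dept_id": 10},
    {"grade": 2, "pay": 1500, "dept_id": 20},
    {"grade": 2, "pay": 1500, "dept_id": 10},
    {"grade": 3, "pay": 2100, "dept_id": 30},
]
department_ids = [10, 20, 30, 40]

# Pay increases with grade, so this candidate holds.
print(order_dependency_holds(employees, "grade", "pay"))
# Every dept_id used by an employee exists in the department list.
print(inclusion_dependency_holds(
    [e["dept_id"] for e in employees], department_ids))
```

Discovery is the harder problem: it has to search over all candidate column pairs or lists rather than test a single one, which is where the dedicated algorithms come in.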

For all introduced primitives, we provide descriptive examples. All primitives are supported in the console version of Desbordante, with the help file containing references to papers in which these primitives are described.
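
The difference between the PerTuple and PerValue metrics mentioned above can be sketched in a few lines of plain Python. This assumes the definitions commonly used in the PFD literature: within each cluster of rows sharing a left-hand-side value, take the share of rows carrying the most frequent right-hand-side value, then either average those shares over clusters (PerValue) or weight them by cluster size (PerTuple). The function name and the data are hypothetical, and Desbordante's implementation may differ in details.

```python
from collections import Counter, defaultdict

# Illustrative only: assumes the usual PerValue / PerTuple definitions.
# Not Desbordante's API; expects a non-empty list of rows.

def pfd_probabilities(rows, lhs, rhs):
    """Return (per_value, per_tuple) for the candidate lhs -> rhs."""
    clusters = defaultdict(Counter)
    for row in rows:
        clusters[row[lhs]][row[rhs]] += 1

    # Count of the most frequent rhs value inside each lhs cluster.
    best = [max(c.values()) for c in clusters.values()]
    sizes = [sum(c.values()) for c in clusters.values()]

    # PerValue: every distinct lhs value contributes equally.
    per_value = sum(b / s for b, s in zip(best, sizes)) / len(clusters)
    # PerTuple: every row contributes equally.
    per_tuple = sum(best) / sum(sizes)
    return per_value, per_tuple


rows = [
    {"city": "a", "zip": 1},
    {"city": "a", "zip": 1},
    {"city": "a", "zip": 2},
    {"city": "a", "zip": 3},
    {"city": "b", "zip": 5},
]

per_value, per_tuple = pfd_probabilities(rows, "city", "zip")
print(per_value, per_tuple)
```

In this toy table all violations sit in the single cluster for city "a", so the two metrics score the same candidate differently.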

Miscellaneous:

* We have established a GitHub organization and gathered all repositories related to our project in one place.

* We have extended the coverage of the option for limiting the maximum size of the left-hand side to all functional dependency discovery algorithms. This should allow users to speed up the FD discovery if they do not need dependencies with large LHSes.

* We’ve added many new example scripts. Since the project is currently under-documented, we hope this will be helpful for our potential users. You can see them here (https://github.com/Desbordante/desbordante-core/tree/main/ex...).

* To improve our overall documentation level, we have also published several guides — see the papers section (https://desbordante.unidata-platform.ru/papers).
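
As a rough illustration of why limiting the maximum left-hand-side size can help, the sketch below counts how many non-empty column sets exist with and without a size cap. It is only an upper bound on the search space: real FD discovery algorithms prune heavily, so this is not a runtime prediction. The function name is hypothetical and unrelated to Desbordante's option names.

```python
from math import comb

# Illustrative only: counts candidate left-hand sides, ignoring pruning.

def candidate_lhs_count(num_columns, max_lhs=None):
    """Number of non-empty column sets of size at most `max_lhs`."""
    limit = num_columns if max_lhs is None else min(max_lhs, num_columns)
    return sum(comb(num_columns, k) for k in range(1, limit + 1))


print(candidate_lhs_count(20))     # 1048575 candidate sets without a cap
print(candidate_lhs_count(20, 3))  # 1350 candidate sets with LHS size <= 3
```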

Comments URL: https://news.ycombinator.com/item?id=40063137

Points: 8

# Comments: 2

https://github.com/Desbordante/desbordante-core

Created 13d | 17 Apr. 2024 16:50:16
