Description
Current Behaviour
Even when plot.histogram.bins is set explicitly and vars.num.low_categorical_threshold = 0, YData Profiling does not respect the configured number of bins for numeric features with low cardinality (e.g. [0.2, 0.4, ..., 1.0]).
This results in bar plots or histograms with fewer bins than requested, despite numeric treatment being forced. This appears to be a bug, as user configuration should override internal heuristics. The behaviour appears to be caused by the following line, which caps the bin count at the number of unique values:
bins_arg = "auto" if hist_config.bins == 0 else min(hist_config.bins, n_unique)
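To make the effect of that line concrete, here is a minimal sketch that evaluates the same expression outside the library. The function name `current_bins_arg` is hypothetical and only wraps the quoted expression; it is not part of ydata-profiling.

```python
def current_bins_arg(configured_bins, n_unique):
    # Hypothetical wrapper around the expression quoted above:
    # the configured bin count is capped at the number of unique values.
    return "auto" if configured_bins == 0 else min(configured_bins, n_unique)


# Same values as the reproduction data: 5 distinct values, 100 rows.
values = [0.2, 0.4, 0.6, 0.8, 1.0] * 20
n_unique = len(set(values))

# 30 bins are requested, but the cap reduces this to the 5 unique values.
capped = current_bins_arg(30, n_unique)
print(capped)
```

With these inputs the requested 30 bins are reduced to 5, which matches the behaviour described in this report.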
Expected Behaviour
I would expect an explicitly configured number of bins to be respected.
Alternatively, there should at least be an additional configuration variable, bins_override, that lets the user state the number of bins explicitly and override the heuristics.
Instead of:
bins_arg = "auto" if hist_config.bins == 0 else min(hist_config.bins, n_unique)
it would be nice to have something like:
if hist_config.bins_override:
    bins_arg = hist_config.bins_override
else:
    bins_arg = "auto" if hist_config.bins == 0 else min(hist_config.bins, n_unique)
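The proposal can also be sketched as a self-contained function. Everything here is an assumption for illustration: `HistogramConfig` is a hypothetical stand-in for the library's histogram settings, `bins_override` does not exist in ydata-profiling today, and `proposed_bins_arg` is a made-up name. The sketch tests for `None` rather than truthiness, so an unset override is distinguished from any explicit value.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class HistogramConfig:
    # Hypothetical stand-in for the library's histogram settings.
    bins: int = 50
    # Proposed field; not an existing ydata-profiling option.
    bins_override: Optional[int] = None


def proposed_bins_arg(hist_config: HistogramConfig, n_unique: int) -> Union[int, str]:
    # An explicit override wins over the cardinality-based cap.
    if hist_config.bins_override is not None:
        return hist_config.bins_override
    # Otherwise fall back to the current heuristic.
    return "auto" if hist_config.bins == 0 else min(hist_config.bins, n_unique)


# 5 unique values, 30 bins requested.
without_override = proposed_bins_arg(HistogramConfig(bins=30), 5)
with_override = proposed_bins_arg(HistogramConfig(bins=30, bins_override=30), 5)
print(without_override, with_override)
```

Keeping the override as a separate field would leave the existing default behaviour unchanged for users who do not set it.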
Data Description
Any numeric dataset with fewer unique values than the configured number of bins.
Code that reproduces the bug
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.DataFrame({"x": [0.2, 0.4, 0.6, 0.8, 1.0] * 20})
profile = ProfileReport(
    df,
    explorative=True
)
profile.config.vars.num.low_categorical_threshold = 0
profile.config.plot.histogram.bins = 30
profile.to_file("report.html")
pandas-profiling version
v4.16.1
Dependencies
ydata-profiling==4.16.1
OS
No response
Checklist
- There is not yet another bug report for this issue in the issue tracker
- The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
- The issue has not been resolved by the entries listed under Common Issues.