@mcos-ntakouris mcos-ntakouris commented Sep 9, 2025

Hello, we are the owners of https://github.com/macrocosm-os/data-universe/commits/main/

Here is our current package dependency tree:

scalecodec==1.2.11
├── async-substrate-interface==1.5.1 [requires: scalecodec~=1.2.11]
│   └── bittensor==9.9.0 [requires: async-substrate-interface>=1.4.2]
│       └── bittensor_data_universe==1.3.8 [requires: bittensor==9.9.0]
└── bittensor==9.9.0 [requires: scalecodec==1.2.11]
    └── bittensor_data_universe==1.3.8 [requires: bittensor==9.9.0]

During a memory-leak investigation, using the following snippet

    def run_memory_monitor(self):
        import gc, time, tracemalloc
        import datetime as dt

        dump_file = "validator_memory_objects.txt"
        TOP_N = 200               # top N allocation sites to display (tracebacks/files)
        FILE_LIMIT = 100          # top N files by size
        DELTA_LIMIT = 100         # top N growth sites vs previous tick

        n_seconds_between_dumps = 10
        n_seconds_till_last_dump = 0

        prev_snapshot = None  # for delta comparisons

        while not self.should_exit:
            time.sleep(1.0)
            n_seconds_till_last_dump += 1
            if n_seconds_till_last_dump < n_seconds_between_dumps:
                continue
            n_seconds_till_last_dump = 0

            try:
                bt.logging.debug("[memory monitor] gc.collect()")
            except Exception:
                pass

            gc.collect()

            snap = tracemalloc.take_snapshot()

            with open(dump_file, "w", encoding="utf-8") as f:
                now = dt.datetime.now(dt.timezone.utc).strftime('%d/%m/%Y, %H:%M:%S')
                f.write(f"=== Memory Monitor — {now} UTC ===\n")

                f.write("\n\n--- Top allocations by traceback (tracemalloc) ---\n")
                stats_tb = snap.statistics('traceback')
                total_bytes = sum(stat.size for stat in stats_tb)
                total_blocks = sum(stat.count for stat in stats_tb)
                f.write(f"Total tracked: {total_bytes/1024/1024:,.2f} MB across {total_blocks:,} blocks\n")

                for i, stat in enumerate(stats_tb[:TOP_N], 1):
                    size_mb = stat.size / 1024 / 1024
                    f.write(f"{i:4d}. {size_mb:8.2f} MB | {stat.count:7d} blocks\n")
                    # Print the most relevant few frames (last frames are usually app code)
                    frames = stat.traceback.format()
                    for frame in frames[-5:]:
                        f.write(f"      {frame}\n")


                f.write(f"\n\n--- Memory usage by file/module (top {FILE_LIMIT}, tracemalloc) ---\n")
                stats_file = snap.statistics('filename')
                for stat in stats_file[:FILE_LIMIT]:
                    size_mb = stat.size / 1024 / 1024
                    # Prefer the frame filename if present
                    filename = stat.traceback[0].filename if stat.traceback else "<unknown>"
                    f.write(f"{size_mb:8.2f} MB | {stat.count:7d} blocks | {filename}\n")

                if prev_snapshot is not None:
                    f.write("\n\n--- Growth since last tick (top "
                            f"{DELTA_LIMIT}, tracemalloc) ---\n")
                    
                    top_stats = snap.compare_to(prev_snapshot, 'traceback')
                    grew = [st for st in top_stats if st.size_diff > 0]
                    for i, stat in enumerate(grew[:DELTA_LIMIT], 1):
                        size_mb = stat.size_diff / 1024 / 1024
                        f.write(f"{i:4d}. +{size_mb:8.2f} MB | +{stat.count_diff:7d} blocks\n")
                        frames = stat.traceback.format()
                        for frame in frames[-5:]:
                            f.write(f"      {frame}\n")
                else:
                    f.write("\n\n--- Growth since last tick ---\n(first snapshot; deltas start next cycle)\n")

            try:
                bt.logging.debug(f"[memory monitor] wrote {dump_file}")
            except Exception:
                pass

            prev_snapshot = snap

in this validator file: https://github.com/macrocosm-os/data-universe/blob/main/neurons/validator.py
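(One caveat about the snippet above: for `snap.statistics('traceback')` to show multi-frame tracebacks like the ones below, `tracemalloc` must already be tracing with a frame depth greater than the default of 1. A minimal standalone sketch, separate from the validator code:)

```python
# Minimal tracemalloc sketch (not part of the validator; illustrative only).
# tracemalloc.start(nframe) controls how many frames each allocation
# traceback keeps; the default of 1 would make the per-traceback report
# show only a single frame.
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation traceback

# Allocate something identifiable: 100 KiB of bytearrays at one call site.
data = [bytearray(1024) for _ in range(100)]

snap = tracemalloc.take_snapshot()
stats = snap.statistics('traceback')

# The list-comprehension line above should appear as a single grouped
# traceback entry of at least 100 KiB.
assert any(st.size >= 100 * 1024 for st in stats)
assert all(len(st.traceback) >= 1 for st in stats)
```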

I noticed that the memory allocated at this specific site grows over time, by about 100-200 MB/h.

Here is a sample of our trace:

--- Top allocations by traceback (tracemalloc) ---
Total tracked: 204.62 MB across 1,654,155 blocks
   1.    40.71 MB |  117773 blocks
        File "/root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py", line 191
          decoder_class = self.get_decoder_class(type_string)
        File "/root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py", line 146
          decoder_class = type(type_string, (base_class,), {'sub_type': type_parts[1]})
        File "<frozen abc>", line 106
   2.    31.57 MB |  103389 blocks
        File "/root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py", line 191
          decoder_class = self.get_decoder_class(type_string)
        File "/root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py", line 146
          decoder_class = type(type_string, (base_class,), {'sub_type': type_parts[1]})
        File "<frozen abc>", line 106
   3.    12.38 MB |   41169 blocks
        File "/root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py", line 191
          decoder_class = self.get_decoder_class(type_string)
        File "/root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py", line 146
          decoder_class = type(type_string, (base_class,), {'sub_type': type_parts[1]})
        File "<frozen abc>", line 106
   4.     7.14 MB |   78860 blocks
          lines = getlines(filename, module_globals)
        File "/usr/lib/python3.12/linecache.py", line 46
          return updatecache(filename, module_globals)
        File "/usr/lib/python3.12/linecache.py", line 139
          lines = fp.readlines()
   5.     6.61 MB |   50976 blocks
          field_obj = self.process_type(data_type, metadata=self.metadata)
        File "/root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py", line 982
          obj = self.runtime_config.create_scale_object(type_string, self.data, **kwargs)
        File "/root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py", line 194
          return decoder_class(data=data, **kwargs)
   6.     6.52 MB |   58624 blocks
          field_obj = self.process_type(data_type, metadata=self.metadata)
        File "/root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py", line 982
          obj = self.runtime_config.create_scale_object(type_string, self.data, **kwargs)
        File "/root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py", line 194
          return decoder_class(data=data, **kwargs)

--- Memory usage by file/module (top 100, tracemalloc) --- 
85.73 MB | 323449 blocks | <frozen abc> 
42.77 MB | 580717 blocks | /root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py 
10.13 MB | 112466 blocks | /usr/lib/python3.12/linecache.py 
8.84 MB | 151935 blocks | /root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/types.py 
2.30 MB | 25633 blocks | /usr/lib/python3.12/tracemalloc.py 
1.88 MB | 33208 blocks | /root/testnet/data-universe/storage/validator/sqlite_memory_validator_storage.py
0.73 MB | 14107 blocks | /root/testnet/data-universe/raovenv/lib/python3.12/site-packages/pydantic/main.py

This grows steadily, up to:

--- Memory usage by file/module (top 100, tracemalloc) ---
  227.25 MB |  860861 blocks | <frozen abc>
  114.20 MB | 1550823 blocks | /root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py
   23.54 MB |  404877 blocks | /root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/types.py

After applying the patch in this PR, memory usage seems to have stabilized, or at least its growth has slowed significantly:

--- Memory usage by file/module (top 100, tracemalloc) ---
  110.53 MB | 1904509 blocks | /root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/base.py
   39.78 MB |  646299 blocks | /root/testnet/data-universe/raovenv/lib/python3.12/site-packages/scalecodec/types.py
   13.04 MB |  144492 blocks | /usr/lib/python3.12/linecache.py
    2.05 MB |   23764 blocks | /usr/lib/python3.12/tracemalloc.py

The problem seems to be that scalecodec dynamically creates a brand-new class for (almost) every parametric type string it encounters.

If you feed it lots of distinct type strings (or even repeat them without caching), this hammers the allocator and the ABC machinery (the `<frozen abc>` frames in the traces above) by repeatedly constructing subclasses.

A new class per subtype should not be needed; a decoder instance configured with the subtype would suffice.
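As an illustration of the caching idea (the names below are hypothetical, not scalecodec's actual API): memoizing the dynamically created class per type string turns the repeated `type(...)` calls into dictionary lookups, so each distinct parametric type string costs one class object, not one per decode:

```python
# Hypothetical sketch of class-creation caching; function and variable
# names are illustrative, not scalecodec's real API.
_decoder_class_cache = {}

def get_parametric_decoder_class(type_string, base_class, sub_type):
    """Return one class object per (type_string, base_class) pair,
    instead of building a fresh class on every call."""
    key = (type_string, base_class)
    cls = _decoder_class_cache.get(key)
    if cls is None:
        cls = type(type_string, (base_class,), {"sub_type": sub_type})
        _decoder_class_cache[key] = cls
    return cls

class Base:
    sub_type = None

a = get_parametric_decoder_class("Vec<u8>", Base, "u8")
b = get_parametric_decoder_class("Vec<u8>", Base, "u8")
assert a is b              # the same class object is reused
assert a.sub_type == "u8"  # the subtype is still carried on the class
```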

Right now I am not sure whether this is a problem in scalecodec itself, or a usage problem in async-substrate-interface or bittensor. This could potentially affect other bittensor subnets.
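One quick way to confirm that class objects are actually accumulating, independently of tracemalloc (a generic sketch, not tied to scalecodec internals), is to count live `type` objects tracked by the GC between two points in time:

```python
# Generic leak check: count live class objects via the GC.
# The dynamic-class loop below simulates the pattern, it is not scalecodec code.
import gc

def count_live_classes():
    """Count class objects currently tracked by the garbage collector."""
    return sum(1 for obj in gc.get_objects() if isinstance(obj, type))

before = count_live_classes()
leaked = [type(f"Dyn{i}", (object,), {}) for i in range(100)]  # simulate the leak
after = count_live_classes()

# The 100 dynamically created classes show up in the delta; a steadily
# rising count across validator cycles would confirm class accumulation.
assert after - before >= 100
```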
