Checkmate544
Contributor

fix: (xinference) C:\Windows\System32>xinference-local --host 127.0.0.1 -p 9997
Traceback (most recent call last):
File "C:\tools\anaconda3\envs\xinference\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\tools\anaconda3\envs\xinference\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\tools\anaconda3\envs\xinference\Scripts\xinference-local.exe\__main__.py", line 7, in <module>
sys.exit(local())
File "C:\tools\anaconda3\envs\xinference\lib\site-packages\click\core.py", line 1161, in __call__
return self.main(*args, **kwargs)
File "C:\tools\anaconda3\envs\xinference\lib\site-packages\click\core.py", line 1082, in main
rv = self.invoke(ctx)
File "C:\tools\anaconda3\envs\xinference\lib\site-packages\click\core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\tools\anaconda3\envs\xinference\lib\site-packages\click\core.py", line 788, in invoke
return __callback(*args, **kwargs)
File "C:\tools\anaconda3\envs\xinference\lib\site-packages\xinference\deploy\cmdline.py", line 228, in local
start_local_cluster(
File "C:\tools\anaconda3\envs\xinference\lib\site-packages\xinference\deploy\cmdline.py", line 115, in start_local_cluster
main(
File "C:\tools\anaconda3\envs\xinference\lib\site-packages\xinference\deploy\local.py", line 123, in main
raise RuntimeError("Cluster is not available after multiple attempts")
RuntimeError: Cluster is not available after multiple attempts
2025-06-16 14:22:07,632 xinference.core.supervisor 11204 INFO Xinference supervisor 127.0.0.1:10591 started
2025-06-16 14:22:07,678 xinference.core.worker 11204 INFO Starting metrics export server at 127.0.0.1:None
2025-06-16 14:22:07,684 xinference.core.worker 11204 INFO Checking metrics export server...
2025-06-16 14:22:13,178 xinference.core.worker 11204 INFO Metrics server is started at: http://127.0.0.1:65388
2025-06-16 14:22:13,178 xinference.core.worker 11204 INFO Purge cache directory: C:\Users\hjpxm\.xinference\cache
2025-06-16 14:22:13,178 xinference.core.worker 11204 INFO Connected to supervisor as a fresh worker
2025-06-16 14:22:13,210 xinference.core.worker 11204 INFO Xinference worker 127.0.0.1:10591 started

@XprobeBot XprobeBot added the bug Something isn't working label Jun 16, 2025
@XprobeBot XprobeBot added this to the v1.x milestone Jun 16, 2025
Contributor

qinxuye commented Jun 17, 2025

Thanks.

I am wondering whether we can find a more elegant way to know whether the cluster can actually start.

Contributor

qinxuye commented Jun 18, 2025

I removed the time.sleep and now try to get information from the subprocess before doing the health check. By default it waits up to 10 seconds for readiness or failure info.

Can you check whether this modification works for you?
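
For readers following along, the idea of waiting on a readiness or failure signal from the child, instead of sleeping for a fixed time, can be sketched roughly as below. This is only an illustration under stated assumptions, not the code merged in this PR: the function names `start_and_wait` and `_pump`, and the one-line `READY` / `FAILED: ...` stdout protocol, are hypothetical. The 10-second default timeout mirrors the default mentioned above.

```python
import queue
import subprocess
import sys
import threading


def _pump(stream, out):
    """Forward each line of the child's stdout to a queue; None marks EOF."""
    for line in stream:
        out.put(line.strip())
    out.put(None)


def start_and_wait(child_code, timeout=10.0):
    """Start a child Python process and wait for its first status line.

    Hypothetical protocol: the child prints "READY" once it is up, or
    "FAILED: <reason>" if startup fails. Returns (status, detail, proc),
    where status is "ready", "failed", or "timeout".
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        text=True,
    )
    lines = queue.Queue()
    # A reader thread lets the parent wait with a timeout on any platform.
    threading.Thread(target=_pump, args=(proc.stdout, lines), daemon=True).start()

    try:
        first = lines.get(timeout=timeout)
    except queue.Empty:
        # No signal in time: stop the child instead of polling blindly.
        proc.kill()
        proc.wait()
        return "timeout", None, proc

    if first is None:
        # stdout closed before any status line was written.
        proc.wait()
        return "failed", "child exited without reporting", proc
    if first == "READY":
        return "ready", None, proc
    # In this sketch, any other first line is treated as a failure report.
    return "failed", first, proc
```

With something like this, the parent would run its health check only after a `"ready"` status, and could surface the child's own failure reason right away instead of retrying until "Cluster is not available after multiple attempts".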

@qinxuye qinxuye changed the title fix: improve local cluster startup reliability with extended health c… ENH: improve local cluster startup reliability via child-process readiness signaling Jun 19, 2025
@XprobeBot XprobeBot added enhancement New feature or request and removed bug Something isn't working labels Jun 19, 2025
@qinxuye qinxuye merged commit c153d23 into xorbitsai:main Jun 19, 2025
23 of 24 checks passed
Labels
enhancement New feature or request
4 participants