
Conversation

acryl-hyejin
Collaborator

This PR extends the OpenAPI source connector to provide broader support for API metadata ingestion.

Key Changes & Rationale:

  • Support for PUT, POST, and PATCH methods: Previously, the connector primarily focused on GET operations. This change allows for the ingestion of metadata for non-GET endpoints, providing a more comprehensive view of an API's surface. To ensure safety, actual API calls are not made for these methods; instead, metadata is derived from the OpenAPI specification itself.
  • New get_operations_only configuration property: This boolean flag (defaulting to true for backward compatibility) lets users explicitly enable or disable processing of non-GET methods, giving them control over the scope of metadata ingestion (see the recipe sketch after this list).
  • JSON schema reading from Swagger definitions: To improve metadata quality and coverage, the connector now extracts detailed field information directly from OpenAPI/Swagger schema definitions, including $ref references, nested objects, and arrays. This is particularly useful for non-GET methods, where example data may be absent and no actual API calls are made (an illustrative spec fragment follows the summary below).
  • Updated audit stamp user: The default audit stamp user has been changed from urn:li:corpuser:etl to urn:li:corpuser:datahub to align with standard DataHub installations where the etl user may not exist by default.
  • Updated documentation and example recipes: The openapi.md documentation and openapi_recipe.yml have been updated to reflect these new capabilities and guide users on how to configure them. A new integration test recipe (openapi_extended_to_file.yml) has been added to demonstrate the extended functionality.
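
A minimal recipe sketch showing the new flag in context. The source name, URL, and swagger_file values here are placeholders, and the surrounding fields follow the existing openapi recipe format; only get_operations_only is introduced by this PR:

```yaml
# Hypothetical values; adapt to your own API endpoint.
source:
  type: openapi
  config:
    name: test_endpoint                        # prefix used for dataset names
    url: https://test_endpoint.com/            # base URL of the API
    swagger_file: classicapi/doc/swagger.json  # route to the OpenAPI/Swagger spec on the endpoint
    get_operations_only: false                 # new flag: set to false to also ingest PUT/POST/PATCH metadata

sink:
  type: file
  config:
    filename: ./openapi_mces.json
```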

This enhancement significantly improves the utility of the OpenAPI source connector by providing more complete and accurate metadata for a wider range of API operations.
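
To illustrate the kind of schema-derived metadata this enables, here is a hypothetical OpenAPI 3.x fragment (not taken from this PR) of the shape the connector can now read field information from, covering a non-GET operation, a $ref reference, a nested object, and an array:

```yaml
paths:
  /pet:
    post:                      # non-GET method: metadata is derived from the spec, no API call is made
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Pet'   # resolved via $ref
components:
  schemas:
    Pet:
      type: object
      properties:
        id:
          type: integer
        owner:                 # nested object
          type: object
          properties:
            name:
              type: string
        tags:                  # array of $ref'd items
          type: array
          items:
            $ref: '#/components/schemas/Tag'
    Tag:
      type: object
      properties:
        name:
          type: string
```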

Related Issue: OSS-416


Linear Issue: OSS-416



cursor bot commented Sep 5, 2025

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

github-actions bot added the ingestion (PR or Issue related to the ingestion of metadata) and community-contribution (PR or Issue raised by member(s) of DataHub Community) labels Sep 5, 2025

codecov bot commented Sep 5, 2025

❌ 5 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --------------- | ------ | ------ | ------- |
| 768             | 5      | 763    | 39      |
View the top 3 failed test(s) by shortest run time
tests.integration.dremio.test_dremio::test_dremio_platform_instance_urns
Stack Traces | 0.001s run time
docker_compose_runner = <function docker_compose_runner.<locals>.run at 0x7f183bdf4e00>
pytestconfig = <_pytest.config.Config object at 0x7f1918107e90>
test_resources_dir = PosixPath('.../tests/integration/dremio')

    @pytest.fixture(scope="module")
    def mock_dremio_service(docker_compose_runner, pytestconfig, test_resources_dir):
        # Spin up Dremio and MinIO (for mock S3) services using Docker Compose.
        with docker_compose_runner(
            test_resources_dir / "docker-compose.yml", "dremio"
        ) as docker_services:
            wait_for_port(docker_services, "dremio", 9047, timeout=120)
            wait_for_port(
                docker_services,
                "minio",
                MINIO_PORT,
                timeout=120,
                checker=lambda: is_minio_up("minio"),
            )
            wait_for_port(
                docker_services,
                "test_mysql",
                MYSQL_PORT,
                timeout=120,
                checker=lambda: is_mysql_up("test_mysql", MYSQL_PORT),
            )
    
            # Ensure the admin and data setup scripts have the right permissions
            subprocess.run(
                ["chmod", "+x", f"{test_resources_dir}/setup_dremio_admin.sh"], check=True
            )
    
            # Run the setup_dremio_admin.sh script
            admin_setup_cmd = f"{test_resources_dir}/setup_dremio_admin.sh"
            subprocess.run(admin_setup_cmd, shell=True, check=True)
    
>           install_mysql_client("dremio")

.../integration/dremio/test_dremio.py:443: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

container_name = 'dremio'

    def install_mysql_client(container_name: str) -> None:
        """
        This is bit hacky to install mysql-client and connect mysql to start-mysql in container
        """
    
        command = f'docker exec --user root {container_name} sh -c  "apt-get update && apt-get install -y mysql-client && ....../usr/bin/mysql -h test_mysql -u root -prootpwd123"'
        ret = subprocess.run(command, shell=True, stdout=subprocess.DEVNULL)
>       assert ret.returncode == 0
E       assert 100 == 0
E        +  where 100 = CompletedProcess(args='docker exec --user root dremio sh -c  "apt-get update && apt-get install -y mysql-client && ....../usr/bin/mysql -h test_mysql -u root -prootpwd123"', returncode=100).returncode

.../integration/dremio/test_dremio.py:60: AssertionError
tests.integration.dremio.test_dremio::test_dremio_schema_filter
Stack Traces | 0.001s run time
(Stack trace identical to test_dremio_platform_instance_urns above: the mock_dremio_service fixture fails in install_mysql_client("dremio") with return code 100 at .../integration/dremio/test_dremio.py:60.)
tests.integration.openapi.test_openapi::test_openapi_ingest
Stack Traces | 0.516s run time
Metadata files differ (use `pytest --update-golden-files` to update):
Urn changed, urn:li:dataset:(urn:li:dataPlatform:OpenApi,test_openapi.root,PROD):
<institutionalMemory> changed:
	Value of aspect['elements'][0]['createStamp']['actor'] changed from "urn:li:corpuser:etl" to "urn:li:corpuser:datahub".

Urn changed, urn:li:dataset:(urn:li:dataPlatform:OpenApi,test_openapi.v2,PROD):
<institutionalMemory> changed:
	Value of aspect['elements'][0]['createStamp']['actor'] changed from "urn:li:corpuser:etl" to "urn:li:corpuser:datahub".
tests.integration.openapi.test_openapi::test_openapi_3_1_ingest
Stack Traces | 0.645s run time
Metadata files differ (use `pytest --update-golden-files` to update):
Urn changed, urn:li:dataset:(urn:li:dataPlatform:OpenApi,test_openapi.board,PROD):
<institutionalMemory> changed:
	Value of aspect['elements'][0]['createStamp']['actor'] changed from "urn:li:corpuser:etl" to "urn:li:corpuser:datahub".
<schemaMetadata> added

Urn changed, urn:li:dataset:(urn:li:dataPlatform:OpenApi,test_openapi.board.{row}.{column},PROD):
<institutionalMemory> changed:
	Value of aspect['elements'][0]['createStamp']['actor'] changed from "urn:li:corpuser:etl" to "urn:li:corpuser:datahub".
<datasetProperties> changed:
	Value of aspect['description'] changed from "Places a mark on the board and retrieves the whole board and the winner (if any)." to "Retrieves the requested square.".
tests.integration.dremio.test_dremio::test_dremio_ingest
Stack Traces | 184s run time
(Stack trace identical to test_dremio_platform_instance_urns above: the mock_dremio_service fixture fails in install_mysql_client("dremio") with return code 100 at .../integration/dremio/test_dremio.py:60.)

