Commit 15a33fa

feat(ingest): lookml - support for git checkout (#5924)
1 parent 77d7456 commit 15a33fa

File tree

11 files changed (+226, -30 lines)


.github/workflows/metadata-ingestion.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -52,6 +52,7 @@ jobs:
     env:
       SPARK_VERSION: 3.0.3
       DATAHUB_TELEMETRY_ENABLED: false
+      DATAHUB_LOOKML_GIT_TEST_SSH_KEY: ${{ secrets.DATAHUB_LOOKML_GIT_TEST_SSH_KEY }}
     strategy:
       matrix:
         python-version: ["3.7", "3.10"]
```

metadata-ingestion/docs/sources/looker/lookml_pre.md

Lines changed: 40 additions & 13 deletions
````diff
@@ -1,20 +1,53 @@
 ### Pre-requisites
 
+#### [Optional] Create a GitHub Deploy Key
+
+To use LookML ingestion through the UI, or automate github checkout through the cli, you must set up a GitHub deploy key for your Looker GitHub repository. Read [this](https://docs.github.com/en/developers/overview/managing-deploy-keys#deploy-keys) document for how to set up deploy keys for your Looker git repo.
+
+In a nutshell, there are three steps:
+1. Generate a private-public ssh key pair. This will typically generate two files, e.g. looker_datahub_deploy_key (this is the private key) and looker_datahub_deploy_key.pub (this is the public key)
+![Image](https://github.com/datahub-project/static-assets/blob/main/imgs/gitssh/ssh-key-generation.png)
+
+2. Add the public key to your Looker git repo as a deploy key with read access (no need to provision write access). Follow the guide [here](https://docs.github.com/en/developers/overview/managing-deploy-keys#deploy-keys) for that.
+![Image](https://github.com/datahub-project/static-assets/blob/main/imgs/gitssh/git-deploy-key.png)
+
+3. Provision the private key (paste the contents of the file) as a secret on the DataHub Ingestion UI. Give it a handy name like **LOOKER_DATAHUB_DEPLOY_KEY**
+![Image](https://github.com/datahub-project/static-assets/blob/main/imgs/gitssh/datahub-secret.png)
+
 #### [Optional] Create an API key
 
 See the [Looker authentication docs](https://docs.looker.com/reference/api-and-integration/api-auth#authentication_with_an_sdk) for the steps to create a client ID and secret.
 You need to ensure that the API key is attached to a user that has Admin privileges.
 
 If that is not possible, read the configuration section and provide an offline specification of the `connection_to_platform_map` and the `project_name`.
 
-### Ingestion through UI
+### Ingestion Options
+
+You have 3 options for controlling where your ingestion of LookML is run.
+- The DataHub UI (recommended for the easiest out-of-the-box experience)
+- As a GitHub Action (recommended to ensure that you have the freshest metadata pushed on change)
+- Using the CLI (scheduled via an orchestrator like Airflow)
+
+Read on to learn more about these options.
+
+#### Ingestion through UI
 
-Ingestion using lookml connector is not supported through the UI.
-However, you can set up ingestion using a GitHub Action to push metadata whenever your main lookml repo changes.
+To ingest LookML metadata through the UI, you must set up a GitHub deploy key using the instructions in the section [above](#optional-create-a-github-deploy-key). Once that is complete, you can follow the on-screen instructions to set up a LookML source using the Ingestion page.
+
+### Ingestion using GitHub Action
+
+You can set up ingestion using a GitHub Action to push metadata whenever your main lookml repo changes.
+The following sample github action file can be modified to emit LookML metadata whenever there is a change to your repository. This ensures that metadata is already fresh and up to date.
 
 #### Sample GitHub Action
 
 Drop this file into your `.github/workflows` directory inside your Looker github repo.
+You need to set up the following secrets in your GitHub repository to get this workflow to work:
+- DATAHUB_GMS_HOST: The endpoint where your DataHub host is running
+- DATAHUB_TOKEN: An authentication token provisioned for DataHub ingestion
+- LOOKER_BASE_URL: The base url where your Looker assets are hosted (e.g. https://acryl.cloud.looker.com)
+- LOOKER_CLIENT_ID: A provisioned Looker Client ID
+- LOOKER_CLIENT_SECRET: A provisioned Looker Client Secret
 
 ```
 name: lookml metadata upload
@@ -25,12 +58,6 @@ on:
     paths-ignore:
       - "docs/**"
      - "**.md"
-  pull_request:
-    branches:
-      - main
-    paths-ignore:
-      - "docs/**"
-      - "**.md"
   release:
     types: [published, edited]
   workflow_dispatch:
@@ -59,9 +86,9 @@ jobs:
             branch: ${{ github.ref }}
             # Options
             #connection_to_platform_map:
-            #  acryl-snow: snowflake
-            #platform: snowflake
-            #default_db: DEMO_PIPELINE
+            #  connection-name:
+            #platform: platform-name (e.g. snowflake)
+            #default_db: default-db-name (e.g. DEMO_PIPELINE)
             api:
               client_id: ${LOOKER_CLIENT_ID}
               client_secret: ${LOOKER_CLIENT_SECRET}
@@ -76,7 +103,7 @@ jobs:
         env:
           DATAHUB_GMS_HOST: ${{ secrets.DATAHUB_GMS_HOST }}
          DATAHUB_TOKEN: ${{ secrets.DATAHUB_TOKEN }}
-          LOOKER_BASE_URL: https://acryl.cloud.looker.com # <--- replace with your Looker base URL
+          LOOKER_BASE_URL: ${{ secrets.LOOKER_BASE_URL }}
           LOOKER_CLIENT_ID: ${{ secrets.LOOKER_CLIENT_ID }}
           LOOKER_CLIENT_SECRET: ${{ secrets.LOOKER_CLIENT_SECRET }}
 ```
````
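Step 1 of the deploy-key instructions above can be done with a single `ssh-keygen` invocation. A minimal sketch; the file name and comment are illustrative, and the ed25519 key type is a common choice rather than anything the docs mandate:

```shell
# Generate a passphrase-less ed25519 keypair for use as a read-only deploy key.
# Produces looker_datahub_deploy_key (private) and looker_datahub_deploy_key.pub (public).
ssh-keygen -t ed25519 -N "" -C "datahub-lookml" -f looker_datahub_deploy_key
```

The `.pub` half goes into the repo's deploy-key settings; the private half is what gets pasted into the DataHub secret.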

metadata-ingestion/docs/sources/looker/lookml_recipe.yml

Lines changed: 6 additions & 5 deletions
```diff
@@ -1,8 +1,13 @@
 source:
   type: "lookml"
   config:
+    # GitHub Coordinates: Used to check out the repo locally and add github links on the dataset's entity page.
+    github_info:
+      repo: org/repo-name
+      deploy_key_file: ${LOOKER_DEPLOY_KEY_FILE} # file containing the private ssh key for a deploy key for the looker git repo
+
     # Coordinates
-    base_folder: /path/to/model/files
+    # base_folder: /path/to/model/files ## Optional if you are not able to provide a GitHub deploy key
 
     # Options
     api:
@@ -28,10 +33,6 @@ source:
       # default_schema: DEFAULT_SCHEMA # the default schema configured for this connection
       # platform_instance: bq_warehouse # optional
       # platform_env: DEV # optional
-
-    # Optional additional github information. Used to add github links on the dataset's entity page.
-    github_info:
-      repo: org/repo-name
 # Default sink is datahub-rest and doesn't need to be configured
 # See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options
```

metadata-ingestion/docs/sources/snowflake/snowflake-usage_recipe.yml

Lines changed: 3 additions & 2 deletions
```diff
@@ -22,5 +22,6 @@ source:
     deny:
       - "information_schema.*"
 
-sink:
-  # sink configs
+# Default sink is datahub-rest and doesn't need to be configured
+# See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options
+
```

metadata-ingestion/docs/sources/snowflake/snowflake_recipe.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -49,5 +49,5 @@ source:
     deny:
       - '.*information_schema.*'
 
-sink:
-  # sink configs
+# Default sink is datahub-rest and doesn't need to be configured
+# See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options
```
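For reference, if you do want an explicit sink instead of relying on the default, a typical `datahub-rest` stanza looks like this (the server URL is a placeholder for your own GMS endpoint):

```yml
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080 # <-- your DataHub GMS endpoint
```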

metadata-ingestion/setup.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -59,6 +59,7 @@ def get_long_description():
     "cached_property",
     "ijson",
     "click-spinner",
+    "GitPython>2",
 }
 
 kafka_common = {
@@ -135,7 +136,7 @@ def get_long_description():
 
 looker_common = {
     # Looker Python SDK
-    "looker-sdk==22.2.1"
+    "looker-sdk==22.2.1",
 }
 
 bigquery_common = {
```

metadata-ingestion/src/datahub/configuration/github.py
Lines changed: 52 additions & 4 deletions
```diff
@@ -1,25 +1,73 @@
-from pydantic import Field, validator
+import os
+from typing import Any, Dict, Optional
+
+from pydantic import Field, FilePath, SecretStr, validator
 
 from datahub.configuration.common import ConfigModel, ConfigurationError
 
 
 class GitHubInfo(ConfigModel):
     repo: str = Field(
-        description="Name of your github repo. e.g. repo for https://github.com/datahub-project/datahub is `datahub-project/datahub`."
+        description="Name of your github repository in org/repo format. e.g. repo for https://github.com/datahub-project/datahub is `datahub-project/datahub`."
     )
     branch: str = Field(
         "main",
         description="Branch on which your files live by default. Typically main or master.",
     )
-    base_url: str = Field("https://github.com", description="Base url for Github")
+    base_url: str = Field(
+        "https://github.com",
+        description="Base url for Github. Used to construct clickable links on the UI.",
+    )
+
+    deploy_key_file: Optional[FilePath] = Field(
+        None,
+        description="A private key file that contains an ssh key that has been configured as a deploy key for this repository. Use a file where possible, else see deploy_key for a config field that accepts a raw string.",
+    )
+    deploy_key: Optional[SecretStr] = Field(
+        None,
+        description="A private key that contains an ssh key that has been configured as a deploy key for this repository. See deploy_key_file if you want to use a file that contains this key.",
+    )
+
+    repo_ssh_locator: Optional[str] = Field(
+        None,
+        description="Auto-inferred from repo as git@github.com:{repo}, but you can override this if needed.",
+    )
 
     @validator("repo")
     def repo_should_be_org_slash_repo(cls, repo: str) -> str:
         if "/" not in repo or len(repo.split("/")) != 2:
             raise ConfigurationError(
-                "github repo should be in organization/repo form e.g. acryldata/datahub-helm"
+                "github repo should be in organization/repo form e.g. datahub-project/datahub"
             )
         return repo
 
     def get_url_for_file_path(self, file_path: str) -> str:
         return f"{self.base_url}/{self.repo}/blob/{self.branch}/{file_path}"
+
+    @validator("deploy_key_file")
+    def deploy_key_file_should_be_readable(
+        cls, v: Optional[FilePath]
+    ) -> Optional[FilePath]:
+        if v is not None:
+            # pydantic does existence checks, we just need to check if we can read it
+            if not os.access(v, os.R_OK):
+                raise ValueError(f"Unable to read deploy key file {v}")
+        return v
+
+    @validator("deploy_key", pre=True, always=True)
+    def deploy_key_filled_from_deploy_key_file(
+        cls, v: Optional[SecretStr], values: Dict[str, Any]
+    ) -> Optional[SecretStr]:
+        if v is None:
+            deploy_key_file = values.get("deploy_key_file")
+            if deploy_key_file is not None:
+                with open(deploy_key_file, "r") as fp:
+                    deploy_key = SecretStr(fp.read())
+                return deploy_key
+        return v
+
+    @validator("repo_ssh_locator", always=True)
+    def auto_infer_from_repo(cls, v: Optional[str], values: Dict[str, Any]) -> str:
+        if v is None:
+            return f"git@github.com:{values.get('repo')}"
+        return v
```
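The two validators added above implement simple fallbacks: an inline `deploy_key` wins, otherwise the key material is read from `deploy_key_file`; and `repo_ssh_locator` defaults to a `git@github.com:` locator built from `repo`. A pydantic-free sketch of that logic (the function names here are illustrative, not part of DataHub's API):

```python
import os
import tempfile
from typing import Optional


def resolve_deploy_key(
    deploy_key: Optional[str], deploy_key_file: Optional[str]
) -> Optional[str]:
    # Prefer the explicitly provided key; fall back to reading the key file.
    if deploy_key is not None:
        return deploy_key
    if deploy_key_file is not None:
        with open(deploy_key_file, "r") as fp:
            return fp.read()
    return None


def infer_ssh_locator(repo: str) -> str:
    # Mirrors the repo_ssh_locator default for an org/repo string.
    return f"git@github.com:{repo}"


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write("FAKE-KEY")
    assert resolve_deploy_key(None, path) == "FAKE-KEY"
    assert resolve_deploy_key("inline", path) == "inline"
    assert infer_ssh_locator("datahub-project/datahub") == "git@github.com:datahub-project/datahub"
    os.unlink(path)
```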

metadata-ingestion/src/datahub/ingestion/source/git/__init__.py

Whitespace-only changes.

metadata-ingestion/src/datahub/ingestion/source/git/git_import.py
Lines changed: 41 additions & 0 deletions
```python
import logging
import os
import pathlib
from pathlib import Path
from uuid import uuid4

import git
from pydantic import SecretStr

logger = logging.getLogger(__name__)


class GitClone:
    def __init__(self, tmp_dir):
        self.tmp_dir = tmp_dir

    def clone(self, ssh_key: SecretStr, repo_url: str) -> Path:
        unique_dir = str(uuid4())
        keys_dir = f"{self.tmp_dir}/{unique_dir}/keys"
        checkout_dir = f"{self.tmp_dir}/{unique_dir}/checkout"
        dirs_to_create = [keys_dir, checkout_dir]
        for d in dirs_to_create:
            os.makedirs(d, exist_ok=True)
        git_ssh_identity_file = os.path.join(keys_dir, "ssh_key")
        with open(
            git_ssh_identity_file,
            "w",
            opener=lambda path, flags: os.open(path, flags, 0o600),
        ) as fp:
            fp.write(ssh_key.get_secret_value())

        git_ssh_cmd = f"ssh -i {git_ssh_identity_file}"
        logger.debug("ssh_command=%s", git_ssh_cmd)
        logger.info(f"⏳ Cloning repo '{repo_url}', this can take some time...")
        git.Repo.clone_from(
            repo_url,
            checkout_dir,
            env=dict(GIT_SSH_COMMAND=git_ssh_cmd),
        )
        logger.info("✅ Cloning complete!")
        return pathlib.Path(checkout_dir)
```
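One detail worth calling out in `GitClone.clone`: the `opener=` argument makes `os.open` create the key file with mode 0600 from the start, rather than `chmod`-ing after the write (ssh refuses private keys that are group- or world-readable, and a post-hoc chmod would leave a window with a looser mode under a permissive umask). A standalone sketch of just that technique:

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    key_path = os.path.join(tmp, "ssh_key")
    # The opener replaces the default os.open call, so the file is born
    # with restrictive permissions; there is no window with a looser mode.
    with open(
        key_path, "w", opener=lambda path, flags: os.open(path, flags, 0o600)
    ) as fp:
        fp.write("fake key material")
    mode = stat.S_IMODE(os.stat(key_path).st_mode)
    assert mode & 0o077 == 0, f"key file too permissive: {oct(mode)}"
```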

metadata-ingestion/src/datahub/ingestion/source/looker/lookml_source.py

Lines changed: 47 additions & 3 deletions
```diff
@@ -4,7 +4,9 @@
 import pathlib
 import re
 import sys
+import tempfile
 from dataclasses import dataclass, field as dataclass_field, replace
+from datetime import datetime, timedelta
 from typing import Any, Dict, Iterable, List, Optional, Set, Tuple, Type
 
 import pydantic
@@ -16,6 +18,7 @@
 import datahub.emitter.mce_builder as builder
 from datahub.configuration import ConfigModel
 from datahub.configuration.common import AllowDenyPattern, ConfigurationError
+from datahub.configuration.github import GitHubInfo
 from datahub.configuration.source_common import EnvBasedSourceConfigBase
 from datahub.emitter.mcp import MetadataChangeProposalWrapper
 from datahub.ingestion.api.common import PipelineContext
@@ -28,6 +31,7 @@
 from datahub.ingestion.api.registry import import_path
 from datahub.ingestion.api.source import Source, SourceReport
 from datahub.ingestion.api.workunit import MetadataWorkUnit
+from datahub.ingestion.source.git.git_import import GitClone
 from datahub.ingestion.source.looker.looker_common import (
     LookerCommonConfig,
     LookerExplore,
@@ -139,8 +143,9 @@ def from_looker_connection(
 
 
 class LookMLSourceConfig(LookerCommonConfig):
-    base_folder: pydantic.DirectoryPath = Field(
-        description="Local filepath where the root of the LookML repo lives. This is typically the root folder where the `*.model.lkml` and `*.view.lkml` files are stored. e.g. If you have checked out your LookML repo under `/Users/jdoe/workspace/my-lookml-repo`, then set `base_folder` to `/Users/jdoe/workspace/my-lookml-repo`."
+    base_folder: Optional[pydantic.DirectoryPath] = Field(
+        None,
+        description="Required if not providing github configuration and deploy keys. A pointer to a local directory (accessible to the ingestion system) where the root of the LookML repo has been checked out (typically via a git clone). This is typically the root folder where the `*.model.lkml` and `*.view.lkml` files are stored. e.g. If you have checked out your LookML repo under `/Users/jdoe/workspace/my-lookml-repo`, then set `base_folder` to `/Users/jdoe/workspace/my-lookml-repo`.",
     )
     connection_to_platform_map: Optional[Dict[str, LookerConnectionDefinition]] = Field(
         None,
@@ -225,9 +230,25 @@ def check_either_project_name_or_api_provided(cls, values):
             )
         return values
 
+    @validator("base_folder", always=True)
+    def check_base_folder_if_not_provided(
+        cls, v: Optional[pydantic.DirectoryPath], values: Dict[str, Any]
+    ) -> Optional[pydantic.DirectoryPath]:
+        if v is None:
+            github_info: Optional[GitHubInfo] = values.get("github_info", None)
+            if github_info and github_info.deploy_key:
+                # We have github_info populated correctly, base folder is not needed
+                pass
+            else:
+                raise ValueError(
+                    "base_folder is not provided. Neither has a github deploy_key or deploy_key_file been provided"
+                )
+        return v
+
 
 @dataclass
 class LookMLSourceReport(SourceReport):
+    git_clone_latency: Optional[timedelta] = None
     models_discovered: int = 0
     models_dropped: List[str] = dataclass_field(default_factory=list)
     views_discovered: int = 0
@@ -970,6 +991,7 @@ def _get_upstream_lineage(
         return None
 
     def _get_custom_properties(self, looker_view: LookerView) -> DatasetPropertiesClass:
+        assert self.source_config.base_folder  # this is always filled out
         file_path = str(
             pathlib.Path(looker_view.absolute_file_path).relative_to(
                 self.source_config.base_folder.resolve()
@@ -1063,7 +1085,29 @@ def get_project_name(self, model_name: str) -> str:
                 f"Could not locate a project name for model {model_name}. Consider configuring a static project name in your config file"
             )
 
-    def get_workunits(self) -> Iterable[MetadataWorkUnit]:  # noqa: C901
+    def get_workunits(self) -> Iterable[MetadataWorkUnit]:
+        if not self.source_config.base_folder:
+            assert self.source_config.github_info
+            # we don't have a base_folder, so we need to clone the repo and process it locally
+            start_time = datetime.now()
+            with tempfile.TemporaryDirectory("lookml_tmp") as tmp_dir:
+                git_clone = GitClone(tmp_dir)
+                # github info deploy key is always populated
+                assert self.source_config.github_info.deploy_key
+                assert self.source_config.github_info.repo_ssh_locator
+                checkout_dir = git_clone.clone(
+                    ssh_key=self.source_config.github_info.deploy_key,
+                    repo_url=self.source_config.github_info.repo_ssh_locator,
+                )
+                self.reporter.git_clone_latency = datetime.now() - start_time
+                self.source_config.base_folder = checkout_dir.resolve()
+                yield from self.get_internal_workunits()
+        else:
+            yield from self.get_internal_workunits()
+
+    def get_internal_workunits(self) -> Iterable[MetadataWorkUnit]:  # noqa: C901
+        assert self.source_config.base_folder
+
         viewfile_loader = LookerViewFileLoader(
             str(self.source_config.base_folder), self.reporter
         )
```
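The control flow added to `get_workunits` deserves a note: when no `base_folder` is configured, the source clones into a `TemporaryDirectory`, records the clone latency, and then reuses the normal ingestion path against the checkout, and it keeps yielding from *inside* the `with` block so the checkout is not deleted while work units are still being produced. A minimal sketch of that dispatch with a stubbed-out clone (all names here are illustrative, not DataHub's API):

```python
import tempfile
from datetime import datetime, timedelta
from pathlib import Path
from typing import Iterator, Optional


def fake_clone(tmp_dir: str) -> Path:
    # Stand-in for GitClone.clone: create a checkout with one LookML file.
    checkout = Path(tmp_dir) / "checkout"
    checkout.mkdir()
    (checkout / "orders.model.lkml").write_text("# lookml")
    return checkout


def get_workunits(base_folder: Optional[Path]) -> Iterator[str]:
    if base_folder is None:
        start = datetime.now()
        with tempfile.TemporaryDirectory("lookml_tmp") as tmp_dir:
            checkout_dir = fake_clone(tmp_dir)
            clone_latency = datetime.now() - start  # reported as git_clone_latency
            assert clone_latency >= timedelta(0)
            # Yield inside the context manager: the temporary checkout must
            # outlive the consumers of this generator.
            yield from (p.name for p in checkout_dir.glob("*.lkml"))
    else:
        yield from (p.name for p in base_folder.glob("*.lkml"))
```

Exhausting the generator (e.g. `list(get_workunits(None))`) is what finally lets the `with` block exit and clean up the temporary checkout.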
