Efficiently merging tables on substrings rather than exact matches

-2

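I am merging two tables in R or Python, each with tens of thousands of rows. I can't merge on exact matches, though: I am looking for cases where a key in one table is a substring of a key in the other. The matching substring can contain multiple words. I'm looking for a solution that is faster than my brute-force code below.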

https://stackoverflow.com/users/170352/brandon-bertelsen gave a nice answer based on the toy data I originally proposed. However, it only matches single-word substrings. (I didn't state this requirement explicitly at first.)

This is the code I'm actually using for this situation.

library(SPARQL) 
library(parallel) 
library(Hmisc) 
library(tidyr) 
library(dplyr) 

my.endpoint <- "http://sparql.hegroup.org/sparql/" 

go.query <- 'select * 
where { graph <http://purl.obolibrary.org/obo/merged/GO> 
{ ?goid 
<http://www.geneontology.org/formats/oboInOwl#hasOBONamespace> 
"biological_process"^^<http://www.w3.org/2001/XMLSchema#string> . 
?goid rdfs:label ?goterm}}' 
go.result <- SPARQL(url = my.endpoint, query = go.query) 
go.result.frame <- go.result[[1]] 

anat.query <- 'select distinct ?anatterm ?anatid 
where { graph <http://purl.obolibrary.org/obo/merged/UBERON> 
{ ?anatid <http://www.geneontology.org/formats/oboInOwl#hasDbXref> ?xr . 
?anatid rdfs:label ?anatterm}}' 

anat.result <- SPARQL(url = my.endpoint, query = anat.query) 
anat.result.frame <- anat.result[[1]] 

# slow but recognizes multi-word substrings 
loop.solution <-
  mclapply(
    X = sort(anat.result.frame$anatid),
    mc.cores = 7,
    FUN = function(one.anat.id) {
      one.anat.term <-
        anat.result.frame$anatterm[anat.result.frame$anatid == one.anat.id]
      temp <-
        grepl(pattern = paste0('\\b', one.anat.term, '\\b'),
              x = go.result.frame$goterm)
      temp <- go.result.frame[temp, ]
      if (nrow(temp) > 0) {
        temp$anatterm <- one.anat.term
        temp$anatid <- one.anat.id
        return(temp)
      }
    }
  )

loop.solution <- do.call(rbind, loop.solution) 

# from Brandon 
# fast, but doesn't recognize multi-word matches 
sep.gather.soln <- 
    separate(go.result.frame, 
      goterm, 
      letters, 
      sep = " ", 
      remove = FALSE) %>% 
    gather(goid, goterm) %>% 
    na.omit() %>% 
    setNames(c("goid", "goterm", "code", "anatterm")) %>% 
    select(goid, goterm, anatterm) %>% 
    left_join(anat.result.frame) %>% 
    na.omit()

Source

2016-08-05 Mark Miller

set(term + ' ' + cat for term in terms for cat in set([dictionary.get(word, '') for word in term.split()]) if cat) – galaxyan

Thanks. I'm having trouble figuring out where the line breaks and indentation go in this code. –

It's a single line of code. – galaxyan

library(tidyr) 
library(dplyr) 

df1 <- data.frame(
    mealtime = c("breakfast","lunch","dinner","dinner"), 
    dish = c(
    "cheese omelette", 
    "turkey sandwich", 
    "bean soup", 
    "something very long like this") 
) 

df2 <- read.table(textConnection(
'ingredient category 
bean  legume 
beef  meat 
carrot  vegetable 
cheese  dairy 
milk  dairy 
omelette eggs 
sandwich bread 
turkey  meat'), header = TRUE) 

df1 <- separate(df1, dish, letters, sep = " ", remove = FALSE) %>% 
    gather(mealtime, dish) %>% 
    na.omit() %>% setNames(c("mealtime","dish","code","ingredient")) %>% 
    select(mealtime, dish, ingredient) %>% 
    left_join(df2) %>% na.omit() 

df1

   mealtime            dish ingredient category
1 breakfast cheese omelette     cheese    dairy
2     lunch turkey sandwich     turkey     meat
3    dinner       bean soup       bean   legume
5 breakfast cheese omelette   omelette     eggs
6     lunch turkey sandwich   sandwich    bread

Source

2016-08-05 19:08:20

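Thanks, Brandon. Sorry, my sample tables were unclear. The crux of my problem is that there are no exact matches between the join columns, so a standard merge won't work. –

No, not necessarily two words. I'm trying to match terms across various biomedical ontologies. For example, a match between "inhibition of kidney epithelium development" and "kidney epithelium", or between "sudden heart attack" and "heart". –

Changed the approach: melt all possibilities, drop the misses, merge, drop the misses again. –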
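The kind of match described in the comments above (a key that appears inside a longer term as a run of whole words) can be illustrated with a small Python sketch. It mirrors the `\b`-anchored pattern used in the question's R loop; the function name `contains_phrase` is made up for illustration, and escaping the key with `re.escape` is an added assumption that keys should be treated as literal text rather than as regular expressions.

```python
import re

def contains_phrase(text, phrase):
    """Return True if `phrase` occurs in `text` bounded by word boundaries."""
    # Assumption: the key is literal text, so regex metacharacters are escaped.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text) is not None

# The two examples from the comment above, plus a partial-word non-match.
print(contains_phrase("inhibition of kidney epithelium development", "kidney epithelium"))
print(contains_phrase("sudden heart attack", "heart"))
print(contains_phrase("sudden heart attack", "hear"))
```

Applied to every pair of rows this is still the brute-force approach; it only pins down what counts as a match.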

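I'm using the data from your original post.
First, split each term into words.
Second, look up each word in the dictionary.
Third, join the term with each related category that was found.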

terms = ["cheese omelette", "turkey sandwich", "bean soup"]
dictionary = {'turkey': 'meat', 'cheese': 'dairy', 'sandwich': 'bread', 'beef': 'meat', 'omelette': 'eggs', 'bean': 'legume', 'carrot': 'vegetable', 'milk': 'dairy'}

# for each term, look up every word and keep the non-empty categories
res = set(term + ' ' + cat for term in terms for cat in set(dictionary.get(word, '') for word in term.split()) if cat)

# res is a set, so the print order may vary
for i in res:
    print(i)


output: 
cheese omelette dairy 
bean soup legume 
turkey sandwich meat 
turkey sandwich bread 
cheese omelette eggs

Source

2016-08-12 21:29:52 galaxyan

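Thanks for posting an answer. Yes, it works well when the key in each dictionary pair is a single word. My original requirements didn't make it clear enough that keys can consist of multiple words, so I have updated the question. Sorry for the moving target. I'm currently using my loop solution in production, but there are still no suggestions that handle multi-word keys. –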
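One way the single-word dictionary lookup above might be extended to multi-word keys is sketched below. This is not a solution proposed in the thread. It indexes each key by its tuple of words, then checks every contiguous run of words in a term (up to the longest key length) against that index, so each term costs a handful of hash lookups instead of one regex scan per key. The names `build_index` and `match_terms` and the ids `A1` to `A4` are hypothetical. The sketch also assumes that words are separated by whitespace and that matching is case-sensitive, so punctuation attached to a word would prevent a match that the `\b` regex would find.

```python
def build_index(keys):
    """Map each key's tuple of words to its id (keys: dict of key text -> id)."""
    return {tuple(text.split()): key_id for text, key_id in keys.items()}

def match_terms(terms, keys):
    """Return (term, matched key, key id) for every key found as whole words in a term."""
    index = build_index(keys)
    max_len = max((len(gram) for gram in index), default=0)
    matches = []
    for term in terms:
        words = term.split()
        seen = set()  # report each key at most once per term
        for start in range(len(words)):
            longest = min(max_len, len(words) - start)
            for size in range(1, longest + 1):
                gram = tuple(words[start:start + size])
                if gram in index and gram not in seen:
                    seen.add(gram)
                    matches.append((term, " ".join(gram), index[gram]))
    return matches

# Hypothetical toy data based on the examples in the comments above.
terms = ["inhibition of kidney epithelium development", "sudden heart attack"]
keys = {"kidney epithelium": "A1", "kidney": "A2", "heart": "A3", "lung": "A4"}

for row in match_terms(terms, keys):
    print(row)
```

The work per term grows with the number of words times the longest key length, not with the number of keys, which is where a speed-up over the loop solution would come from. Whether it helps on the real ontology data has not been measured here.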
