Joining two data frames with left_join


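I am trying to work with two data frames in R (df_a and df_b); essentially, I want to repopulate df_b with the updated data contained in df_a. The columns in df_b all appear in df_a. In df_b there is (substantial) redundancy in ref_transcript_name, ref_transcript_id, and ref_gene_name, but the values of qry_transcript_id are all unique and have a one-to-one relationship with df_a. My assumption was that left_join() would do the trick. I have tried: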

  1. df_c <- left_join(df_a,df_b) - here df_c = df_b
  2. df_c <- left_join(df_a,df_b,by = "qry_transcript_id") - here df_c contains the three non-key columns of df_b as new columns of df_c.

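I am clearly missing something basic about the join functions here, but essentially I want to fill the (mostly) missing values in df_b with the values from df_a.

Here is my data: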

dput(df_a)
structure(list(ref_gene_id = c("LOC108906895",NA,"LOC108906894","LOC108906889","LOC108906897","LOC108906891","LOC108906890","LOC108906896","LOC108906893","LOC108906892","LOC108905349","LOC108905394","LOC108905439","LOC108905350","LOC108905395","LOC108905377","LOC108905399","LOC108905452","LOC108905450","LOC108905425","LOC108905427","LOC108905429","LOC108905426","LOC108905352","LOC108905375","LOC108905391",NA),qry_gene_id = structure(c(1L,2L,3L,4L,5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,16L,17L,18L,19L,20L,21L,23L,22L,24L,27L,25L,26L,28L,29L,30L,31L,32L),.Label = c("G229","G230","G232","G233","G234","G235","G236","G237","G238","G239","G240","G241","G242","G243","G244","G245","G246","G247","G248","G249","G250","G251","G252","G253","G254","G255","G256","G257","G258","G259","G260","G261"),class = "factor"),ref_gene_name = c("uncharacterized LOC108906895","uncharacterized LOC108906894","myosin regulatory light chain sqh","sushi,von Willebrand factor type A,EGF and pentraxin domain-containing protein 1","uncharacterized LOC108906891","protein twisted gastrulation","paraplegin","fork head domain-containing protein crocodile","forkhead box protein F1-like","centrosomal protein of 135 kDa-like","nuclear transcription factor Y subunit alpha","homeodomain-interacting protein kinase 2","myb-like protein X","uncharacterized LOC108905395","uncharacterized LOC108905377","uncharacterized LOC108905399","uncharacterized LOC108905452","uncharacterized LOC108905450","uncharacterized LOC108905425","N-alpha-acetyltransferase 38,NatC auxiliary subunit","cytochrome c oxidase assembly factor 6 homolog","N-alpha-acetyltransferase 30A","ESF1 homolog","atypical kinase COQ8B,mitochondrial","calphotin-like",gene_annotation = c("refseq","novel","refseq","novel"),ref_transcript_id = 
c("XR_001964310.2","XR_001964308.1","XM_018710327.1","XM_018710334.2","XM_018710330.2","XM_018710328.1","XM_018710333.1","XM_018710332.1","XM_018710331.1","XM_018708179.2","XM_018708228.2","XM_018708229.2","XM_018708292.1","XM_023457437.1","XM_018708299.1","XM_018708180.1","XM_018708231.1","XM_018708208.2","XM_023453940.1","XM_018708235.2","XM_018708321.1","XM_018708319.1","XM_018708318.1","XM_018708263.1","XM_018708266.1","XM_018708267.1","XM_018708265.1","XM_018708181.2","XM_018708205.1","XM_018708226.1",qry_transcript_id = structure(c(1L,32L,33L,34L,35L,36L,37L,38L,39L,40L,41L,43L,44L,45L,42L,46L,49L,47L,48L,50L,51L,52L,53L,54L),.Label = c("TU429","TU430","TU431","TU435","TU436","TU437","TU438","TU439","TU440","TU441","TU442","TU443","TU444","TU445","TU446","TU447","TU448","TU449","TU450","TU451","TU452","TU453","TU454","TU455","TU456","TU457","TU458","TU459","TU460","TU461","TU462","TU463","TU464","TU465","TU466","TU467","TU468","TU469","TU470","TU471","TU472","TU473","TU474","TU475","TU476","TU477","TU478","TU479","TU480","TU481","TU482","TU483","TU484","TU485"),ref_transcript_name = structure(c(30L,1L,.Label = c("atypical kinase COQ8B,"cytochrome c oxidase assembly factor 6 homolog,transcript variant X1","homeodomain-interacting protein kinase 2,transcript variant X11",transcript variant X8","nuclear transcription factor Y subunit alpha,transcript variant X2","uncharacterized LOC108905377,"uncharacterized LOC108905425,"uncharacterized LOC108905450,"uncharacterized LOC108905452,"uncharacterized LOC108906895,transcript variant X2"),transcript_annotation = c("refseq",class_code = structure(c(1L,5L),.Label = c("=","c","j","k","u"),class = "factor")),row.names = 432:485,class = "data.frame")
dput(df_b)
structure(list(qry_transcript_id = structure(1:10,.Label = c("TU118","TU151","TU255","TU417","TU485","TU543","TU687","TU807"),ref_transcript_name = structure(c(8L,3L),.Label = c("apoptosis-stimulating of p53 protein 1 isoform X3","basic proline-rich protein-like","microtubule-associated protein 10-like","protein dopey homolog PFC0245c-like","protein sprint isoform X2","serine/arginine repetitive matrix protein 2-like","tigger transposable element-derived protein 1-like","uncharacterized protein LOC108904829"),ref_transcript_id = c("XP_018563024","XP_023014054","XP_019880584","XP_018578361","XP_024947529","XP_024947524","XP_030753146","XP_018575004","XP_023028347"
),ref_gene_name = c("* uncharacterized protein LOC108904829","* apoptosis-stimulating of p53 protein 1 ","* basic proline-rich protein-like","* tigger transposable element-derived protein 1-like","* protein dopey homolog PFC0245c-like","* protein sprint ","* serine/arginine repetitive matrix protein 2-like","* microtubule-associated protein 10-like"
)),row.names = c(NA,10L),class = "data.frame")

Hopefully my subsetting does not cause problems here, but in the full dataset all of the qry_transcript_id values in df_b are contained in df_a.

Solution

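left_join keeps every row of the first data frame. When no by argument is given, it joins on all columns the two data frames have in common. If every column of df_b is also in df_a, there is nothing new to add, so the call effectively does nothing, as you saw in your first attempt: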

df_c <- left_join(df_a,df_b)
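
To make that concrete, here is a minimal sketch using two hypothetical toy data frames (a and b are illustrative names, not the data from the question). It assumes dplyr is installed. Because every column of b also exists in a, the join key becomes all of b's columns and no new column can be added.

```r
library(dplyr)

# Hypothetical toy data: every column of b also exists in a.
a <- data.frame(id    = c("t1", "t2", "t3"),
                name  = c("alpha", "beta", "gamma"),
                extra = c(10, 20, 30),
                stringsAsFactors = FALSE)
b <- data.frame(id   = c("t1", "t3"),
                name = c("alpha", NA),
                stringsAsFactors = FALSE)

# No `by`: dplyr joins on all shared columns (here id and name).
res <- left_join(a, b)

# Every row and column of a is kept; nothing from b is added.
print(res)
```

The NA in b never reaches the result, because a column used as a join key is only matched against, not merged.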

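In your second example, by contrast, the join is on "qry_transcript_id" alone. Every other shared column is then treated as distinct from its namesake in df_a, so both copies are kept: the df_a columns get the suffix ".x" and the df_b columns get the suffix ".y".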

df_c <- left_join(df_a,df_b,by = "qry_transcript_id")
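
A minimal sketch of that behaviour, again with hypothetical toy data frames rather than the data from the question: joining on a single key leaves the other shared column duplicated, and dplyr's default suffixes mark which input each copy came from.

```r
library(dplyr)

# Hypothetical toy data: both inputs share the columns id and name.
a <- data.frame(id = c("t1", "t2"), name = c("alpha", "beta"),
                stringsAsFactors = FALSE)
b <- data.frame(id = c("t1", "t2"), name = c(NA, "old beta"),
                stringsAsFactors = FALSE)

# Join on id only: name is not a key, so both copies survive.
res <- left_join(a, b, by = "id")

# name.x comes from a (first argument), name.y from b (second argument).
print(names(res))
```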

It sounds like what you want may be an inner_join.
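
For comparison, a small sketch of how the two joins differ, using hypothetical toy data (the names a, b, value_a and value_b are illustrative only): left_join keeps all rows of the first data frame, while inner_join keeps only the rows whose key appears in both.

```r
library(dplyr)

# Hypothetical toy data: key t2 exists only in a.
a <- data.frame(id = c("t1", "t2", "t3"), value_a = 1:3,
                stringsAsFactors = FALSE)
b <- data.frame(id = c("t1", "t3"), value_b = c("x", "y"),
                stringsAsFactors = FALSE)

left  <- left_join(a, b, by = "id")   # keeps t2, with value_b set to NA
inner <- inner_join(a, b, by = "id")  # drops t2 because b has no match

print(left)
print(inner)
```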


You can use mutate and coalesce together with left_join to get the merge you want. Try the following example.

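library(dplyr)
# Vectors restored to match the output below; Y1's middle values are assumed.
x <- data.frame(Id = c("A","B","C","C","E"), X1 = c(1L,3L,5L,7L,NA),
                XY = c("x2","x4","x6","x8",NA), XZ = c("x2",NA,NA,"x8","x10"),
                stringsAsFactors = FALSE)

y <- data.frame(Id = c("A","B","B","D","E"), Y1 = c(1L,3L,5L,7L,9L),
                XY = c("y1","y3","y5","y7","y9"), XZ = c("y1","y3","y5","y7","y9"),
                stringsAsFactors = FALSE)

# Join on Id, then prefer the x value and fall back to y where x is NA.
aa <- x %>% left_join(y, by = "Id") %>%
  mutate(XY = coalesce(XY.x, XY.y), XZ = coalesce(XZ.x, XZ.y)) %>%
  select(Id, X1, XY, XZ)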

> aa 
  Id X1 XY   XZ
1  A  1 x2   x2
2  B  3 x4   y3
3  B  3 x4   y5
4  C  5 x6 <NA>
5  C  7 x8   x8
6  E NA y9  x10
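
Applied to the question, the same pattern might look like the sketch below. The two data frames here are hypothetical miniature stand-ins that only borrow the question's column names, and their values are invented for illustration. Starting the join from df_b is an assumption: it keeps only df_b's rows, which seems to match the stated goal of filling df_b's missing values from df_a.

```r
library(dplyr)

# Hypothetical stand-ins for the question's data (values are invented).
df_a <- data.frame(qry_transcript_id = c("TU1", "TU2", "TU3"),
                   ref_gene_name     = c("gene one", "gene two", "gene three"),
                   ref_transcript_id = c("XM_1", "XM_2", "XM_3"),
                   class_code        = c("=", "j", "u"),
                   stringsAsFactors = FALSE)
df_b <- data.frame(qry_transcript_id = c("TU1", "TU3"),
                   ref_gene_name     = c(NA, "old three"),
                   ref_transcript_id = c("XP_1", NA),
                   stringsAsFactors = FALSE)

df_c <- df_b %>%
  left_join(df_a, by = "qry_transcript_id") %>%
  # .x columns come from df_b, .y columns from df_a;
  # coalesce keeps the df_b value and fills only where it is NA.
  mutate(ref_gene_name     = coalesce(ref_gene_name.x, ref_gene_name.y),
         ref_transcript_id = coalesce(ref_transcript_id.x, ref_transcript_id.y)) %>%
  select(qry_transcript_id, ref_gene_name, ref_transcript_id)

print(df_c)
```

If the df_a values should always win rather than only fill gaps, swapping the two arguments inside each coalesce call would give that behaviour instead.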
