COMP5338: Advanced Data Models


School of Computer Science
COMP5338: Advanced Data Models 2.Sem./2019
Project: NoSQL Schema Design and Query Workload Implementation
Group Work: 20% 20.09.2019
1 Introduction
In this assignment, you will demonstrate that you can work with both MongoDB
and Neo4j by designing suitable schemas and writing practical queries. You will also
demonstrate that you understand the strengths and weaknesses of each system with respect to
certain query workload features. You will be given a real-world data set from the Question and
Answer domain and a set of target queries.
Your tasks are summarized as follows:
• Design a schema for each storage system based on the data and query features;
• Implement all queries in each system;
• For each system, select two queries and provide an alternative implementation of each; compare
the execution performance of the two implementations;
• Document the schema design, query design and performance analysis in a report.
2 Data set
The data that you will use is the latest dump (publication date: 2018-06-05) of the Artificial
Intelligence Stack Exchange question and answer site (https://ai.stackexchange.com/).
The dump is released and maintained by Stack Exchange: https://archive.org/details/stackexchange.
The original dump contains many files in XML format. The assignment
uses a subset of the data stored in five CSV files. The data files and the description
(readme.txt) can be downloaded from Canvas.
The assignment data set contains the following files:
• Posts.csv stores information about posts; each row represents a post, which could be
a question or an answer.
• Users.csv stores user profiles; each row represents a user. A user can be the author
of a post or an answer.
• Comments.csv stores comment metadata; each row represents a comment, which can
be made on a question or an answer, identified by the PostId.
• Votes.csv stores detailed vote information about posts; each row represents a vote,
including the vote type, the date the vote was made and some other information.
• Tags.csv contains a summary of tag usage on this site.
Two concepts that will appear in many query descriptions are Topic and User.
• Topic: Each question may belong to a few topics. The topic(s) of a question are
recorded as a list of keywords in the Tags column in Posts.csv. The answers and
comments that belong to a question have the same topic(s) as that question.
• User: Questions, answers and comments are all made by registered users. Users are
identified by the UserId field in various CSV files. Some users have been removed for various
reasons. The removed users no longer have an Id and should be ignored in all queries.
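A minimal preprocessing sketch for the two concepts above, in Python. It assumes that the Tags column keeps the Stack Exchange dump encoding `<tag1><tag2>` and that a removed user shows up as an empty id cell; both are assumptions, and readme.txt remains the authority on actual column names and formats. The helper names `parse_tags` and `user_id` and the sample rows are illustrative only.

```python
import csv
import io
import re

# Assumed tag encoding: "<neural-networks><deep-learning>" (check readme.txt).
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def parse_tags(raw):
    """Split a Tags cell into a list of topic keywords; empty cell -> []."""
    return TAG_PATTERN.findall(raw or "")


def user_id(row, field="OwnerUserId"):
    """Return the user id as int, or None when the user has been removed
    (assumed here to appear as an empty cell)."""
    value = (row.get(field) or "").strip()
    return int(value) if value else None


# Tiny hand-made sample standing in for Posts.csv (not real data).
SAMPLE = io.StringIO(
    "Id,PostTypeId,OwnerUserId,Tags\n"
    "1,1,10,<neural-networks><deep-learning>\n"
    "2,1,,<philosophy>\n"
    "3,2,11,\n"
)

posts = [
    {**row, "Tags": parse_tags(row["Tags"]), "OwnerUserId": user_id(row)}
    for row in csv.DictReader(SAMPLE)
]

for post in posts:
    print(post["Id"], post["OwnerUserId"], post["Tags"])
```

The post written by a removed user is kept with its author set to None, so it can still count towards post-based measures while user-based queries skip it.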
3 Target Queries
• [Q1] Find the question that attracts the most discussion in a given topic. We measure
the intensity of discussion by the total number of answers and comments in a question.
• [Q2] Find the user with the highest UpVote number in a given topic; return the user’s
name and UpVote number. Any user who has posted a question, an answer or a
comment in this topic is a candidate user.
• [Q3] For a given topic, discover the questions that are hardest to answer. Here we
measure the difficulty of a question by the time it takes to receive an accepted answer.
Questions that do not have an accepted answer will be ignored.
• [Q4] Discover questions with an arguable accepted answer. Users can give an upVote to both
questions and answers. Usually the accepted answer of a question receives the highest
number of upVotes among all answers to this question. In rare cases, another answer
may receive a higher upVote count than the accepted answer. In
this query, you are asked to discover such questions, whose accepted answer has a lower
upVote count than other answer(s) of the same question. Note: we are only interested in
questions with at least 5 answers.
• [Q5] Given a time period indicated by a starting and an ending date, find the top 5
topics in that period. We rank a topic by the number of users who participated in that
topic during the period. Posting a question, answering and commenting are all considered
participation.
• [Q6] Find the top 5 co-authors of a given user. Consider all users involved in a
question as co-authors. This includes users posting the question, answering the question
or commenting on either the question or its answers. For a given user, we rank the
co-authors by the number of questions in which this user and the co-author appear together.
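As an illustration of how one target query might be phrased on the database side, the sketch below expresses Q1 for both systems from Python. Everything about the schema is hypothetical: a MongoDB `questions` collection with a `tags` array and precomputed `answerCount` and `commentCount` fields, and a Neo4j graph with `Question`, `Topic`, `Answer` and `Comment` nodes linked by `TAGGED`, `ANSWERS` and `COMMENTS_ON` relationships. It also assumes that comments on answers count towards the discussion, which the Q1 wording leaves open.

```python
from pprint import pprint


def q1_pipeline(topic):
    """MongoDB aggregation pipeline for Q1 over a hypothetical `questions`
    collection; matching, sorting and limiting all run on the server."""
    return [
        # Equality on an array field matches documents containing the value.
        {"$match": {"tags": topic}},
        {"$addFields": {
            "discussion": {"$add": ["$answerCount", "$commentCount"]}
        }},
        # _id is only a tie-breaker to make the result deterministic.
        {"$sort": {"discussion": -1, "_id": 1}},
        {"$limit": 1},
        {"$project": {"title": 1, "discussion": 1}},
    ]


# Cypher counterpart over the hypothetical graph model. Paths of length 1
# reach answers and comments on the question; paths of length 2 reach
# comments on its answers.
Q1_CYPHER = """
MATCH (q:Question)-[:TAGGED]->(:Topic {name: $topic})
OPTIONAL MATCH (q)<-[:ANSWERS|COMMENTS_ON*1..2]-(n)
RETURN q.id AS questionId, count(DISTINCT n) AS discussion
ORDER BY discussion DESC
LIMIT 1
"""

pprint(q1_pipeline("neural-networks"))
print(Q1_CYPHER)
```

With PyMongo the pipeline would be passed as `db.questions.aggregate(q1_pipeline(topic))`, and with the Neo4j Python driver the query would be run as `session.run(Q1_CYPHER, topic=topic)`. Precomputing the two counters trades extra work at load time for a simpler query, which is one of the design choices the schema section of the report could discuss.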
4 Task Details
Your tasks include:
• Schema Design for MongoDB and Neo4j
For each storage option, design a proper schema that best supports the query and
data set features. For each schema version, make sure you utilize features of the storage
system such as indexing, aggregation, ordering, filtering and so on.
The original data set follows a relational structure. It may contain data that are not
useful or not involved in the queries. During schema design, you may discard data that
are not needed. You may duplicate original data following the schema design.
• Query Design and Implementation
Load the data set (after any necessary preprocessing) into both systems and set up
proper indexes that will be used by the target queries. Design and implement all queries
in each system. You may implement a query using shell commands (e.g. MongoDB
shell or Cypher query) alone, as a combination of JavaScript and shell commands in the
case of MongoDB, or as a Python/Java program. If a programming language is
used, make sure that you do the majority of the processing on the database side. Client-side
processing should be restricted to activities like collecting the output of a previous
database query and sending it as is to the subsequent one. In particular, you
should avoid sorting, filtering and grouping query output on the client side.
• Performance Analysis
For each storage option (MongoDB and Neo4j), pick two queries as the performance
analysis target queries. Design a different implementation for each query. Then collect
execution statistics of each implementation and make a side-by-side comparison.
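One possible way to build the side-by-side comparison for MongoDB is sketched below. It assumes each implementation is a `find` command explained with `executionStats` verbosity, whose output contains `nReturned`, `executionTimeMillis`, `totalKeysExamined` and `totalDocsExamined`; aggregation pipelines report their statistics in a different layout that varies by server version. The functions `summarize` and `as_table` are illustrative names, and the numbers are placeholder zeros, not measurements.

```python
def summarize(label, explain_doc):
    """Pick the headline counters out of one explain document."""
    stats = explain_doc["executionStats"]
    return {
        "implementation": label,
        "nReturned": stats["nReturned"],
        "timeMs": stats["executionTimeMillis"],
        "keysExamined": stats["totalKeysExamined"],
        "docsExamined": stats["totalDocsExamined"],
    }


def as_table(rows):
    """Render a list of summary dicts as a fixed-width text table."""
    headers = list(rows[0])
    widths = [
        max(len(h), *(len(str(r[h])) for r in rows)) for h in headers
    ]
    lines = ["  ".join(h.ljust(w) for h, w in zip(headers, widths))]
    for r in rows:
        lines.append(
            "  ".join(str(r[h]).ljust(w) for h, w in zip(headers, widths))
        )
    return "\n".join(lines)


# Placeholder zeros; replace with the documents returned by your own runs,
# e.g. from PyMongo:
#   db.command("explain", {"find": "questions", "filter": {...}},
#              verbosity="executionStats")
PLACEHOLDER = {"executionStats": {
    "nReturned": 0, "executionTimeMillis": 0,
    "totalKeysExamined": 0, "totalDocsExamined": 0,
}}

rows = [
    summarize("implementation A", PLACEHOLDER),
    summarize("implementation B", PLACEHOLDER),
]
print(as_table(rows))
```

For Neo4j, prefixing a Cypher query with `PROFILE` returns the executed plan together with rows and database hits per operator, which can be tabulated in the same way.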
Deliverables and Submission Guidelines
This is a group project; each group can have up to 2 students. Each group needs to produce
the following:
• A Written Report.
The report should contain five sections. The first section is a brief introduction to
the project. Sections two and three should each cover one storage option. Section four
should provide a summary and brief comparison of the two storage systems. Section
five should be an appendix of sample results.
No marks are allocated to section one. It is included to make the report complete,
so please keep it short.
Sections two and three should each contain the following three subsections:
– Schema Design
In this section, describe the schema with respect to the particular system. Your
description should include information at the “table” and “column” level as well as
possible primary keys/row keys and secondary indexes. You should show sample
data based on the schema. For instance, you may show sample documents of
each MongoDB collection, or a sample property graph involving all node types and
relationship types for Neo4j.
– Query Design
In this section, describe the implementation of each query. You should include the
entire command and/or code. For each query, briefly explain its behaviour.
Examples of MongoDB query descriptions can be found in the week 2 and week 3 lab
instructions.
– Performance Analysis
In this section, list the two queries you have chosen for performance analysis.
For each query, include the entire command and/or code for each implementation.
Show the execution statistics in tabular format or as screenshots. Give a brief
comparison highlighting the important execution differences.
In section four, briefly compare the two storage systems with respect to ease of use,
query design and schema differences. You can also describe problems encountered in
schema design or query design.
In section five, document the sample query results as well as the respective argument(s)
you used for queries that take arguments. This would include: a sample “topic” in Q1,
Q2, Q3; a sample period in Q5; a sample userId in Q6.
• System Demo
Each group will demo in the week 10 lab. You can run the demo on your own machine, on
a lab machine or on a cloud server. Please make sure you prepare the data before
the demo. The marker does not need to see your data loading steps. The marker will
ask you to run a few randomly selected queries to get an overview of the data model
and query design. All members of the group are required to attend the demo. The
marker will ask each member a few questions to establish their respective contribution
to the project. Members in the same group may get different marks depending on their
individual contributions.
• Source Code/Script and soft copy of report submission
There will be different links for script (zip file) and report (PDF file) submission to
facilitate plagiarism detection. The script submission should be a zip file (no rar, 7z)
including the following:
– query script or program code for each option;
– data loading script;
– a Readme document explaining how to run the data loading script and the target queries.
The instructions should be detailed enough for the markers to quickly prepare the
data and run the queries. For instance, you should indicate where and how
run-time arguments are supplied. If you use special features only available in a
particular version or environment, indicate that as well.
Remember, only scripts or source code and the Readme file should be included. There will
be a penalty for including data files in the submission.
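If the queries are driven from a Python program, one way to make the run-time arguments explicit (and easy to document in the Readme) is a small command-line interface. The sketch below is only an illustration: the option names `--topic`, `--start`, `--end` and `--user-id` and the mapping of queries to arguments are assumptions based on the query descriptions, not a prescribed interface.

```python
import argparse
from datetime import date

# Which arguments each query needs (Q4 takes none).
REQUIRED = {
    "q1": ["topic"], "q2": ["topic"], "q3": ["topic"],
    "q4": [], "q5": ["start", "end"], "q6": ["user_id"],
}


def build_parser():
    parser = argparse.ArgumentParser(description="Run one target query.")
    parser.add_argument("query", choices=sorted(REQUIRED))
    parser.add_argument("--topic", help="topic keyword for Q1, Q2, Q3")
    parser.add_argument("--start", type=date.fromisoformat,
                        help="start date for Q5 (YYYY-MM-DD)")
    parser.add_argument("--end", type=date.fromisoformat,
                        help="end date for Q5 (YYYY-MM-DD)")
    parser.add_argument("--user-id", type=int, help="user id for Q6")
    return parser


def parse_args(argv):
    """Parse argv and exit with a usage message if an argument that the
    chosen query needs is missing."""
    parser = build_parser()
    args = parser.parse_args(argv)
    missing = [n for n in REQUIRED[args.query] if getattr(args, n) is None]
    if missing:
        parser.error(f"{args.query} requires: {', '.join(missing)}")
    return args


# Example invocation with a made-up period.
demo = parse_args(["q5", "--start", "2018-01-01", "--end", "2018-03-31"])
print(demo.query, demo.start, demo.end)
```

The Readme can then list one command line per query, which also records the sample arguments used for the appendix.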
