2023. 1. 16. 05:33ㆍData science/Database
NoSQL (Not only SQL, Non-SQL)
Categories of NoSQL Systems
Key-Value based NoSQL: Riak, Redis, Scalaris, Amazon DynamoDB
Document-based NoSQL: MongoDB, CouchDB, OrientDB
Column-base NoSQL: Google BigTable, Cassandra, Hbase, Amazon DynamoDB,
Graph-based NoSQL: Neo4J, InfoGrid, FlockDB
ACID Principle: RDBMS Gold Standard!
- Atomicity: Transactions are completely done or not done at all.
- Consistency: Only valid data to be stored(before and after the transaction, data consistent)
- Isolation: Transactions do not affect each other(without interference)
- Durability: Written data will not be lost
The CAP Theorem
Any distributed data store can provide only two of the following three guarantees:
The BASE Property
Basically Available: DDB is available following CAP Theorem,
Soft state: system's state may change even without transaction execution
Eventual constancy: the system will become consistent eventually.
-> Distributed database, weaker consistency is often acceptable while guaranteeing availability and partition tolerance, where the database will eventually be consistent.
-> Hence, BASE property(NoSQL) is used instead of ACID policy(RDBMS) in a centralized database.
MongoDB
: general purpose, document-based, distributed database for modern application developers for the cloud era
: licensed under the server-side public license(SSPL)
: instead of table joining, $lookup can be used.
: Database Name(case-sensitive), Size(64 bites)
: All documents must be smaller than 16MB
: JSON documents in binary-encoded format(BSON) behind - BSON uses field-value pairs
: type-sensitive and case-sensitive
: each document required _id (unique) - primary key of the collection, objectID(by default), automatically generated
MongoDB CRUD Operations(Create, Read, Update, Delete)
Basic Comparison operators
: $eq, $gt, $gte, $in, $lt(less than), $lte(less than or equal), $ne(not equal), $nin(Non in an array)
Create Operations
: insertOne, insertMany, insert (<document>)
db.product.insertOne({"category": "fruit", "name":"Apple from Uleumgol"});
db.product.insertMany({"category": "fruit", "name":"Apple from Uleumgol"},{"category": "fruit", "name":"Banana from Ulsan"} );
Read Operations
: find( <cond> ), pyears greater than or equal to 3 and age less than or equal to 50.
db.product.find({"pyears":{$gte:3}, {"age": {$lte:50}});
Update Operations
: updateMany(<filter>, <updateAction> , <options>);
IF pyear is greater than or equal to 3<condition-filter>,
THEN set status.finished as 1 and status.ongoing as 0<Update action>.
*$set: sets the value of a field; if the field does not exist, it will be created, and the type can also be changed.
*$unset: remove the field
*$inc: increase the number; if it does not exist, create a field and assign 1.
*$push: add elements(array) if it does not exist(array)
*$addToSet: add elements to the array, but prevent the duplication of array elements.
*$pull: removing element from array
*$pop: removing by order, {$pop:{"key":1}} -> element at the end of array to be removed
*{"upsert": true}: if no document is found, a new document is added. Otherwise, updated.
db.product.updateMany({"pyear": {$gte:3}},
{$set: "status.finished":1, "status.ongoing":0}}
);
Delete Operations
: deleteMany(<filter>, <options>);
db.product.deleteMany(
{"finished": 1, "ongoing":0}
);
Aggregation Framework
$project: reshaping operations
$unwind: split the array into the separated documents with each element
$max, $sum, $avg, etc.
$group: similar to GROUP BY in SQL
Aggregations allow you the combine various commands to query and get various outcomes.
Index
: shortcut and looks at an ordered list with references to the content.
: Without the index, the query does a "collection scan", which means looking through entire documents in the collection
-> Reading with an index is much faster
: BUT, writing operations (insert, updates and deletes) takes much longer - maintaining and recalculating are not free, it takes memories as well -> more details https://www.mongodb.com/developer/products/mongodb/schema-design-anti-pattern-unnecessary-indexes/
-> Thus, better not to set every field as an index. Only for the key values you needed as an index.
: Compound index can be used, and each pair will have index key value
- db.users.createIndex({"age":1, "username":1});
- db.users.find( {"age":21 }).sort({"username":1}): find age 21, and sorted by username(already sorted due to index, regardless of sort, so no need additional computing by using index)
- db.users.find({"age": {"$gte":21, "$lte":30}}): find age between 21 and 30, the username will be sorted already.
- db.users.find({"age": {"$gte":21, "$lte":30}}).sort({"username":1}): In this case, each group of age from 21 ~ 30 will have a sorted username, but the overall order is not sorted by username. Therefore, if you want to sort by the username, you need to set the index order by username and age like {"username":1, "age":1}. Then it will ignore the age group, and the username will sort it.
Schema Design: What to Consider
- Constraints: any database or hardware limitations?
- Access patterns: identify and quantify workload
- Relationship types: which data is related in terms of the application needs
- Cardinality: Cardinalities of the relationship
Schema design patterns
- Bucket Pattern: suitable for time series data - if a certain document reaches the limit, create another document.
- Outlier Pattern: rare instances where a few queries of documents fall outside the normal pattern
- Computed pattern: instead of computing frequently, instead using total count and qty to calculate per document. When the data access pattern is read-intensive
- Extended referencing pattern: many different logical entities, and want to gather them together for a specific function
- Approximation pattern: useful when resource-expensive calculations( time, memory, CPU cycle), but not exact precision required.
Normalization vs Denormalization
In MongoDB, Embedding(Denormalization) and Referencing(Normalization).
- Embedding: storing data in a single document structure, storing in a field or array within a document
- Referencing: the relationship between data by including links or references
Embedding | Referencing |
Fast reads (One query, but slow writing) small sub-documents not change regularly eventual consistency is acceptable (difficult to keep in sync) grow by a small amount (each document takes more space) often need to perform a second query to fetch |
Fast writes (More queries, but fast) large subdocuments volatile data(Change frequently) immediate consistency is necessary grow by a large amount often exclude from the results |
Extended Reference pattern: Hybrid of embedding and referencing
- create an array of subdocuments with the frequently used info, for more info referencing to the actual document
'Data science > Database' 카테고리의 다른 글
Understanding Database 9 - Distributed Database Systems(DDB) (0) | 2023.01.16 |
---|---|
Understanding Database 8 - NoSQL + Python Basic (0) | 2023.01.16 |
Understanding database 6 - Normalization (0) | 2023.01.15 |
Understanding Database 5 - Advanced SQL Queries (Stored Procedure, Trigger, Views) (0) | 2023.01.15 |
Understanding Database 4 - Entity Relationship(ER) Model (0) | 2023.01.15 |